So you have this interesting concept, idea or burning question that needs to be answered. You can get your hands on the data readily enough, and when you do it’s really a bit of a mess… No fields really correlate, there’s some typo’s and it’s not really in a format to ‘mash’ together with anything else logically… Maybe you can engage your IT teams to help you ETL it into the Data Warehouse… maybe your own SQL sills aren’t what they used to be. Or it might not be valuable enough to warrant your specialist teams time…or it might be, you just won’t know unless you dig a little deeper and discover those nuggets of information… So how do I get my data sets workable for a quick analysis?
Enter Project Maestro. Maestro is designed to help you with your Data Prep in a user friendly, drag and drop, non ‘super-technical’ way for analysis within Tableau, so grab a copy of the Beta now!
…and yes note that as of December 2017, it’s in Beta. That means everything you see will change, and since we’re at Beta 1 a lot will change between now and it’s release date. As a result, be mindful of how you’re using it, plan on implementing it, supporting it etc…
Before I go on, you really need to check out the demo at the Tableau Conference 2017 by Anushka Anand to get an overview. Anushka’s session starts at around ~54:00 minutes in.
Some key features of Maestro:
- It’s visual so you can get immediate feedback on the results
- It has the look and feel of Tableau so it’ll feel just like building a Tableau Viz.
- It’s tightly integrated with Tableau so you can keep in the flow of analyzing your data.
So where were we…
A friend and I were musing over the crazy property prices in Melbourne and what it would take to afford to purchase something inner city. He joked, “…you’d have to resort to a life of crime to afford one…” and “…they must all be criminals living in the inner city...” 🙂 So naturally this led to:
- Which inner city suburb has the highest crime?
- Whats the most common crime or offence types in the more affluent inner city suburbs?
Where do we start? Well, the data wasn’t all that hard to find. Firstly, property prices that were sold within the last year or so is readily available from Kaggle.Com and secondly, what better site to source crime data for Victoria than the Crime Statistics Agency of Victoria at crimestats.vic.gov.au . So I got to downloading the respective .CSV’s…
As you can imagine, having data from two totally separate data sources means the data doesn’t really match, but we have something, suburb names of inner-city Melbourne. So we need to do a bit of data prep.
So after starting Maestro, the first data set I imported was the Melbourne Housing Data from Kaggle and then the Crimestats Data by opening and connecting to the CSV files:
Maestro samples the data and gives you an instant snapshot of your data set.
I want to look at relatively recent data only as I don’t have a complete set for 2017. So last year’s (2016) will do just fine. We have Date fields in both data sets so this is easy to filter in Maestro – simply choose the “Melbourne_housing” data, right click the Date field and choose “Filter Field”.
Maestro has a calculations field similar to Tableau’s to keep you in your flow, and in the example below I’m only looking for the Year 2016 in my house pricing data. Build a boolean calculation to capture just the 2016 Year.
A filter is now applied to only look at 2016 data:
Choose “Add Step”
Maestro then gives me a visual snapshot of my data with the results…in real time…
To compare the same years (i.e. 2016) we now need to adjust our Offences/Crime data set to reflect 2016.
Here, I’m going to Add a Step and then perform some filtering on that field. Basically i’Il ask Maestro to change the format to Date and then Keep Only 2016 data.
So lets “Add Step”
..and then convert the numerical values from ‘2,016’ to a Date and then “Keep Only” that Year (2016).
So now how do I correlate the two data sets?
The common field between the two data sets is the Suburb field. So lets get Maestro to Join on the two fields. You can simply drag the connector to the “New Join” function:
…however after the join you’ll note that nothing shows up as per below.
The reason is that Maestro has chosen the Record_ID field as the default Join clause. This won’t work for our situation, so change that to Suburb.
Lets change the “Applied Join Clause” to Suburb and perform a Left-Inner Join. I want to bring in everything from the Melbourne Housing Data and only the suburbs that match from the Crime data:
Note the Join Clause is now Suburb.
The summary of join results shows we have some data excluded. It’s worth checking what they are before we continue – it might be nothing important, but it might be something we need…
Maestro gives you real time feedback on the excluded items. So by clicking on the Excluded entries, the display updates to show what’s been excluded in the join.
Note the Join Result is ~400,000 rows! This is expected as the join is on Suburb and I have a data set that has 161,000 rows that needs to correlate with each other in a meaningful way (i.e. the join). No real problem for Maestro and Tableau of course, but something you need to be mindful of as you analyse your data – especially if you’re working with very large data sets. We’ll deal with this a little later.
A quick glance through the Suburb data shows that these are mainly country Victoria suburbs, and this is okay as I’m only interested in the inner city suburbs – so no real concern for us and certainly not a show-stopper.
By choosing the Unmatched Rows in the Included field (i.e. the 8,406 value), I can see that some fields didn’t match – this is a good area to start looking for anomalies in the data, like below…
This one looks pretty straight forward. It’s safe to assume that ‘WEST FOOTSCRAY’ is recorded as ‘FOOTSCRAY WEST’ in the other data set… so lets change that.
…and we have a Match! Note how the item disappears as it is now not ‘unmatched’. Maestro allows us to edit the values directly! Very cool…
The ‘IVANHOE EAST’ Suburb does not show up in the Offences data set, whereas IVANHOE does. So for our scenario and the following analysis it’s safe to combine that with ‘IVANHOE’. Simply edit the value and change ‘IVANHOE EAST’ to ‘IVANHOE’ – and we get a match.
So from here, we can do a quick check of the data and see how it looks. Simply right click the Join and choose “Open Sample in Tableau”. This will create an extract of the data at the current point in the workflow so you can take a quick look in Tableau – Very cool…
Now as mentioned before, I have nearly half a million rows from a relativley small subset of data. Since we have thousands of addresses as well as different combinations of offences, offence types etc, the join process needs to correlate all of these together. I won’t go into too much detail here, but sometimes this can be useful and sometimes unavoidable especially when working with disparate and unrelated data sources. However, with Tableau and also with Maestro we can manage this… Click “Add Step” and choose “Aggregate”:
Since for this analysis we don’t need to bring in every piece of data (i.e. address level detail), we can exclude some fields such as the Addresses. To help us overcome the issue with all the rows.
So Lets just bring in:
- Suburb, Price, Offence Count, Record-ID, Offence Division, Offence Group, Offence Subdivision and Postcode
Because we’re not bringing in every field (especially the ‘Address’ field) we have significantly less rows – we’re working with more aggregated data as opposed to the address level detail.
You can do a quick Sample check in Tableau from Maestro. Click the ‘+’ sign and choose output. Maestro gives you the option to output to a Tableau Data Extract or the new Hyper Extract format. Here i will choose Hyper (note as of December 2017, you will need the Tableau 10.5 Beta version to open Hyper Extracts).
Once saved, open your Extract in Tableau and you can then begin analysing your data and get to answering those questions much more quickly.
I have excluded Melbourne (postcode 3000) for the analysis as it skews the values (lots of offences here!) – I only wanted to look at the surrounding inner-city suburbs. So what are some of the high level insights?
- No surprise that the South Eastern Suburbs are the priciest of suburbs here in Victoria…
- For 2016, property and deception related offences were the most prevalent.
- The inner city suburbs of Preston, St. Kilda and Sunshine show the highest amount of Offences in 2016 (excluding central Melbourne), with Preston a real stand out.
- Inner city expensive suburbs (>$1.5m median) such as Kooyong, Eaglemont, Canterbury and Kew/Kew East recorded the lowest reported Offences.
- In these suburbs, the most prevalent Offences in 2016 were:
- Stealing from a Motor Vehicle
- Residential non-aggravated burglary
- Breach Family Violence Order
So, in a nutshell the Inner South Eastern Suburbs are a great place to live if you can afford it of course. Get a place with off-street parking or even better a secure garage and setup some good old security camera’s, home alarm systems and a big guard dog!