I’ve been playing with Microsoft’s Azure Machine Learning Studio (https://studio.azureml.net/) this last month. We’ve been doing some analysis of a recent customer survey, and wanted to run some cluster analysis. Not rocket science, but equally, not something particularly do-able in Excel. So what should we use? SPSS? R? SAS?
Azure Machine Learning Studio (AMLS for short) is a newcomer – the name sums it up pretty well. It’s a Studio (so, not a language like R), it is for Machine Learning (not just for basic stats) and it’s cloud-based (not a downloadable). Oh, and it seems to be very cheap/free (I managed to do all my work in the free tier). But the question is – can a humble marketer grapple with it to get a real job done? Not just a “Hello World” app, but something useful?
My experience with this sort of project is that there are three things that stand in the way of getting to a useful result:
- Knowing what you’re doing. What the clustering algorithms actually mean, when you’re allowed to use them, how to interpret the results correctly, that sort of thing.
- Data-mulching. I’m not sure if this is a real phrase, but it’s the task of struggling with the data sent to you in some horrible format (Excel 97 spreadsheet with missing values, unusable column names, ambiguous answers and so on) and turning it into something neat to be used by your algorithms. In my experience 80% of the elapsed time on a project is data-mulching.
- Learning the tool. Okay, your data is clean, you know what algorithm you want to use – but you’re not there yet. Do you know how to load that data into SPSS? How to pull out a subset for training in R? How to cross-validate in SAS? You have to learn the tooling too.
Where AMLS excels is in (2) and (3). For item (1), there’s information about what the various algorithms do, plus videos and background material, but these are no substitute for a decent course in ML or Data Science! Nor would Microsoft claim them to be. They do offer some courses in this field (e.g. https://www.edx.org/course/principles-machine-learning-microsoft-dat203-2x-4), and obviously you can learn from lots of places. But if you don’t know your SVMs from your ANNs, you’re not going to learn that here.
That’s fair enough – so what about (2) and (3)? I’ve had terrible trouble with both in the past. I’ll take each in turn.
There are two real problems with data-mulching. The first is the endless idiosyncrasies of different datasets. The format (CSV, other delimited types, SPSS format, Excel formats, problems with escape characters etc. etc. ad nauseam) is only the start of it. You then have issues of missing data and the multitude of methods for dealing with it, classification of manually entered data (or free-text analysis), combining of multiple datasets (with lookups) and so on. The second problem is the iterative nature of fixing these things. You think you’ve fixed that problem with classifying a particular column into groups (say), then you realise you’ve missed something and have to do it over. The whole thing can be very time-consuming.
Microsoft have recognised this pain and provided lots of easy-to-use tools for cleaning and processing data. On the left-hand side of the studio are a variety of tasks that come out of the box, each with a series of parameters that make them easy to figure out. Here are just a few of the data manipulations you can achieve very easily:
The great thing about these is how easy they are to use. Not only is there likely to be something that works for you instantly (e.g. replacing missing data with zeroes, or average values) but then, when it doesn’t work first time round, you can just fix it and run again. You’re not re-processing data, reloading it and so on. The studio works by allowing you to easily link different tasks (like these) together, and it deals with the complexity of data formats and makes sure the right things go where they should. For example, if you want to filter your data (say, to remove invalid rows), you can apply a transformation like one of the above mid-process. But if you realise later on that “Oh, I shouldn’t have filtered some of those”, you can easily amend the transformation and just re-run. It really is very easy. And the best thing I found was that, if you do want to do something particularly hairy (for me, I wanted to filter out rows based on the sum of a number of different columns), there’s the “Apply SQL Transformation” task which lets you write SQL against your dataset very easily. Great!
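To give a flavour of the sort of SQL I mean, here is a rough sketch run against an in-memory SQLite database from Python rather than inside the studio. Everything in it is made up for illustration: the table name t1, the columns respondent_id, q1, q2 and q3, the data and the threshold of 5 are all assumptions, not my actual survey. It fills missing answers with the column average and then drops rows whose answers sum to less than the threshold.

```python
import sqlite3

# Hypothetical survey rows: (respondent_id, q1, q2, q3).
# None marks a missing answer.
rows = [
    (1, 4, 5, 3),
    (2, 0, 0, 0),      # every score is zero, so the row total is 0
    (3, 2, None, 4),   # q2 was left blank
    (4, 5, 5, 5),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t1 (respondent_id INTEGER, q1 REAL, q2 REAL, q3 REAL)"
)
conn.executemany("INSERT INTO t1 VALUES (?, ?, ?, ?)", rows)

# Inner query: replace a missing value with that column's average
# (AVG skips NULLs, and is taken over all rows before any filtering).
# Outer query: keep only rows whose three scores sum to the threshold or more.
query = """
SELECT respondent_id, q1, q2, q3
FROM (
    SELECT respondent_id,
           COALESCE(q1, (SELECT AVG(q1) FROM t1)) AS q1,
           COALESCE(q2, (SELECT AVG(q2) FROM t1)) AS q2,
           COALESCE(q3, (SELECT AVG(q3) FROM t1)) AS q3
    FROM t1
) AS filled
WHERE q1 + q2 + q3 >= ?
ORDER BY respondent_id
"""
cleaned = conn.execute(query, (5,)).fetchall()

for row in cleaned:
    print(row)
```

If the filter turns out to be too aggressive, the only thing to change is the threshold passed to the query, which is the same amend-and-re-run loop described above.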
The best thing about these being easily swappable task modules is that it makes the job of iteratively figuring out your data cleaning and mulching problems much, much quicker. The experiment that I ended up with (see below) is the end result of lots of swapping in and out of various ideas and modules. I’ve ended up using some quite simple SQL Transformations and Data Splits – but it took a while to get to that simplicity:
As can also be seen in this diagram, I’ve actually tried both the classic K-Means clustering algorithm, and some Hierarchical Clustering too. Did I have to re-create my dataset for each? No! Adding in alternative methodologies is very simple. NB: I would reiterate my point (1) however. One small issue with this sort of easy-to-use app is just how easy it is to do, say, a K-Means Clustering of a dataset. But is the input data valid? What does it actually mean when you’re doing this clustering (it’s all too easy to over-ascribe or under-ascribe meaning to clusters)? And so on. As can also be seen here though, I also managed to run an iterative process to “Test the optimal number of clusters for this dataset”. This is something which has been a real pain in other applications, but here it was as easy as linking together out-of-the-box templates and hitting Go. Really impressive.
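For anyone wondering what “testing the number of clusters” boils down to, here is a minimal sketch in plain Python. It is not how the studio does it, and the data is invented: twelve toy points in three obvious groups. It runs the standard K-Means loop (Lloyd’s algorithm) for several values of k and reports the within-cluster sum of squares for each; you then look for the k after which the number stops dropping sharply. The farthest-first starting centroids are my own assumption, chosen only so the run is repeatable.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Deterministic start: first point, then repeatedly the point
    farthest from the centroids chosen so far (an illustrative choice)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (labels, within-cluster sum of squares)."""
    centroids = farthest_first(points, k)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # converged
        centroids = updated
    labels = assign(points, centroids)
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, inertia


# Invented toy data: three tight groups of four points each.
points = [
    (1, 1), (1, 2), (2, 1), (2, 2),
    (8, 8), (8, 9), (9, 8), (9, 9),
    (1, 8), (1, 9), (2, 8), (2, 9),
]

inertia_by_k = {}
labels_by_k = {}
for k in range(1, 6):
    labels_by_k[k], inertia_by_k[k] = kmeans(points, k)
    print(k, round(inertia_by_k[k], 2))
```

The caveat from point (1) still applies: a sharp drop tells you the points group tightly, not that the groups mean anything for your customers.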
The second point is about learning the tool. This is really tough in this world – these are complex processes. It’s very difficult to create a drag-and-drop interface that allows you to cover the vast array of Machine Learning methodologies. For me this is about knowing your target audience, and catering for that audience. I’m not a full-time data scientist. I don’t have 3 months to re-learn SAS (I knew it once, in a bygone age…), and I want results quickly. But, it needs to be powerful enough to do something more than basic stats – clustering, neural nets, Bayesian algorithms, that sort of thing.
Again, I think they’ve managed this. There’s a lot of UX work that’s gone into this app – it’s pretty trivial to drag things across, and figure out what goes where. It’s not perfect; there are still a few bits that could be easier. But for someone who doesn’t want to spend weeks learning something, the learning curve was pretty shallow. R is an extremely powerful language, but there’s no way I would have got to these results as quickly.
And that last point is key as a marketing person. My objective was “I want to quickly generate solid clusters from this (pretty messy) dataset and have confidence in them”. Not “I want to change careers into being a data scientist”! When you’re trying to draw conclusions quickly, but still want the results to be robust, well presented and explainable, the tool performed extremely well. Sure, if I was working at Google on Skynet (or whatever it is they’re doing), I’d probably have something more advanced. And if I just wanted to work out the standard deviations of a few columns, AMLS is unnecessary. But for the sorts of tasks a marketing person might be interested in – clusters, segmentation, finding relevant vs irrelevant customer attributes, churn calculations – it’s great. And note that when you do want to go to the next stage, there are ways of adding in Python and R modules, which I’m sure will satisfy the more advanced data scientists out there.
Overall, great job Microsoft!