Part IIb – Market Segmentation using your Customer Data

This is the second half of my previous article on market segmentation.

So, we’ve used a CHAID model to figure out which variables in our input set are predictive of behaviour. And at the end of the previous post I described a way of outputting the simple models that CHAID produces from SPSS.

However, an alternative approach is to try to find naturally occurring clusters in the variables that you want to use. NB: If you have some data about your customers (or the market generally, from a survey) but no behavioural outcome data (either because it’s survey data, or because you’re just starting up your new service), then you can still use the approach described below to find clusters – something you can’t do with the CHAID approach. The shortcoming is that you’ll be including variables which may not be very interesting and may even be misleading. For example, if you were offering an online email management system, you might have some survey data about your customers covering things like industry, number of employees, location, etc. There might be some nice natural clusters in this data (e.g. that oil companies tend to be large!), but that doesn’t mean that industry makes any difference to whether a company needs an online email management system – it’s an interesting variable, but not necessarily relevant if you want to create predictive and useful clusters.

Still, that aside, I’ll get on with the process. There’s lots of info out there on Kohonen Maps, or “Self-Organizing Maps”, including:

  1. A Wikipedia entry: http://en.wikipedia.org/wiki/Self-organizing_map
  2. A nice tutorial here: http://www.ai-junkie.com/ann/som/som1.html

Essentially, it’s a way of clustering data that also reduces the dimensionality of that dataset. What that means is that if you have a list of customers with a large number of interesting variables (in the example in the previous article, we had 8, from LoanAmount to Age), then it allows you to represent those 8 dimensions in, say, 2 dimensions, on a grid. The other advantage it has is that clusters that sit near each other on the grid are more similar to each other than distant ones. An example taken from the second link above (the SOM tutorial) illustrates this well. The following is a SOM representing world poverty:

This has been built up from a large number of input variables (health, nutrition, educational services and so on), but the SOM allows you to represent that complex data as a 2D map, with adjacent countries (or clusters) being most similar to each other.
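To make the mechanics concrete, here’s a minimal, illustrative sketch of a Kohonen training loop in Python. This is not the app’s actual implementation – the function names and parameter defaults are my own – but it shows the two ideas that matter: each sample pulls its best-matching grid node towards it, and nearby nodes on the grid get pulled along too (which is what makes adjacent clusters similar).

```python
import math
import random

def train_som(rows, width, iters=500, lr0=0.3, seed=42):
    """Train a width x width self-organizing map on numeric rows.

    rows: list of equal-length numeric lists (ideally already scaled).
    Returns the grid of weight vectors, flattened row-major,
    so node i sits at grid position (i % width, i // width).
    """
    rng = random.Random(seed)
    dim = len(rows[0])
    # Initialise each node's weights to a copy of a random training row.
    nodes = [list(rng.choice(rows)) for _ in range(width * width)]
    radius0 = width / 2.0
    for t in range(iters):
        frac = t / iters
        lr = lr0 * (1.0 - frac)                     # learning rate decays
        radius = max(radius0 * (1.0 - frac), 0.5)   # neighbourhood shrinks
        row = rng.choice(rows)
        # Best-matching unit: the node closest to the sample.
        bmu = min(range(len(nodes)),
                  key=lambda i: sum((a - b) ** 2
                                    for a, b in zip(nodes[i], row)))
        bx, by = bmu % width, bmu // width
        for i, node in enumerate(nodes):
            gx, gy = i % width, i // width
            d2 = (gx - bx) ** 2 + (gy - by) ** 2
            if d2 <= radius ** 2:
                # Nodes near the BMU on the grid move too, less strongly.
                influence = math.exp(-d2 / (2 * radius ** 2))
                for k in range(dim):
                    node[k] += lr * influence * (row[k] - node[k])
    return nodes

def assign_cluster(nodes, row):
    """Return the index of the node (cluster) closest to a row."""
    return min(range(len(nodes)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(nodes[i], row)))
```

Note that `width * width` nodes are created, which is why (as described in step 4 below) the final number of clusters is the Width setting squared.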

To try and create something like this with our customer data (from the previous post), I’ve put together an extremely basic app – below – to run the Kohonen clustering algorithm on a simple dataset (the same set as before, included here again):

Download app from Dropbox

Once downloaded, unzip the folder and place the app (just a simple .exe) and the data file somewhere on your machine, ready for use.

The following explains how to use this app to create Kohonen clusters for the example dataset (and therefore your own). Note however that this app is very flaky – it was written for my own use, and will crash if you click the wrong button, if your data isn’t in quite the right format, if you haven’t got the appropriate prerequisites on your machine, or even if there’s an “R” in the month. I’ll make more fixes to the code as needed, but if anyone wants to take the code for their own use, it can be found at:

https://github.com/benjrees/FPMClustering

Anyway, back to the process:

1. Load the app

2. Select your input data file (in this example, the downloaded example dataset)

Click where shown, then select the downloaded CSV file:

As can be seen, this amends the SQL query shown to insert the name of the selected file. The purpose of the SELECT query is to let you easily select subsets of the input data (with a WHERE clause) without having to manually split up data files.
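As a rough analogue of what that query is doing (the app’s own SQL dialect and table naming may differ – this sketch just uses Python’s built-in sqlite3, with column names borrowed from the example dataset):

```python
import sqlite3

# Load the rows into an in-memory table, then pull out a subset with a
# WHERE clause instead of splitting the CSV file by hand.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (LoanAmount REAL, Age INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(5000, 25), (20000, 41), (60000, 33)])
# Only customers with larger loans make it into the clustering run.
subset = conn.execute(
    "SELECT * FROM customers WHERE LoanAmount > 10000").fetchall()
```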

3. Choose your input columns

This is probably the worst UI I have ever had the misfortune to be involved with. But, as I said, it was for my use only. At this stage, we need to choose a subset of the loaded columns to use as inputs for the clustering process – i.e. which variables are we interested in? To do this, click the “Choose Input Columns” button to display the following, where you should select the relevant columns by checking their boxes:

NB: The drop-down box at the bottom of this dialogue is a (completely unlabelled!) option that lets you select a numerical column indicating frequency – if each row should be counted n times, use this drop-down to select the column holding n. It’s rarely needed.

When done, click OK to return to the main panel.

4. Choose an output folder for the results, and save your config.

Click the button highlighted in yellow to select and save an output folder, then click on the File menu at the top to save your config (as an XML file).

You’re now ready to run the clustering algorithm. There are, of course, a lot of other options you can adjust here, mostly parameters of the network algorithm. I’ve set the defaults for the app to “what I normally use”, so it should work in many scenarios. The one you are most likely to adjust is “Width” (second item on the right hand side), which determines the number of clusters you’ll produce. Because the algorithm produces a square grid of clusters, this value is the width of that grid, so the final number of clusters is this value squared. Here, the Width is set to 3, which will lead to 9 cluster definitions (market segments).

5. Run the clustering process

Clicking the blue Run button will start the clustering algorithm. This goes through three stages:

  • Load all of the data
  • Iterate through the algorithm, forming the clusters
  • When complete, output two files, one for the cluster definitions and one for the input dataset with the cluster definition appended (see below)

NB: The numbers at the end of the dialogue (starting “351, 666..” here) indicate the number of rows in each cluster. I use this as a quick sanity check to see if everything has run properly (when something has gone wrong with the algorithm, you often end up with very skewed clusters, perhaps with almost all rows in one group).
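That sanity check is easy to script against the output file, too. Assuming the output CSV has the appended Cluster_ID column (the helper names here are my own), something like this counts rows per cluster and flags a badly skewed run:

```python
import csv
from collections import Counter

def cluster_counts(path):
    """Count rows per cluster in an output CSV with a Cluster_ID column."""
    with open(path, newline="") as f:
        return Counter(row["Cluster_ID"] for row in csv.DictReader(f))

def looks_skewed(counts, threshold=0.9):
    """Flag the run if one cluster holds almost all of the rows."""
    total = sum(counts.values())
    return max(counts.values()) / total > threshold
```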

The two files output by the algorithm are:

  • clustersn.csv – a file containing the definitions of the new clusters (essentially just a list of the variable values that define the centroids)
  • outputn.csv – as mentioned, the input dataset repeated, but with the assigned cluster value added at the end (in a column labelled Cluster_ID)

What can be interesting at this point is to look at the clustersn.csv file and see exactly how the algorithm has decided to create the clusters. It’s often the case that certain input variables are used heavily to distinguish the different groups (with this example dataset, the different clusters have very different values for LoanAmount), while others aren’t used much at all (e.g. Age with this set).
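One quick way to eyeball which variables are doing the work is to rank them by how much the centroid values vary across clusters. This assumes you’ve parsed the clusters file into one dict of variable values per centroid (the helper name is hypothetical):

```python
def variable_spread(centroids):
    """Given centroid dicts {variable: value}, rank variables by how much
    their values vary across clusters (max - min). Variables with a wide
    spread are the ones the algorithm leaned on to separate the groups."""
    variables = centroids[0].keys()
    spread = {v: max(c[v] for c in centroids) - min(c[v] for c in centroids)
              for v in variables}
    return sorted(spread.items(), key=lambda kv: kv[1], reverse=True)
```

Note that a raw max-minus-min is scale-sensitive: if the inputs weren’t normalised before clustering, a variable measured in thousands will always look more “used” than one measured in years, so compare like with like.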

Anyway, you now have some cluster definitions that you can use to describe each customer segment. For example, if you have a cluster with a high LoanAmount, a low IncomeRatio and a younger Age, then you might call this group “Highly Stretched Young” or something similar.

This has been a very whistle-stop tour of how to use Kohonen Networks/SOMs for finding naturally occurring clusters in a set of data. There is far more to the algorithm (not only how to change the parameters for different situations, but also why you would use this method instead of any other), which I might try to cover at some future point. For now, though, we have created quantitative market segments for our customers – either with this method or with the CHAID method described in the previous post.

The next stage, in the next couple of posts, is to work out how much each segment is worth!
