Part IIa – Market Segmentation for SaaS using CHAID

In this article, now split over two posts, I’ll go through a quantitative technique I’ve used before for creating market segments. This takes place in two stages: firstly, which characteristics of my customers lead them to behave in a certain manner and secondly, how do these characteristics form into clusters?

This post was originally going to cover the whole topic of market segmentation using CHAID and then clustering using Kohonen Networks. However, it’s getting far too long as a single article, so I’ve now split it into two parts, IIa and IIb. Unfortunately, this now breaks my model of using the Led Zeppelin albums I-IV as the images, so I’ll have to think of something else for part IIb.

The two parts that I cover can be further described as:

  1. From what I know about my customers, which variables (Age, Income and so on) are predictive of behaviour? You may know the shoe size of all your customers, but does it have any impact on their willingness to stick with your online backup service? Probably not.
  2. Given that I now know which variables are interesting, how do these naturally form into clusters? Is there a natural grouping of “Older, richer” customers or “Tech-savvy, web developers”?

To carry out any of this work, you’ll need some data about your customers, or the market. The first you can pull out of your CRM system, the second you can find from a market survey. If you have neither (you don’t have the first, because you’ve never offered your service before, and you don’t have the second because you don’t have the time or money to run a survey), then there’s a risk that any segments you create will be based too heavily on gut and “What sounds right” – and you’ll certainly struggle to size these segments in any way. I’ve written a post here about this – really what you end up with here are vague personas with no evidence that they are founded in reality.

Furthermore, if you want to carry out the first step (discovering which variables predict behaviour), then you’ll also need some historical behavioural activity for your group of customers or, possibly, survey recipients. Though note, this is in fact extremely difficult for survey recipients. If you’re interested in the question “What characteristics are indicative of whether customers will stick with my online music portal over time?” then it’s hard to get that from a survey – even if you ask people “How likely are you to stick with an online music service long term?”, you’re only measuring their attitude to that question and not actual behaviour. So really what you need is some customers who have already been using your service for a while and for whom you know how they behaved over time.

But let’s assume we do have all of this magic data! And I’ve certainly been in this situation a few times before. Below I describe the process for going through these two stages, using an example dataset, to show that it can really work. The dataset is for customers who took out a loan with an initial tie-in period. The behaviour we’re interested in is “Do they stick with the lender after the initial tie-in period has expired, or do they port their loan elsewhere?”. Translating this to a software environment, this is similar to the question “If we sell software with an initial 1 year support contract, how likely is it that that company will buy a new support contract after one year?” (i.e. that they won’t churn). Of course a full SaaS model is asking something further – “How likely is it that a customer will churn during some given period?”

Step 1 uses a standard algorithm (CHAID). There are various products that provide this functionality – I use SPSS in the example below. I’ve also just spotted on the CHAID Wikipedia page that there’s a download for users of the R statistics package – find it here. For step 2, in the next post, I use a Kohonen Network for creating the clusters. I’ve provided an application for doing this clustering job, though it’s very rudimentary! When I get time I’ll add the CHAID functionality into the app as well, so that you can carry out both stages in one go.

Find the Predictive Variables

The details of the CHAID algorithm can be found elsewhere. To all intents and purposes it’s a method that can be used to determine how good variables are at explaining some outcome (or dependent variable). E.g. does someone’s age predict their liking for a particular TV programme? Throughout this post I’ll use an example dataset that can be downloaded here:

FPM Sample Data

Unzip this download and extract the CSV file to a convenient location. The file contains 5,000 rows of example data, with a header row containing column names – a pretty standard format that can be imported into SPSS and my own app. Then follow the steps below to use CHAID in SPSS on this dataset:

Using SPSS to Create the CHAID Model

First, click to open SPSS Statistics from IBM
Once opened, find the newly downloaded sample data file, by clicking on More Files..
Select CSV files and find the downloaded file 

In the subsequent wizard, the only step where you don’t need to just click Next> is step 2. Here, make sure you tell SPSS to pick up your variable names from the first row in the file.

Once you’ve gone through the whole wizard, your data will be shown in SPSS as displayed above. It’s also probably worth saving your dataset as a standard .SAV file at this point too (under the File.. menu).
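If you don’t have SPSS to hand, the same import step (first row treated as variable names) can be sketched in Python. This is only an illustration: the three inline rows are a made-up stand-in for the downloaded file, only the column names LoanAmount, FirstPurchaseProductA and Stayed come from the dataset described in this post, and the helper name load_customers and the assumption that every column is numeric are mine.

```python
import csv
import io

# Hypothetical three-row stand-in for the downloaded CSV file.
# Only the column names are taken from the dataset described in the post.
SAMPLE = """LoanAmount,FirstPurchaseProductA,Stayed
4800,1,1
18250,0,0
9100,1,1
"""


def load_customers(handle):
    """Read rows using the first line as column names.

    Assumes every column holds a numeric value, which may not be
    true of the real file.
    """
    rows = []
    for record in csv.DictReader(handle):
        rows.append({name: float(value) for name, value in record.items()})
    return rows


customers = load_customers(io.StringIO(SAMPLE))
print(len(customers), sorted(customers[0]))
```

For the real file you would pass an opened file object instead of the in-memory sample, for example open(path, newline="") with path pointing to wherever you extracted the CSV.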

The next step is to actually run the CHAID analysis.

Select the menu item Analyze->Classify->Tree..
This presents you with the Decision Tree options as shown.

At this point, it would probably help to explain the example dataset a little. The set contains 5,000 rows, each row representing a customer (data is taken from a B2C context). The customers are all people who took out a loan with an initial tie-in period. I.e. for the first year they had to stick with that lender, paying back the regular monthly payments, but after the year was up, they were free to transfer the loan elsewhere – i.e. churn.

The last column in the table, Stayed, indicates whether the customer was still with the lender 2 months after the end of the first year (i.e. 14 months after the loan was first taken out).

The other columns represent information about the customers, from CRM records, and other referenced data. NB: This is a mixture of real data and made up data, created somewhat artificially for this exercise – so don’t worry about the actual values, more the principle and the structure.

These potential explanatory variables are of different types – some are nominal binary values. For example, FirstPurchaseProductA indicates whether the customer’s initial purchase with the lender was product A, yes or no. Other fields are scale variables – LoanAmount indicates the numerical amount for the loan taken out.

The purpose of CHAID is to indicate which of the explanatory variables are useful in predicting the value of the dependent variable, in this case Stayed. To set this scenario up in SPSS, move the Stayed variable into the Dependent Variable box, and all of the others into the Independent Variable box:

Make sure the Growing Method is set to CHAID, then click OK. You’ll see the output from the process looking like the following:

This first bit of output shows, under “Independent Variables Included”, the variables that the model selected as predictive of the output variable, Stayed.

This second diagram is the primary output from CHAID. The first split shown uses LoanAmount – this is saying that, given this particular dataset, the algorithm estimates that LoanAmount is the field that best allows you to estimate whether a customer stays or goes. Looking at the splits (the 2nd row of boxes), these show the algorithm creating 5 subgroups: <=5222, 5222-7568, 7568-10725, 10725-17164 and >17164. If you look in the boxes, at the % scores for the two categories (0 = didn’t stay, 1 = did stay with lender), then for the low LoanAmount values (e.g. <=5222), a lot more people stayed with the lender (84.7% of cases here) than in the high LoanAmount categories (e.g. >17164, where only 51.8% of customers stayed). This makes sense – if you have a very high loan amount, then your repayments will be higher and it will be worth your while shopping around at the end of the tie-in period. I.e. you are price-sensitive because of the high repayments. If your loan is for a small amount, it’s probably not worth your while (and the transfer fees) to move the loan – you might as well stick with the lender and pay off the rest of the amount.
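To make the percentages in those boxes concrete, here is a rough Python sketch of how the stay rate per LoanAmount band could be recomputed by hand. The cut points are the ones from the tree described above; the customer pairs are made up for illustration and are not rows from the sample file, and the function names band and stay_rates are my own.

```python
from bisect import bisect_left

# Upper bounds of the first four bands, taken from the tree described above.
# Anything larger than the last bound falls into the fifth band.
CUT_POINTS = [5222, 7568, 10725, 17164]
LABELS = ["<=5222", "5222-7568", "7568-10725", "10725-17164", ">17164"]


def band(loan_amount):
    """Return the label of the LoanAmount band a value falls into."""
    return LABELS[bisect_left(CUT_POINTS, loan_amount)]


def stay_rates(customers):
    """Percentage of customers with Stayed == 1 in each non-empty band."""
    totals = {}
    stayed = {}
    for loan_amount, did_stay in customers:
        label = band(loan_amount)
        totals[label] = totals.get(label, 0) + 1
        stayed[label] = stayed.get(label, 0) + did_stay
    return {
        label: 100.0 * stayed[label] / totals[label]
        for label in LABELS
        if label in totals
    }


# Made-up (LoanAmount, Stayed) pairs, purely for illustration.
toy = [(3000, 1), (5222, 1), (5000, 0), (6000, 1), (20000, 0), (18000, 1)]
rates = stay_rates(toy)
for label, rate in rates.items():
    print(f"{label}: {rate:.1f}% stayed")
```

Running the same calculation over the real 5,000 rows should give figures comparable to those in the tree boxes, provided the bands are treated as inclusive at their upper bound, which is what I’ve assumed here.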

The algorithm has picked out this effect from the behaviours shown in the dataset, and has additionally chosen IncomeRatio, CollateralValue, Age, CalcRat2, PreviousLoanRate, RiskScore and SourceGoogle as interesting fields.
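Under the hood, CHAID (Chi-squared Automatic Interaction Detection) judges a candidate variable by testing whether its categories are independent of the outcome. The sketch below computes only the Pearson chi-squared statistic for a table of counts. It leaves out the category merging and adjusted p-values that a full CHAID implementation applies, and both tables of counts are invented for illustration.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table.

    `table` is a list of rows of observed counts. Assumes every row
    and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if variable and outcome were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Made-up counts. Rows are two categories of a candidate variable,
# columns are [left, stayed].
predictive = [[15, 85], [48, 52]]
unrelated = [[30, 70], [31, 69]]

print(round(chi_squared(predictive), 2))
print(round(chi_squared(unrelated), 2))
```

The first table, where the stay rate differs sharply between the two categories, gives a much larger statistic than the second, where the rates are almost identical. A larger statistic for the same table shape means stronger evidence that the variable is related to Stayed.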

There are a couple of other tips to give here, both of which are common sense, but worth repeating:

  1. Make sure the predictive variables make sense. I understand why LoanAmount would impact behaviour, but you need to have this understanding for all fields. It’s possible to see all sorts of strange effects from invalid or skewed data, so it’s important to be certain of the results.
  2. Make sure the dependency is working in the correct direction and the timing of the data makes sense. For example, if you included a field such as “Customer phoned up to redeem his loan”, then obviously this is going to be predictive of the customer leaving the company. But – it’s not a piece of data available when the customer first joins and so shouldn’t be used as an independent variable.

The next step is to then create market segments based on these variables. At this stage you have two choices:

  1. Use the classification model created by the CHAID algorithm.
  2. Use the information you’ve gained on important variables to create more natural clusters in your data.

For the first of these – SPSS has nice features that allow you to export the classification rules you’ve created. In the initial Decision Tree dialogue, click on Output…, and this allows you to select an output format for your rules:

Here, we’re selecting to output the rules as SQL statements:
You can then use this SQL to create the classification rules in your database.

This method does give nice simple rules that you can use. However, in my next post, I will also describe an alternative that uses a Kohonen Network to find naturally occurring clusters in the data. Specifically, it’s useful for finding patterns in your customer dataset where particular fields correlate together. For example, do high LoanAmount values tend to correlate with younger people? Or do certain types of people tend to buy particular products on first purchase? This is what an unsupervised clustering algorithm can help with. NB: I wrote another post briefly mentioning the difference between supervised and unsupervised learning algorithms, though mainly for a slightly ham-fisted analogy. There’s a lot more information in the Wikipedia articles referenced in that post.

I’ll describe the use of Kohonen Networks to create these clusters in my next post. This uses an app that I’ve written, mainly because I’ve never quite figured out how to do what I want in SPSS 😉
