Churn Modelling with Random Forests
- Terence Shin
- Nov 22, 2018
- 7 min read
Tools: KNIME, Excel
Skills/Methods: Clustering, Random Forests, Pivot Tables
Overview
Our goal is to determine why customers churn from an internet service provider (ISP). Churn is defined as the percentage of customers who unsubscribe from a service, in this case, an ISP.
This can provide meaningful insights, as ISPs can pinpoint why their customers are leaving their service and can formulate plans to retain customers that are about to churn. Businesses can also use these insights to adjust their business models and address the reasons customers churn.
TLDR: I created a Random Forest Model using KNIME and determined that the root of the issue is that fiber optic internet customers have a much higher propensity to churn when they don't have online security, don't have tech support, have a month-to-month contract, or a combination of the three.
Data Understanding/Cleaning
Looking at the ispcustomerchurn.csv file, we are provided with 7,043 rows of data covering customers’ demographics, their types of service, their payment methods, their monthly and total charges, and whether each customer has churned or not.

By looking at the statistics, we can get a better understanding of the variables in our dataset in terms of data type, distributions, and mean and variance for numerical data. In the dataset, there are three numerical variables (tenure, MonthlyCharges, TotalCharges), while the rest are nominal attributes.
Another thing to note is that TotalCharges has 11 missing values, which we may need to take into consideration when developing our model; otherwise, the data appears to be clean, as there are no other missing values.
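The missing-value check can also be reproduced outside KNIME. The following is a minimal sketch using only Python's standard library. The three sample rows are made up for illustration (only the column names come from the dataset), and it assumes that missing values show up as blank cells in the CSV.

```python
import csv
import io

# Hypothetical three-row sample; only the column names come from the article.
SAMPLE_CSV = """customerID,tenure,MonthlyCharges,TotalCharges
A-001,1,29.85,29.85
A-002,0,52.55,
A-003,34,56.95,1889.5
"""

def count_missing(rows, columns):
    """Count blank or whitespace-only cells per column."""
    missing = {column: 0 for column in columns}
    for row in rows:
        for column in columns:
            value = row.get(column)
            if value is None or value.strip() == "":
                missing[column] += 1
    return missing

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
missing_counts = count_missing(rows, ["tenure", "MonthlyCharges", "TotalCharges"])
print(missing_counts)
```

To run this against the real file, the `io.StringIO(SAMPLE_CSV)` argument would be swapped for an open file handle on ispcustomerchurn.csv.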

I also wanted to take a look at the correlation matrix between all of my variables to see if there are any redundant variables or duplicates. I noticed that TotalCharges and tenure have an almost perfectly positive correlation, which makes sense intuitively as they both increase roughly linearly over time. However, since they do not necessarily mean the same thing, it is not something I will fix immediately; instead, I will keep it in mind for later.
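For readers who want to see what sits behind one cell of that correlation matrix, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The tenure and total-charge values are invented for illustration and are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)

# Hypothetical customers: months of tenure and total charges to date.
toy_tenure = [1, 12, 24, 36, 60]
toy_total_charges = [70, 850, 1700, 2500, 4300]

r = pearson(toy_tenure, toy_total_charges)
print(round(r, 4))
```

A value close to 1 here reflects the same pattern described above: total charges grow almost in lockstep with tenure.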
Model Selection
The next step is determining which model to use that is best suited for this dataset. The model selected should be able to obtain a high prediction accuracy and should also be able to tell us which attributes are the most important in determining the likelihood of a customer churning. The chart below outlines the pros and cons of the different models that can be used.

Decision: Random Forests (classification)
Random forests are an enhanced version of decision trees that use ensemble learning, which gives them stronger predictive power than a single decision tree. Random forests work very well for this particular problem for many reasons.
One, they are very effective when there are a large number of attributes, and this dataset has over 20. This is because random forest models require no input preparation: binary, categorical, and numerical features can be used without having to scale any attributes or convert one data type to another. Additionally, the splitting attributes do not have to be selected by hand.
Two, they are very convenient in the sense that they account for missing attributes, provide an out-of-bag error estimate that reduces the need for cross validation, and create a model with low bias and lower variance (less overfitting) than a single decision tree. This means that we can disregard the 11 missing values for the TotalCharges variable moving forward.
It’s also particularly useful in this scenario because it provides information on the predictive power of each attribute. This is useful for us since we are trying to determine the factors that cause a customer to churn.
Random Forests: Preliminary Model

I started by creating a simple Random Forest model in KNIME using all variables, before making any adjustments to my dataset, to establish a benchmark accuracy. I partitioned my dataset 80/20 between the training set and the testing set. In terms of parameters, I set the number of models (number of trees) to 10,000, set “no limit” for the number of levels (tree depth), and used the Gini Index as the split criterion. After running the model, I received the following confusion matrix.

Currently, the model has an accuracy of 78.992%. Based on the results, we can determine the importance of each variable by looking at the attributes with the most splits at level 0. The variables with the most predictive power (from highest to lowest) are:
1) customerID (1999 splits)
2) Contract (1676 splits)
3) OnlineSecurity (1232 splits)
4) tenure (1176 splits)
5) TechSupport (1087 splits)
Another, perhaps more accurate, metric is to compute the split/candidate ratio for each level and sum the ratios, which would put less weight on the lower levels and more weight on the upper ones.

According to this metric, the top five variables in terms of predictive capabilities remain the same, but tenure has more power than OnlineSecurity.
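The summed split/candidate ratio can be sketched in a few lines of Python. This is one reading of the metric described above, not KNIME's own implementation. The attribute names and all counts below are hypothetical; each pair is (number of splits, number of times the attribute was a candidate) at one tree level, starting from level 0.

```python
def split_candidate_score(level_stats):
    """Sum splits/candidates over tree levels, skipping levels with no candidates."""
    return sum(
        splits / candidates
        for splits, candidates in level_stats
        if candidates > 0
    )

# Hypothetical per-level statistics for two made-up attributes.
toy_stats = {
    "attr_a": [(50, 100), (30, 120)],  # 0.50 + 0.25
    "attr_b": [(40, 100), (60, 120)],  # 0.40 + 0.50
}

scores = {name: split_candidate_score(stats) for name, stats in toy_stats.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

In this toy example attr_a has more splits at level 0, yet attr_b ranks higher once the ratios are summed across levels, which is the kind of reordering seen between tenure and OnlineSecurity.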
Random Forests: Model Refinement
To refine our model, we can focus on attribute selection as well as the parameters of the model.
I tried fiddling around with the parameters of the random forest model (split criterion, tree depth, and number of models) and found that the Gini Index and having an unlimited number of levels resulted in the highest accuracy percentage given 10,000 models.
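For reference, the 80/20 partition and the Gini Index split criterion can be sketched in plain Python as follows. This is a simplified illustration of the two ideas, not a reproduction of KNIME's nodes: the function names, the fixed seed, and the rounding of the split point are all assumptions.

```python
import random
from collections import Counter

def partition(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

def split_impurity(left_labels, right_labels):
    """Size-weighted Gini impurity of a two-way split (lower is better)."""
    total = len(left_labels) + len(right_labels)
    if total == 0:
        return 0.0
    left_weight = len(left_labels) / total
    right_weight = len(right_labels) / total
    return (left_weight * gini_impurity(left_labels)
            + right_weight * gini_impurity(right_labels))

# 7,043 row indices stand in for the real rows.
train, test = partition(range(7043))
print(len(train), len(test))
print(gini_impurity(["Yes", "No"]))                   # evenly mixed node
print(split_impurity(["Yes", "Yes"], ["No", "No"]))   # perfectly separating split
```

A tree learner using this criterion picks, at each node, the candidate split with the lowest weighted impurity.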
Looking at the top 5 most predictive variables, customerID had the most splits at level 0 and also had the highest sum of split/candidate ratios. However, intuitively speaking, I believe that there’s no relationship between a customer’s ID and churn, and thus it should have no predictive power. After removing the attribute “customerID” and rerunning the model, we have the following results:
Model without “customerID”

The revised model has an accuracy of 80.412%, an improvement of about 1.4 percentage points (roughly 1.8% in relative terms) over the initial model. It’s also worth noting that the false negative rate decreased significantly, which is important as it is really costly for an ISP to assume that a customer will not churn when, in fact, they will.
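The quantities in this comparison are simple to compute from a confusion matrix. Here is a minimal sketch: the two accuracy figures are the ones reported above, while the confusion-matrix counts are made up for illustration because the actual matrices are only shown as images.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def false_negative_rate(tp, fn):
    """Share of actual churners that the model predicted would stay."""
    return fn / (tp + fn)

# Hypothetical counts, treating "churn" as the positive class.
toy_accuracy = accuracy(tp=50, tn=30, fp=10, fn=10)
toy_fnr = false_negative_rate(tp=30, fn=10)
print(toy_accuracy, toy_fnr)

# Accuracies reported for the two models, in percent.
initial_accuracy = 78.992
revised_accuracy = 80.412
point_gain = round(revised_accuracy - initial_accuracy, 3)
relative_gain = round(100 * (revised_accuracy - initial_accuracy) / initial_accuracy, 1)
print(point_gain, relative_gain)
```

The last two values distinguish the gain in percentage points from the gain relative to the starting accuracy.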

Looking at the chart above, our model shows us that InternetService replaced customerID as one of the top five attributes in terms of predictive power, along with tenure, OnlineSecurity, TechSupport, and Contract.
I tried removing the variables paperlessbilling (since payment method explains much the same thing) and gender (because I believed that it would have no predictive power), but removing one, the other, or both lowered the prediction accuracy. Therefore, I decided to keep both attributes in my model. I also tried removing TotalCharges since it was highly positively correlated with tenure, but that also lowered the predictive power of the model.
Furthermore, I created a new variable called PhoneAndInternet (yes if the customer has both phone and internet service, and no if otherwise) to see if having multiple lines increases retention. However, when including it in my model, it showed little predictive power and lowered the model’s overall prediction accuracy.
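A derived variable like PhoneAndInternet can be expressed as a small row-level rule. The sketch below assumes a column named PhoneService with "Yes"/"No" values and that InternetService uses "No" for customers without internet; both are assumptions about the file layout rather than details stated in this article.

```python
def phone_and_internet(row):
    """Return "Yes" if the customer has both phone and internet service, else "No"."""
    has_phone = row["PhoneService"] == "Yes"      # assumed column name and values
    has_internet = row["InternetService"] != "No"  # assumes "No" marks no internet
    return "Yes" if has_phone and has_internet else "No"

# Hypothetical customers.
toy_customers = [
    {"PhoneService": "Yes", "InternetService": "DSL"},
    {"PhoneService": "Yes", "InternetService": "No"},
    {"PhoneService": "No", "InternetService": "Fiber optic"},
]
flags = [phone_and_internet(customer) for customer in toy_customers]
print(flags)
```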
Post Analysis
After developing my model, I wanted to dive deeper into the five main attributes: tenure, OnlineSecurity, TechSupport, Contract, and InternetService. Previously we already determined that the higher the tenure is for a customer, the less likely they are to churn.

Using Excel, we can also see that not having tech support or online security increases the likelihood of churning by 26.5% and 34.4% respectively. Intuitively, this is justified because if a customer has a bad experience because of a lack of tech support when needed and/or a security breach due to a lack of online security, the customer may want to find other options that provide a better experience.
In terms of contract type, the churn rate for month-to-month contracts stands at 42.71%, while the churn rates for a one-year and two-year contract are 11.27% and 2.83% respectively. This also makes sense, as a customer with a month-to-month contract has twelve opportunities throughout the year to churn away to a different ISP with a better deal, unlike the other two contract types.
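The pivot-table calculation behind these churn rates amounts to grouping customers by one attribute and dividing churned customers by total customers in each group. A minimal sketch in Python follows; the rows are invented, and the "Churn" column name and the "Yes"/"No" labels are assumptions about the file.

```python
from collections import defaultdict

def churn_rate_by(rows, attribute):
    """Churn rate (in percent) for each distinct value of the given attribute."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for row in rows:
        key = row[attribute]
        totals[key] += 1
        if row["Churn"] == "Yes":
            churned[key] += 1
    return {key: 100.0 * churned[key] / totals[key] for key in totals}

# Hypothetical customers, not taken from the dataset.
toy_rows = (
    [{"Contract": "Month-to-month", "Churn": "Yes"}] * 2
    + [{"Contract": "Month-to-month", "Churn": "No"}] * 2
    + [{"Contract": "One year", "Churn": "Yes"}] * 1
    + [{"Contract": "One year", "Churn": "No"}] * 3
    + [{"Contract": "Two year", "Churn": "No"}] * 2
)
rates = churn_rate_by(toy_rows, "Contract")
print(rates)
```

The same function works for any nominal attribute, for example `churn_rate_by(rows, "InternetService")`.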

Looking at the attribute InternetService, the churn rate for Fiber optic customers is significantly higher at 41.89% compared to DSL customers or no internet customers at 18.96% and 7.4% respectively. Intuitively, this doesn’t make as much sense as the other attributes, so I took the analysis a bit further.


It is clear that fiber optic customers are much more likely to churn when they don’t have online security (49%), when they don’t have tech support (49%), and/or if their contract type is on a month-to-month basis.
What this may suggest is that fiber optic customers are churning because, as stated before, they have 12 opportunities throughout the year to cancel their service and find another ISP. However, while churn among fiber optic customers is strongly associated with month-to-month contracts, I doubt that this is the root of the problem.
Focusing on fiber optic customers with a month-to-month contract, if they have both online security and tech support, the churn rate drops to 28%. However, if they have neither online security nor tech support, the churn rate more than doubles to 61%.
Thus, the root of the problem lies in fiber optic internet customers who don’t have online security and/or tech support. This may stem from the fact that there was a recent security breach, or that there has been a lack of tech support to service the demand for it or a combination of both.
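The drill-down used to reach this conclusion, filtering to one segment and then comparing churn rates within it, can be sketched as follows. The rows are fabricated for illustration only, and the "Churn" column name and label values are assumptions; the real figures come from the Excel pivot tables.

```python
def segment_churn_rate(rows, **conditions):
    """Churn rate (in percent) among rows matching every column=value condition.

    Returns None when no row matches.
    """
    segment = [
        row for row in rows
        if all(row[column] == value for column, value in conditions.items())
    ]
    if not segment:
        return None
    churned = sum(1 for row in segment if row["Churn"] == "Yes")
    return 100.0 * churned / len(segment)

FIELDS = ("InternetService", "Contract", "OnlineSecurity", "TechSupport", "Churn")
# Hypothetical customers.
RAW = [
    ("Fiber optic", "Month-to-month", "No", "No", "Yes"),
    ("Fiber optic", "Month-to-month", "No", "No", "Yes"),
    ("Fiber optic", "Month-to-month", "No", "No", "No"),
    ("Fiber optic", "Month-to-month", "Yes", "Yes", "No"),
    ("Fiber optic", "Month-to-month", "Yes", "Yes", "No"),
    ("Fiber optic", "Month-to-month", "Yes", "Yes", "Yes"),
    ("Fiber optic", "Month-to-month", "Yes", "Yes", "No"),
    ("DSL", "One year", "No", "No", "No"),
]
toy_rows = [dict(zip(FIELDS, values)) for values in RAW]

base = {"InternetService": "Fiber optic", "Contract": "Month-to-month"}
with_both = segment_churn_rate(toy_rows, OnlineSecurity="Yes", TechSupport="Yes", **base)
with_neither = segment_churn_rate(toy_rows, OnlineSecurity="No", TechSupport="No", **base)
print(with_both, with_neither)
```

Holding the internet type and contract fixed while varying the add-ons is what separates the effect of the contract from the effect of online security and tech support.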
Conclusion
By using a variety of data modelling and analysis techniques, like clustering, random forests, and pivot tables, I was able to discover the following reasons that explain why a customer may churn:
1) The shorter the tenure, the more likely they are to churn.
2) Month-to-month contract users are more likely to churn than those with a one or two-year contract.
3) Most importantly and specifically, there are some major issues with fiber optic customers. The root of the problem is very hard to deduce, but the issue significantly impacts fiber optic users without tech support and/or online security.
What I recommend to management is to somehow increase the adoption rate of tech support and online security among fiber optic users. This may be done by simply promoting these services more and trying to upsell them, or by providing them for free if doing so provides positive value to the company based on a cost-benefit analysis (will the costs of providing them for free be offset by the reduction in churn?). I also encourage management to entice customers to switch to a one or two-year contract, to give customers fewer opportunities to churn away to other ISPs.
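The cost-benefit question in parentheses can be framed as a simple break-even calculation. Every number below is hypothetical, including the churn rates, and the sketch assumes that offering the add-ons actually causes the drop in churn, which the analysis above does not prove.

```python
def net_benefit(customers, churn_before, churn_after, value_per_customer, cost_per_customer):
    """Value of the extra customers retained minus the cost of the free add-ons."""
    retained = customers * (churn_before - churn_after)
    return retained * value_per_customer - customers * cost_per_customer

def break_even_cost(churn_before, churn_after, value_per_customer):
    """Highest per-customer add-on cost at which the offer still pays for itself."""
    return (churn_before - churn_after) * value_per_customer

# Hypothetical scenario: 1,000 customers, churn falls from 40% to 30%,
# each retained customer is worth 600, and the add-ons cost 40 per customer.
toy_net = net_benefit(1000, 0.40, 0.30, 600.0, 40.0)
toy_break_even = break_even_cost(0.40, 0.30, 600.0)
print(round(toy_net, 2), round(toy_break_even, 2))
```

If the actual cost per customer is below the break-even figure, giving the services away would be worth it under these assumptions.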