Customer retention is the primary pillar for building virtually any subscription-based business, including software, video game, media, and telecom companies. Nowadays, it is common to use advanced machine learning techniques to predict customer churn probability as accurately as possible. However, a good churn prevention solution requires more than just accuracy.
A churn prediction without explanatory power provides little business value on its own. It is important to understand where churn is coming from and what actions should be taken to prevent it. In order to do that, let us ask some fundamental questions:

- Who is going to churn?
- When are they likely to churn?
- Why are they churning?
- What treatment should be offered to retain them?
The four “W”s are the fundamental questions that we should ask before finding a churn solution. If we know “who,” we can optimize and personalize treatment options. If we know “when,” we can optimize the time of the treatment to achieve maximum efficiency in terms of retention, incremental lifetime value, and treatment costs. If we know “why,” we can improve our product, service, or customer relationship management. If we know “what,” we can improve customer satisfaction and reduce the cost of churn prevention activities.
The ability to identify and interpret churn patterns and prescribe the right treatments is as important as achieving churn prediction accuracy. In this article, we discuss how to build a solution that helps to quantify, investigate, and fight customer churn, complaints, and other issues related to customer dissatisfaction. The described approach has been successfully implemented for several clients and has proved itself to be a powerful generic framework.
From the functional perspective, the solution is designed to produce a set of scores and indicators for individual users that can be operationalized in two ways: systematically, by applying the same treatment to groups of customers who share the same churn triggers, and individually, by personalizing the treatment at the account level.
We have found that, in many practical cases, the following scores and indicators are sufficient for supporting a wide range of churn prevention activities: churn triggers associated with a user, churn risk level, churn risk profile (survival curve), and prescribed treatments. These outputs are shown in the figure below.
From the technical standpoint, it is important to leverage all available data sources, including well-structured account data, event sequences (such as the clickstream), user-generated content (such as product reviews and customer support call transcripts), and, finally, information about the available treatments (such as special offers or loyalty perks).
In the next section, we discuss the overall architecture of the solution that provides the above functionality as well as the design of the models and components it is assembled from.
The architecture of the solution generally accounts for both the goals we want to achieve and the data needed to achieve those goals. For the sake of illustration, let us assume the following four typical data sources:

- Account data: well-structured customer profile and subscription attributes
- Customer interaction data: event sequences such as the clickstream
- Textual data: user-generated content such as product reviews and customer support call transcripts
- Offer data: information about the available treatments, such as special offers or loyalty perks, and customer reactions to them
The textual data enable us to build a sentiment model for understanding customer churn reasons. The customer interaction data allow us to build an event model for churn prediction, and the offer data allow us to build a recommendation system based on customer reactions to historical offers. Hence, we created four major models that make up an end-to-end churn-prediction/offer-recommendation pipeline:

- A sentiment model that identifies churn triggers in textual data
- An event model that estimates churn probability from interaction sequences
- A survival model that produces a long-term churn risk profile
- A treatment model that prescribes the best offer for each user
Let us dive into each model separately.
The sentiment model analyzes textual data and identifies churn triggers. The training data are usually taken from the month preceding the churn event in order to capture churn-related topics. However, the time window can vary across industries, as the patience of customers in their decision-making differs.
A Bidirectional Encoder Representations from Transformers (BERT) model is typically used to extract negative snippets from the user-generated texts; topic modelling is then applied to the negative snippets, producing a word distribution for each topic from which we can identify churn drivers and churn indicators. This process is summarized in the figure below.
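To make this concrete, here is a minimal sketch in Python, assuming a pretrained BERT-family sentiment pipeline from the `transformers` library and LDA from scikit-learn as the topic model; the snippets and the 0.9 confidence threshold are purely illustrative:

```python
# Minimal sketch: filter negative snippets with a pretrained sentiment model,
# then run LDA topic modeling to surface candidate churn triggers.
from transformers import pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

snippets = [
    "The app keeps crashing after the last update.",
    "Support never answered my billing question.",
    "Love the new dashboard, great work!",
]

# 1. Score each snippet; keep only confidently negative ones.
sentiment = pipeline("sentiment-analysis")  # defaults to a BERT-family model fine-tuned on SST-2
negative = [s for s, r in zip(snippets, sentiment(snippets))
            if r["label"] == "NEGATIVE" and r["score"] > 0.9]

# 2. Topic modeling over the negative snippets only.
vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(negative)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(doc_term)

# 3. The top words of each topic hint at churn drivers (e.g., crashes, billing).
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    print(f"topic {k}:", [terms[i] for i in weights.argsort()[::-1][:5]])
```

In production, the same two-stage flow runs over full review and call-transcript corpora, and the topic-word distributions are reviewed by analysts to name the churn triggers.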
The sentiment model provides interpretative power for customer churn analytics, allowing systematic treatment and personalized treatment. For systematic treatment, we can use this model to identify groups of customers who have the same churn triggers as well as different types of treatment for different groups.
For personalized treatment, we can use this model to identify the exact churn driver for each customer and apply treatment at the account level. Comparing the two treatment approaches, personalized treatment can be more effective and can lower cost. However, it generally requires better modelling accuracy and granularity.
The event model helps to analyze customer journeys (i.e., sequences of interactions) and estimates the churn probability. We adopted the transformer architecture of Vaswani et al. (2017), which is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. This model has the following advantages over a traditional LSTM:

- Training parallelizes across sequence positions instead of proceeding step by step
- Self-attention captures long-range dependencies between events more directly
- The attention weights provide a degree of interpretability for individual predictions
The limitation of the transformer model is that it can process only fixed-length sequences; in our case, we pad shorter sequences with zeros to bring them to a fixed length.
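As an illustration, the sketch below (an assumed, simplified architecture, not the production model) pads event-ID sequences with zeros and scores churn with a small transformer encoder in PyTorch:

```python
# Minimal sketch: pad variable-length event sequences to a fixed length and
# classify churn with a small transformer encoder. Sizes are illustrative.
import torch
import torch.nn as nn

MAX_LEN, VOCAB, D = 50, 1000, 64  # 0 is reserved as the padding event ID

class EventChurnModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D, padding_idx=0)
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D, 1)

    def forward(self, seqs):                      # seqs: (batch, MAX_LEN) event IDs
        mask = seqs == 0                          # exclude padded positions from attention
        h = self.encoder(self.embed(seqs), src_key_padding_mask=mask)
        # Mean-pool over real (non-padded) positions only.
        h = h.masked_fill(mask.unsqueeze(-1), 0).sum(1) / (~mask).sum(1, keepdim=True)
        return torch.sigmoid(self.head(h)).squeeze(-1)  # churn probability per user

def pad(events):                                  # right-pad each sequence with zeros
    return torch.tensor([e[:MAX_LEN] + [0] * (MAX_LEN - len(e[:MAX_LEN])) for e in events])

probs = EventChurnModel()(pad([[5, 12, 7], [3, 3, 9, 41]]))
```

The padding mask is the key design choice here: it lets the fixed-length requirement coexist with variable-length journeys without the padded zeros polluting the attention scores.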
We also use clustering to find typical interaction patterns that indicate churn. The process includes the following steps:

1. Extract the attention weights produced by the trained event model for each customer journey
2. Treat these weights as a semantic representation (embedding) of the journey
3. Cluster the embeddings
4. Analyze each cluster to identify the interaction patterns associated with churn
Clustering in the semantic space of attention weights provides useful insights into the behavior of churners and highlights typical patterns.
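As an illustration, here is a minimal sketch of the clustering step, assuming the per-customer attention vectors have already been exported from the trained event model (random data stands in for them here):

```python
# Minimal sketch: cluster per-customer attention vectors to surface typical
# churn-related journey patterns. The attention matrix is a placeholder.
import numpy as np
from sklearn.cluster import KMeans

attention = np.random.rand(500, 50)   # (customers, attention weight per event slot)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(attention)

# Event positions with high average attention in a centroid point at the
# interactions that drive the churn prediction for that behavioral segment.
for k, center in enumerate(kmeans.cluster_centers_):
    print(f"cluster {k}: top event positions {center.argsort()[::-1][:3]}")
```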
For the survival model, we experimented with several approaches, each of which turned out to have limitations in our scenario.
In most cases, the best model for our scenario is a set of binary classification models, one for each month of the prediction horizon, built with gradient-boosted decision trees. It is also common to observe that the outputs of the event model and sentiment model significantly improve the performance of the survival model when used as its input features. The output of the model is a risk profile that helps to determine the optimal treatment time.
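The sketch below illustrates this design with scikit-learn's gradient boosting: one binary classifier per month over a six-month horizon, trained on synthetic placeholder features; the per-month churn probabilities form the risk profile, and one minus the risk gives the survival curve:

```python
# Minimal sketch: approximate a six-month risk profile with one gradient-boosted
# binary classifier per month. Features and labels are synthetic placeholders;
# churn_by_month[:, m] = True if the customer has churned by month m.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))            # account features + sentiment/event model scores
churn_by_month = (rng.random((1000, 6)) < 0.1).cumsum(axis=1) > 0

models = [GradientBoostingClassifier().fit(X, churn_by_month[:, m]) for m in range(6)]

def risk_profile(x):
    """Per-month churn probabilities for one customer; 1 - risk is the survival curve."""
    return [m.predict_proba(x.reshape(1, -1))[0, 1] for m in models]

print(risk_profile(X[0]))
```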
Finally, we adopted an uplift modeling approach with correction for sample-selection bias to prescribe treatments. The main idea is to estimate the potential impact of treatments on churn, using the survival model described in the previous section. Using customer features and treatments, we perform the following steps:

1. Estimate the churn risk for each user without any treatment, using the survival model
2. Estimate the churn risk for the same user under each candidate treatment
3. Compute the uplift of each treatment as the difference between the two risk estimates
4. Rank the candidate treatments for each user by their uplift values
The biggest challenge, however, is often not the uplift estimation but the correction of the selection bias in the control and treatment groups used for model evaluation. The ideal way to obtain unbiased samples for the treatment model development is to conduct A/B tests for all treatments of interest. In practice, this approach is prohibitively expensive and time consuming, so one typically needs to deal with biased data obtained under a legacy treatment-assignment policy. In this case, we need to use statistical methods, such as propensity score matching (PSM) or the Heckman correction, to perform bias correction.
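For illustration, here is a minimal propensity score matching sketch (one of the correction options named above), using a synthetic, deliberately biased treatment assignment: a logistic regression estimates the propensity, and each treated customer is matched to the nearest untreated customer in propensity-score space:

```python
# Minimal sketch of propensity score matching over synthetic, biased data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 6))                            # customer features
# Legacy policy: treatment probability depends on a feature, so groups are biased.
treated = rng.random(2000) < 1 / (1 + np.exp(-X[:, 0]))

# Estimate each customer's propensity to receive treatment.
propensity = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]
t_idx, c_idx = np.where(treated)[0], np.where(~treated)[0]

# Match every treated customer to the closest control by propensity score.
nn = NearestNeighbors(n_neighbors=1).fit(propensity[c_idx].reshape(-1, 1))
_, match = nn.kneighbors(propensity[t_idx].reshape(-1, 1))
matched_controls = c_idx[match.ravel()]   # (t_idx, matched_controls) pairs feed training
```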
Bias correction is applied at the training stage before input is fed into the survival model. At the inference stage, we use the risk estimates to calculate the uplift value for each user and rank the offers for individual users based on the uplift value. The complete process is shown in the figure below.
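Here is a minimal sketch of the inference-stage ranking, with a trivial stand-in for the trained survival model (the real model would consume the full feature set described earlier):

```python
# Minimal sketch: uplift = drop in predicted churn risk with vs. without an
# offer; offers are then ranked per user by that risk reduction.
import numpy as np

def survival_risk(features, offer_id):
    """Placeholder for the trained survival model's churn-risk estimate."""
    return 0.3 - 0.02 * offer_id + 0.01 * features.sum()

def rank_offers(features, offers):
    base = survival_risk(features, offer_id=0)           # 0 = no treatment
    uplift = {o: base - survival_risk(features, o) for o in offers}
    return sorted(uplift, key=uplift.get, reverse=True)  # largest risk reduction first

print(rank_offers(np.array([0.5, -1.2]), offers=[1, 2, 3]))
```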
The churn prevention suite described above is a complex solution, and it is critically important to have a solid validation methodology to ensure its quality and performance. Model validation is normally based on model stability, prediction accuracy, and the meaningfulness of the interpretation results. The reference validation workflow includes the following steps:

1. Evaluate prediction accuracy on held-out historical data, preferably using out-of-time splits
2. Assess model stability by retraining on different time windows and checking the consistency of scores and clusters
3. Review the interpretation outputs (churn triggers, behavior patterns, risk profiles) with domain experts to confirm they are meaningful and actionable
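As an illustration of the accuracy step, here is a minimal out-of-time backtest on synthetic placeholder data: train on earlier months, evaluate on the most recent ones:

```python
# Minimal sketch: time-based backtest for churn prediction accuracy.
# Splitting by calendar month (not at random) mimics production conditions.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 8))                     # placeholder features
y = X[:, 0] + rng.normal(size=3000) > 1            # placeholder churn labels
month = rng.integers(1, 13, 3000)                  # observation month, 1..12

train, test = month <= 10, month > 10              # train on the past, test on the future
model = GradientBoostingClassifier().fit(X[train], y[train])
print("out-of-time AUC:", roc_auc_score(y[test], model.predict_proba(X[test])[:, 1]))
```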
The churn prevention solution helps to identify and improve upon areas where customer service is lacking. It also helps to improve the efficiency of various analytics and treatment processes, including the following:

- Analysis of churn reasons and customer dissatisfaction
- Long-term planning of retention strategies and campaigns
- Selection and automation of retention offers
The solution described in this article offers a prescriptive framework that aims to answer the four “W”s (who, when, why, what) of customer churn. The sentiment and event models provide an explanatory modelling framework for churn behavior, helping business users to understand churn reasons. The survival model offers a long-term (six months or more) churn prediction, helping business users to plan retention strategies in advance of churn events. The treatment model picks the best offers, enabling a higher level of automation.