The aim of the project is to develop a customer interaction strategy based on analytical data for the gym chain Model Fitness.
First we will preprocess the data and study it:
We will develop a model for predicting the probability of churn (for the upcoming month) for each customer.
Draw up typical user portraits: select the most outstanding groups and describe their main features
Analyze the factors that impact churn most
Finally we will draw basic conclusions and develop recommendations on how to improve customer service:
Step 1. Data preprocessing
Reading the data base
Renaming columns
Checking and changing the data types
Checking for duplicated data
Checking for missing values
Step 2. Exploratory data analysis (EDA)
Statistical summary of the data
Splitting the data in two groups: left and stayed customers
Feature distributions for those who left (churn) and those who stayed
Correlation matrix
Step 3. Build a model to predict user churn
Dividing the data into train and validation sets
Logistic Regression model
Random Forest model
Step 4. User clusters
Data Standardization
Model visualization
K-Means Clustering
Features distribution for clusters
Churn rate for clusters
General conclusion and recommendations
import pandas as pd
from IPython.display import display
import plotly.express as px
from plotly import graph_objects as go
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
from numpy import median
from scipy import stats as st
import seaborn as sns
import math
from plotly.subplots import make_subplots
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_auc_score
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.cluster import hierarchy
from matplotlib.pyplot import cm
pip install seaborn --upgrade
sns.__version__
df = pd.read_csv('./gym_churn_us.csv')
display(df.head())
display(df.sample(6))
#converting the column names to lower case
df.columns = df.columns.str.lower()
df.sample(6)
df.info(memory_usage='deep')
#Checking for duplicated data
df.duplicated().sum()
print('There are', df.duplicated().sum(), 'duplicates in the data set.')
# checking for missing values in the data set
df.isnull().sum()
df.shape
The data for analysis contains 14 columns and 4000 rows: 13 characteristics (features) and the churn flag for 4000 users.
The data does not have any missing values or duplicates.
We have converted the column names to lower case.
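The preprocessing checks above can be bundled into one helper. The following is a minimal sketch on a made-up three-row frame; the function name `quick_checks` and the toy column names are assumptions for illustration and are not part of the project data.

```python
import pandas as pd

def quick_checks(frame):
    """Lower-case the column names and report shape, duplicates and missing values."""
    out = frame.copy()
    out.columns = out.columns.str.lower()
    report = {
        "rows": out.shape[0],
        "columns": out.shape[1],
        "duplicates": int(out.duplicated().sum()),
        "missing": int(out.isnull().sum().sum()),
    }
    return out, report

# toy data (hypothetical): the third row repeats the first, one value is missing
toy = pd.DataFrame({"Age": [25, 31, 25], "Churn": [1, None, 1]})
clean, report = quick_checks(toy)
print(list(clean.columns))
print(report)
```

Returning the report as a dictionary keeps the checks easy to rerun after every cleaning step.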
#statistical summary for numerical variables
df.describe().round(2)
Checking what kind of values some features contain:
df['contract_period'].value_counts()
values = df['contract_period'].value_counts()
labels = values.index  # one label per contract length, aligned with the counts
fig = go.Figure(data=[go.Pie(labels=labels, values=values, opacity = 0.75)])
fig.update_layout(title_text='Shares of different types of contracts')
fig.show()
55% of customers buy short-term contracts (1 month); the shares of 12-month and 6-month contracts are similar (21-24% each). These shares will affect the churn rate over the customer lifetime.
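The shares shown in the pie chart can also be read as plain numbers. A minimal sketch, assuming a hypothetical series of 20 contract lengths (the values are invented so that the arithmetic is easy to follow):

```python
import pandas as pd

# hypothetical contract lengths in months for 20 customers
contracts = pd.Series([1] * 11 + [6] * 4 + [12] * 5, name="contract_period")

# normalize=True returns fractions instead of counts; convert them to percent
shares = contracts.value_counts(normalize=True).mul(100).round(1)
print(shares)
```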
Checking for outliers:
#Calculating the 95th and 99th percentiles of age (to identify outliers).
np.percentile(df['age'], [95, 99])
#Calculating the 95th and 99th percentiles of avg_additional_charges_total (to identify outliers).
np.percentile(df['avg_additional_charges_total'], [95, 99])
#Calculating the 95th and 99th percentiles of avg_class_frequency_total (to identify outliers).
np.percentile(df['avg_class_frequency_total'], [95, 99])
The age of users varies from 18 to 41 years. The middle 50% of users are between 27 and 31 years old.
No more than 1% of users are over 37 years old, so users older than 37 are outliers in our distribution.
The middle 50% of customers spend from USD 69 to 211 during their contract period. Only 1% of customers spend over USD 400.
The middle 50% of customers go to the gym 1-2.5 times a week. Only 5% of customers visit the gym more than 3 times a week, and only 1% more than 4 times a week.
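The "middle 50%" and the upper percentiles quoted above come from the same kind of calculation. A minimal sketch on a toy sample of the integers 0 to 100; the helper name `spread_summary` is an assumption made for this illustration:

```python
import numpy as np

def spread_summary(values):
    """Return the minimum, quartiles, upper percentiles and maximum of a sample."""
    q25, q75, p95, p99 = np.percentile(values, [25, 75, 95, 99])
    return {
        "min": float(np.min(values)),
        "q25": float(q25),   # lower bound of the middle 50%
        "q75": float(q75),   # upper bound of the middle 50%
        "p95": float(p95),   # only 5% of values lie above this
        "p99": float(p99),   # only 1% of values lie above this
        "max": float(np.max(values)),
    }

sample = np.arange(101)  # toy sample, not project data
summary = spread_summary(sample)
print(summary)
```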
The data split into customers that churn and those who did not:
#creating a table
df_churn_split = df.groupby('churn').mean()
df_churn_split.round(2).T
Among those who left and those who stayed, the share of men and women is the same, so gender does not affect the churn rate. Those who live far from the gym churn more. People who buy short contracts (1-6 months) tend to churn more. Younger people (27 years old on average) churn more, as do those who visit less often per week and spend less on additional services in the gym.
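The same comparison can be turned around: instead of averaging features per churn group, one can compute the churn rate per value of a feature. A minimal sketch with invented data for eight customers (the numbers are assumptions, chosen only to make the result easy to check):

```python
import pandas as pd

# hypothetical data: 1 = lives near the gym, 0 = lives far; churn 1 = left
toy = pd.DataFrame({
    "near_location": [1, 1, 1, 1, 0, 0, 0, 0],
    "churn":         [0, 0, 0, 1, 1, 1, 0, 1],
})

# the mean of a 0/1 column is the share of ones, i.e. the churn rate
churn_rate = toy.groupby("near_location")["churn"].mean().mul(100)
print(churn_rate)
```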
First we will take a general look at the feature distributions on pairplots and then look into the distribution of each feature individually.
#plotting general view for the non-boolean features
sns.pairplot(df[['contract_period', 'age', 'avg_additional_charges_total', 'month_to_end_contract',
'lifetime', 'avg_class_frequency_total', 'avg_class_frequency_current_month', 'churn']], hue='churn')
# suptitle places the title above the whole grid rather than on the last subplot
plt.suptitle('Histograms and feature distributions for those who churn and those who stayed', y=1.02)
plt.show()
The general view of the pairplots supports the earlier conclusions. Features of those who churn more are as follows:
Next we will split the data set, adjust the parts for visualising the distributions and plot the charts.
#creating separate db for users that churn
df_churn = df.query('churn==1').drop(['churn'], axis = 1)
df_churn.shape
#creating separate db for users that didn't churn
df_stayed = df.query('churn!=1').drop(['churn'], axis = 1)
df_stayed.shape
# copy so that the replacements below do not modify (or warn about) the original df
df_categorical = df[['gender', 'age','near_location', 'partner', 'promo_friends', 'phone', 'group_visits', 'month_to_end_contract', 'lifetime', 'churn']].copy()
#adding column for counting
df_categorical.insert(0, "count", 1)
df_categorical['gender'] = df_categorical['gender'].replace(to_replace=0, value ="female").replace(to_replace=1, value ="male")
df_categorical['near_location'] = df_categorical['near_location'].replace(to_replace=0, value ="far").replace(to_replace=1, value ="near")
df_categorical['partner'] = df_categorical['partner'].replace(to_replace=0, value ="not_partner").replace(to_replace=1, value ="partner")
df_categorical['churn'] = df_categorical['churn'].replace(to_replace=0, value ="stayed").replace(to_replace=1, value ="left")
df_categorical['promo_friends'] = df_categorical['promo_friends'].replace(to_replace=0, value ="alone").replace(to_replace=1, value ="through_friends")
df_categorical['phone'] = df_categorical['phone'].replace(to_replace=0, value ="no_phone").replace(to_replace=1, value ="with_phone")
df_categorical['group_visits'] = df_categorical['group_visits'].replace(to_replace=0, value ="no_group_visits").replace(to_replace=1, value ="with_group_visits")
df_hist = df[['avg_additional_charges_total', 'avg_class_frequency_total', 'avg_class_frequency_current_month', 'churn']]
parameters = ['gender', 'age', 'near_location', 'partner',
'promo_friends', 'phone', 'group_visits', 'month_to_end_contract', 'lifetime', 'churn']
parameters_dis = ['avg_additional_charges_total', 'avg_class_frequency_total', 'avg_class_frequency_current_month']
for x in parameters:
    plt.figure(figsize=(18,5))
    ax = sns.barplot(x=x, y="count", data=df_categorical, hue='churn', estimator=sum, palette="mako")
    for p in ax.patches:
        ax.annotate(format(p.get_height(), '.0f'),
                    (p.get_x() + p.get_width() / 2., p.get_height()),
                    ha = 'center', va = 'center',
                    xytext = (0, 9),
                    textcoords = 'offset points')
    plt.title('Distribution of {}'.format(x))
    plt.show()
for x in parameters_dis:
    plt.figure(figsize=(18,5))
    sns.kdeplot(data=df_hist, x=x, hue="churn", hue_order=[1,0], alpha = 0.6 , palette="mako", fill=True)
    plt.show()
The charts and the table gave us the following information:
The number of men is slightly higher than the number of women; the churn rate is the same regardless of gender.
The middle 50% of customers are between 27 and 31 years old; younger customers churn more. From age 28-29 upwards, the number of customers who stay exceeds the number of those who leave.
Those who live far from the gyms churn more.
Likewise, users who are employees of a partner company churn less.
The number of customers who were brought to the gym by their friends is almost half the number of those who came alone, but the churn rate among those who came with friends is much lower.
Most customers share their contact information with the gym; the churn rate among those without contact information is, as expected, higher.
People who made no group visits churn more.
Overall, the highest churn is among those who have 1 month left until the end of their contract. The next largest waves of churn come at 6 months, 5 months and 12 months (right after signing the contract) before the end of the contract.
The average level of additional charges is lower among those who churn; people who stay tend to spend more on additional gym services: cafe, athletic goods, cosmetics, massages, etc.
The middle 50% of customers spend from USD 69 to 211 during their contract period. Only 1% of customers spend over USD 400.
The higher the frequency of visits, the less customers churn.
#defining correlation between the features
corr = df.corr()
plt.figure(figsize=(18,18))
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, annot=True,
cmap=sns.diverging_palette(220, 10, as_cmap=True))
plt.show()
The matrix shows that we have 2 pairs of intercorrelated features: month_to_end_contract with contract_period, and avg_class_frequency_current_month with avg_class_frequency_total. One feature from each pair can be omitted from the model.
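Strongly correlated pairs can also be listed programmatically rather than read off the heatmap. A minimal sketch on a toy frame; the helper name `correlated_pairs` and the 0.8 threshold are assumptions for illustration, not values taken from the analysis:

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.8):
    """List feature pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = frame.corr().abs()
    cols = list(corr.columns)
    pairs = []
    # walk only the upper triangle so that each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 2)))
    return pairs

a = np.arange(10)
# toy data: b is a linear function of a, c alternates and is almost unrelated
toy = pd.DataFrame({"a": a, "b": 2 * a + 1, "c": [1, -1] * 5})
pairs = correlated_pairs(toy)
print(pairs)
```

On the real data one would pass `df.drop('churn', axis=1)` so that the target is not compared with the features.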
#dropping extra columns of duplicated features and target variable column
X = df.drop(['churn', 'avg_class_frequency_current_month', 'contract_period'] , axis = 1)
y = df['churn']
X.columns
# divide the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
lr_model = LogisticRegression(random_state=0)
# training the model
lr_model.fit(X_train, y_train)
# use the trained model to make predictions
lr_predictions = lr_model.predict(X_test)
lr_probabilities = lr_model.predict_proba(X_test)[:,1]
# defining the new model's algorithm based on the random forest algorithm
rf_model = RandomForestClassifier(n_estimators = 100, random_state=0)
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)
rf_probabilities = rf_model.predict_proba(X_test)[:,1]
def print_all_metrics(y_true, y_pred, y_proba, title = 'Classification metrics'):
    print(title)
    print('\tAccuracy: {:.2f}'.format(accuracy_score(y_true, y_pred)))
    print('\tPrecision: {:.2f}'.format(precision_score(y_true, y_pred)))
    print('\tRecall: {:.2f}'.format(recall_score(y_true, y_pred)))
    print('\tF1: {:.2f}'.format(f1_score(y_true, y_pred)))
    print('\tROC_AUC: {:.2f}'.format(roc_auc_score(y_true, y_proba)))
# printing all metrics for both models:
print_all_metrics(y_test, lr_predictions, lr_probabilities , title='Metrics for logistic regression:')
print_all_metrics(y_test, rf_predictions, rf_probabilities, title = 'Metrics for random forest:')
Now we will create a separate table with the models' metrics to visualise them and choose the better model.
#creating a function
def metrics(y_true, y_pred, y_proba):
    all_metrics = []
    all_metrics.append(accuracy_score(y_true, y_pred))
    all_metrics.append(precision_score(y_true, y_pred))
    all_metrics.append(recall_score(y_true, y_pred))
    all_metrics.append(f1_score(y_true, y_pred))
    all_metrics.append(roc_auc_score(y_true, y_proba))
    return all_metrics
lr_metrics = metrics(y_test, lr_predictions, lr_probabilities)
rf_metrics = metrics(y_test, rf_predictions, rf_probabilities)
#Creating a db for visualisation of results:
metrics_list = ['accuracy_score', 'precision_score', 'recall_score', 'f1_score', 'roc_auc_score']
data={'metrics': metrics_list, 'LogisticRegression': lr_metrics, 'RandomForestClassifier':rf_metrics
}
df_models = pd.DataFrame(data, columns = ['metrics', 'LogisticRegression', 'RandomForestClassifier'])
print (df_models)
#visualizing results
x0 = df_models['LogisticRegression']
x1 = df_models['RandomForestClassifier']
labels = df_models['metrics']
x = np.arange(len(labels))
fig, ax = plt.subplots(figsize=(10,4))
x0.plot(kind="bar",color="teal",alpha=0.5, edgecolor='k').legend()
x1.plot(kind="bar", color="lightblue", alpha=0.5, edgecolor='k', linewidth=1).legend()
ax.set_ylabel('Scores')
ax.set_xlabel('Metrics', size=15)
ax.set_title('Evaluation of models results', size=20)
ax.set_xticks(x)
ax.set_xticklabels(labels, rotation=0)
plt.show()
According to the models' metrics, the random forest classifier predicts the target variable more accurately: it scores higher on every metric.
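A fitted random forest also exposes `feature_importances_`, which is one way to approach the question of which factors affect churn most. A minimal sketch on synthetic data, where the column names `signal`, `noise_1` and `noise_2` are invented and the target depends on `signal` only by construction:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# synthetic features: only "signal" carries information about the target
X_toy = pd.DataFrame(rng.normal(size=(500, 3)), columns=["signal", "noise_1", "noise_2"])
y_toy = (X_toy["signal"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_toy, y_toy)

# impurity-based importances sum to 1; larger means the feature was more useful for splits
importances = pd.Series(forest.feature_importances_, index=X_toy.columns)
importances = importances.sort_values(ascending=False)
print(importances.round(3))
```

With the notebook's objects the equivalent call would be `pd.Series(rf_model.feature_importances_, index=X.columns)`. Impurity-based importances are a rough guide rather than a causal ranking.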
# obligatory standardization of data before passing it to the algorithm
sc = StandardScaler()
X_sc = sc.fit_transform(X)
# visualizing with dendrogram plots
linked = linkage(X_sc, method = 'ward')
cmap = cm.rainbow(np.linspace(0, 1, 5))
hierarchy.set_link_color_palette([mpl.colors.rgb2hex(rgb[:3]) for rgb in cmap])
plt.figure(figsize=(15, 10))
dendrogram(linked, orientation='top')
plt.axhline(linestyle='--', y=53)
plt.grid(axis='y'); plt.xlabel('leaves'); plt.ylabel('distances')
plt.title('Hierarchical clustering')
plt.show()
As the dendrogram cannot tell us how many clusters we should have, we chose the number ourselves (n=5). In the chart this level is marked with a dashed line: cutting the tree at this distance yields the required number of clusters (5).
Generally speaking, we cannot use the dendrogram as a tool for determining the number of clusters in the data. The dendrogram is most accurate at the bottom, where it shows which items (leaves) are very similar. Although in our case it is not readable at that level, it gives a general impression of the method.
As we have already standardized the data, we can define K-Means right away:
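Cutting the tree into a chosen number of clusters can also be done in code rather than by eye. A minimal sketch using SciPy's `fcluster` on synthetic points; the five blob centres are invented for the illustration and have nothing to do with the gym data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# five well-separated toy blobs of 10 points each
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10], [20, 20]])
points = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in centers])

linked_toy = linkage(points, method="ward")
# criterion="maxclust" cuts the tree so that at most t flat clusters remain
flat_labels = fcluster(linked_toy, t=5, criterion="maxclust")
print(np.unique(flat_labels, return_counts=True))
```

Applied to the notebook's `linked` matrix, the same call would return one cluster label per customer.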
km = KMeans(n_clusters = 5, random_state=0) # setting the number of clusters as 5
labels = km.fit_predict(X_sc) # applying the algorithm to the data and forming a cluster vector
# store cluster labels into the field of our dataset
df['cluster'] = labels
# print the statistics of the mean feature values per cluster
df_clusters = df.groupby(['cluster']).mean().reset_index()
df_clusters
The table clearly illustrates that the mean values of most features differ between the clusters. We will plot the feature distributions for every cluster to make this more obvious.
First we will adjust the data set and then visualise the distribution of features for every cluster.
#creating a separate db (a copy, so that the original df stays numeric)
df_cl_charts = df.copy()
#adjusting the db for visualisation
df_cl_charts['gender'] = df_cl_charts['gender'].replace(to_replace=0, value ="female").replace(to_replace=1, value ="male")
df_cl_charts['near_location'] = df_cl_charts['near_location'].replace(to_replace=0, value ="far").replace(to_replace=1, value ="near")
df_cl_charts['partner'] = df_cl_charts['partner'].replace(to_replace=0, value ="not_partner").replace(to_replace=1, value ="partner")
df_cl_charts['churn'] = df_cl_charts['churn'].replace(to_replace=0, value ="stayed").replace(to_replace=1, value ="left")
df_cl_charts['promo_friends'] = df_cl_charts['promo_friends'].replace(to_replace=0, value ="alone").replace(to_replace=1, value ="through_friends")
df_cl_charts['phone'] = df_cl_charts['phone'].replace(to_replace=0, value ="no_phone").replace(to_replace=1, value ="with_phone")
df_cl_charts['group_visits'] = df_cl_charts['group_visits'].replace(to_replace=0, value ="no_group_visits").replace(to_replace=1, value ="with_group_visits")
df_cl_charts.insert(0, "count", 1)
#plotting graphs
parameters_cat = ['gender', 'age', 'near_location', 'partner',
'promo_friends', 'phone', 'group_visits', 'month_to_end_contract', 'lifetime', 'churn']
parameters_dis = ['avg_additional_charges_total', 'avg_class_frequency_total', 'avg_class_frequency_current_month']
for x in parameters_cat:
    plt.figure(figsize=(18,3))
    ax = sns.barplot(x=x, y="count", data=df_cl_charts, hue='cluster', estimator=sum, palette="mako")
    for p in ax.patches:
        ax.annotate(format(p.get_height(), '.0f'),
                    (p.get_x() + p.get_width() / 2., p.get_height()),
                    ha = 'center', va = 'center',
                    xytext = (0, 9),
                    textcoords = 'offset points')
    plt.title('Distribution of {} feature'.format(x))
    plt.show()
for x in parameters_dis:
    plt.figure(figsize=(18,3))
    sns.kdeplot(data=df_cl_charts, x=x, hue="cluster", alpha = 0.4 , palette="mako", fill=True)
    plt.title('Distribution of {} feature'.format(x))
    plt.show()
The features are distributed among the clusters in the following way:
We are going to calculate and visualize the churn rate for every cluster.
#creating a table and calculating the rate
df_clusters_churn = df_clusters[['cluster','churn']].sort_values(by='churn', ascending=True).reset_index(drop=True)
df_clusters_churn['cluster'] = df_clusters_churn['cluster'].astype(str)
df_clusters_churn['churn'] = df_clusters_churn['churn']*100
df_clusters_churn
# plotting bar chart
plt.figure(figsize=(9,5))
# x and y are passed as keywords: recent seaborn versions no longer accept them positionally
ax = sns.barplot(x=df_clusters_churn.cluster, y=df_clusters_churn.churn, palette='mako')
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.set(xlabel="Clusters", ylabel='Churn rate')
ax.set_xticklabels(df_clusters_churn.cluster)
for item in ax.get_xticklabels(): item.set_rotation(0)
# Series.items() replaces iteritems(), which was removed in pandas 2.0
for i, v in enumerate(df_clusters_churn["churn"].items()):
    ax.text(i ,v[1], "{:.1f}%".format(v[1]), color='k', va ='bottom', rotation=0)
plt.title('Churn rate by clusters')
plt.show()
The highest churn rate is in cluster no. 1 (40% of customers churn); clusters 2 and 3 have the lowest churn rate (17.4%-17.5% of customers leave the gyms).
K-Means formed 5 groups, but in terms of churn rate only 4 of them differ substantially. Thus, when forming the marketing strategy, we can work with 4 general groups.
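The whole clustering step (standardize, cluster, compute the churn rate per cluster) can be sketched in a few lines. This is an illustration on synthetic data, not the project's result: the two blobs, the column `visits_per_week` and the perfect link between blob and churn are all assumptions built into the toy set.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# two synthetic customer groups: long contracts with frequent visits vs. the opposite
loyal = rng.normal(loc=[12.0, 3.0], scale=0.3, size=(30, 2))
at_risk = rng.normal(loc=[1.0, 1.0], scale=0.3, size=(30, 2))
toy = pd.DataFrame(np.vstack([loyal, at_risk]),
                   columns=["contract_period", "visits_per_week"])
toy["churn"] = [0] * 30 + [1] * 30  # by construction only the second group churns

# standardize first so that both features weigh equally in the distance
scaled = StandardScaler().fit_transform(toy[["contract_period", "visits_per_week"]])
toy["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)

# churn rate in percent per cluster, lowest first
cluster_churn = toy.groupby("cluster")["churn"].mean().mul(100).sort_values()
print(cluster_churn)
```

Note that `churn` is left out of the features passed to K-Means, as in the notebook, so the clusters are formed without seeing the target.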
In the course of data preprocessing we studied data with 4000 entries and 13 features.
We have:
The age of users varies from 18 to 41 years. The middle 50% of users are between 27 and 31 years old.
No more than 1% of users are over 37 years old, so users older than 37 are outliers in our distribution.
The middle 50% of customers spend from USD 69 to 211 during their contract period. Only 1% of customers spend over USD 400.
The middle 50% of customers go to the gym 1-2.5 times a week. Only 5% of customers visit the gym more than 3 times a week, and only 1% more than 4 times a week.
The following people tend to churn more: those who live far from the gym, are younger (27 years old on average), are not employees of gym partners, came alone, bought short-term contracts, did not make group visits, did not buy additional services in the gym, have less than 2 months left until the end of the contract, or visit the gym only 1-2 times a week.
We examined the correlation between the features and found 2 pairs of intercorrelated features: month_to_end_contract with contract_period, and avg_class_frequency_current_month with avg_class_frequency_total. One feature from each pair can be omitted from the model for ML and predictions.
We built two models using supervised ML: a Logistic Regression model and a Random Forest model (RandomForestClassifier). All metrics indicated that the second model is better at predicting the target variable (it scored higher on every metric).
Next we defined 5 clusters and looked into feature distribution for every cluster.
As to the churn rate as the target value, it differs clearly between 4 out of 5 clusters. The highest churn rate is in cluster no. 1 (40% of customers churn); clusters 2 and 3 have the lowest churn rate (17.4%-17.5% of customers leave the gyms).
General recommendations:
Main marketing efforts should be aimed at the 27-31 age group (the middle 50% of customers). Only 5% of customers are older than 34.
Additional marketing efforts can be channelled into creating an additional niche and generating demand among those groups of potential customers that are outside the middle 50% of current customers:
12 months before the end of the contract (right after signing it) the gym can offer the client something to help them turn gym visits into a new routine:
5 months and prior to the end of the contract the gym can offer some bonuses to increase client loyalty and encourage them to maintain the sports routine:
Additional efforts should be made to encourage people to buy side gym services: cafe, athletic goods, cosmetics, massages, etc. As people get more positive experiences, they associate themselves more with the gym and with a certain lifestyle, and ultimately churn less.
Since customers churn less the more frequently they visit, the gym should encourage customers not to skip training sessions: phone reminders or calls, personal contact with instructors, a bonus system for every month without skipped sessions.
As those who came with friends churn less, the marketing department should launch loyalty programmes and "come with a friend" campaigns.