The aim of the project is to develope a customer interaction strategy based on analytical data for gym chain Model Fitness.
First we will preprocess the data and study it:

find missing values and duplicates
look into features distribution depending on the churn.

We will develope a model for predicting the probability of churn (for the upcoming month) for each customer.
Draw up typical user portraits: select the most outstanding groups and describe their main features
Analyze the factors that impact churn most

Finally we will draw basic conclusions and develop recommendations on how to improve customer service:

Identify target groups
Suggest measures to cut churn
.

Table of contents:¶

Step 1. Data preprocessing
Reading the data base
Renaming columns
Checking and changing the data types
Checking for duplicated data
Checking for missing values
Step 2. Exploratory data analysis (EDA)
Statistical summary of the data
Splitting the data in two groups: left and stayed customers
Feature distributions for those who left (churn) and those who stayed
Correlation matrix
Step 3. Build a model to predict user churn
Dividing the data into train and validation sets
Logistic Regression model
Random Forest model
Step 4. User clusters
Data Standardization
Model visualization
K-Means Clustering
Features distribution for clusters
Churn rate for clusters
General conclusion and recommendations

import pandas as pd
from IPython.display import display
import plotly.express as px
from plotly import graph_objects as go
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
from numpy import median
from scipy import stats as st
import seaborn as sns
import math 
from plotly.subplots import make_subplots
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_auc_score
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import dendrogram, linkage

from scipy.cluster import hierarchy
from matplotlib.pyplot import cm

pip install seaborn --upgrade

Collecting seaborn
  Downloading seaborn-0.11.1-py3-none-any.whl (285 kB)
Requirement already satisfied, skipping upgrade: pandas>=0.23 in c:\users\hp\anaconda3\lib\site-packages (from seaborn) (1.0.5)
Requirement already satisfied, skipping upgrade: numpy>=1.15 in c:\users\hp\anaconda3\lib\site-packages (from seaborn) (1.18.5)
Requirement already satisfied, skipping upgrade: matplotlib>=2.2 in c:\users\hp\anaconda3\lib\site-packages (from seaborn) (3.2.2)
Requirement already satisfied, skipping upgrade: scipy>=1.0 in c:\users\hp\anaconda3\lib\site-packages (from seaborn) (1.5.0)
Requirement already satisfied, skipping upgrade: python-dateutil>=2.6.1 in c:\users\hp\appdata\roaming\python\python38\site-packages (from pandas>=0.23->seaborn) (2.8.1)
Requirement already satisfied, skipping upgrade: pytz>=2017.2 in c:\users\hp\anaconda3\lib\site-packages (from pandas>=0.23->seaborn) (2020.1)
Requirement already satisfied, skipping upgrade: cycler>=0.10 in c:\users\hp\anaconda3\lib\site-packages (from matplotlib>=2.2->seaborn) (0.10.0)
Requirement already satisfied, skipping upgrade: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in c:\users\hp\anaconda3\lib\site-packages (from matplotlib>=2.2->seaborn) (2.4.7)
Requirement already satisfied, skipping upgrade: kiwisolver>=1.0.1 in c:\users\hp\anaconda3\lib\site-packages (from matplotlib>=2.2->seaborn) (1.2.0)Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied, skipping upgrade: six>=1.5 in c:\users\hp\appdata\roaming\python\python38\site-packages (from python-dateutil>=2.6.1->pandas>=0.23->seaborn) (1.15.0)
Installing collected packages: seaborn
  Attempting uninstall: seaborn
    Found existing installation: seaborn 0.11.0
    Uninstalling seaborn-0.11.0:
      Successfully uninstalled seaborn-0.11.0
Successfully installed seaborn-0.11.1

sns.__version__

'0.11.0'

Part 1. Data preprocessing ¶

next part

Reading the data base ¶

df = pd.read_csv('./gym_churn_us.csv')
display(df.head())

display(df.sample(6))

Renaming columns ¶

#getting rid of the upper case lettres
df.columns = df.columns.str.lower()
df.sample(6)

Checking and changing the data types ¶

df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 14 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   gender                             4000 non-null   int64  
 1   near_location                      4000 non-null   int64  
 2   partner                            4000 non-null   int64  
 3   promo_friends                      4000 non-null   int64  
 4   phone                              4000 non-null   int64  
 5   contract_period                    4000 non-null   int64  
 6   group_visits                       4000 non-null   int64  
 7   age                                4000 non-null   int64  
 8   avg_additional_charges_total       4000 non-null   float64
 9   month_to_end_contract              4000 non-null   float64
 10  lifetime                           4000 non-null   int64  
 11  avg_class_frequency_total          4000 non-null   float64
 12  avg_class_frequency_current_month  4000 non-null   float64
 13  churn                              4000 non-null   int64  
dtypes: float64(4), int64(10)
memory usage: 437.6 KB

Checking for duplicated data ¶

#Checking for duplicated data
df.duplicated().sum()
print('There are', df.duplicated().sum(), 'duplicates in the data set.')

There are 0 duplicates in the data set.

Checking for missing values ¶

# checking for missing values in the data set
df.isnull().sum()

gender                               0
near_location                        0
partner                              0
promo_friends                        0
phone                                0
contract_period                      0
group_visits                         0
age                                  0
avg_additional_charges_total         0
month_to_end_contract                0
lifetime                             0
avg_class_frequency_total            0
avg_class_frequency_current_month    0
churn                                0
dtype: int64

df.shape

(4000, 14)

Conclusion¶

The data for analysis contains 14 rows with 4000 entries; data on characteristics (features) and churn for 4000 unique users.
The data does not have any missing values or duplicates.
We have renamed the columns using clear names without upper case letters.

Step 2. Exploratory data analysis (EDA) ¶

back to top

Statistical summary of the data ¶

#statistical summary for numerical variables
df.describe().round(2)

Checking what kind of values some features contain:

df['contract_period'].value_counts()

1     2207
12     960
6      833
Name: contract_period, dtype: int64

labels = df['contract_period']
values = df['contract_period'].value_counts()
fig = go.Figure(data=[go.Pie(labels=labels, values=values, opacity = 0.75)])
fig.update_layout(title_text='Shares of different types of contracts')
fig.show()

55% of customers buy short-term contracts (1 month), shares of yearly contracts and 6-months contntracts are almost the same (21-24%) respectively. These numbers will influence the churn rate on the lifetime period.

Checking for outliers:

#Calculatinug the 95th and 99th percentiles for the number of contract period (setting out outliers). 
np.percentile(df['age'], [95, 99])

array([34., 37.])

#Calculatinug the 95th and 99th percentiles for the number of additional_charges (setting out outliers). 
np.percentile(df['avg_additional_charges_total'], [95, 99])

array([323.44087589, 400.99612505])

#Calculatinug the 95th and 99th percentiles for the number avg_class_frequency (setting out outliers). 
np.percentile(df['avg_class_frequency_total'], [95, 99])

array([3.53564837, 4.19757925])

The age of users vary vary from 18 to 41 years old. Middle 50% of users are from 27 to 31 years old.

Not more than 1% of users are over 37 yeras old. Thus users elder than 37 years old are outliers for our distribution.

Middle 50% of customers spend from USD 69 to 211 during their contract period. Only 1% of customers spend over USD 400.

Middle 50% of customers get to the gym 1-2.5 times a week. Only 5% of customers visit gym more than 3 times a week, and only 1% - more than 4 times a week.

Splitting the data in two groups: left and stayed customers ¶

back to top

previous part

The data split into customers that churn and those who did not:

#creating a table 
df_churn_split = df.groupby('churn').mean()
df_churn_split.round(2).T

Among those who left and who stayed the percentage of men and women is the same, gender do not impact the churn rate. Those who live far churn more. People that buy short contracts (1-6 months) tend to churn more. Younger people (27 years old on average) churn more; as well as those who have fewer visits per week and spend less money additionally in the gym.

Feature distributions for those who left (churn) and those who stayed ¶

back to top

previous part

Firstly we will have a general view on the features ditribution on pairplots and then look into distrubution of every feature individually.

#plotting general view for the non-boolean features
sns.pairplot(df[['contract_period', 'age', 'avg_additional_charges_total', 'month_to_end_contract', 
                'lifetime', 'avg_class_frequency_total', 'avg_class_frequency_current_month', 'churn']], hue='churn') 
plt.title('Bar histograms and feature distributions for those who churn and those who stayed')
plt.show()

General view of the pairplots prove the early conclusions. Features of those who churn more are as follows:

live far from gym,
younger (27 yeras on average),
are not emplyees of gym partners,
came alone,
bought shrt-term contracts,
did not have group visits,
did not buy additional services in gym,
have less than 2 months till the end of the contract,
visit gym not more than 1-2 times a week.

Next we will split the db and ajust them for visualisation of distribution and plot the charts.

#creating separete db for users that churn 
df_churn = df.query('churn==1').drop(['churn'], axis = 1) 
df_churn.shape

(1061, 13)

#creating separete db for users that didn't churn 
df_stayed = df.query('churn!=1').drop(['churn'], axis = 1) 
df_stayed.shape

(2939, 13)

df_categorical = df[['gender', 'age','near_location', 'partner', 'promo_friends', 'phone', 'group_visits', 'month_to_end_contract', 'lifetime', 'churn']]

#adding column for counting 
df_categorical.insert(0, "count", 1)

df_categorical['gender'] = df_categorical['gender'].replace(to_replace=0, value ="female").replace(to_replace=1, value ="male")
df_categorical['near_location'] = df_categorical['near_location'].replace(to_replace=0, value ="far").replace(to_replace=1, value ="near")
df_categorical['partner'] = df_categorical['partner'].replace(to_replace=0, value ="not_partner").replace(to_replace=1, value ="partner")
df_categorical['churn'] = df_categorical['churn'].replace(to_replace=0, value ="stayed").replace(to_replace=1, value ="left")
df_categorical['promo_friends'] = df_categorical['promo_friends'].replace(to_replace=0, value ="alone").replace(to_replace=1, value ="through_friends")
df_categorical['phone'] = df_categorical['phone'].replace(to_replace=0, value ="no_phone").replace(to_replace=1, value ="with_phone")
df_categorical['group_visits'] = df_categorical['group_visits'].replace(to_replace=0, value ="no_group_visits").replace(to_replace=1, value ="with_group_visits")

df_hist = df[['avg_additional_charges_total', 'avg_class_frequency_total', 'avg_class_frequency_current_month', 'churn']]

parameters = ['gender', 'age', 'near_location', 'partner',
              'promo_friends', 'phone', 'group_visits', 'month_to_end_contract', 'lifetime', 'churn']
parameters_dis = ['avg_additional_charges_total', 'avg_class_frequency_total', 'avg_class_frequency_current_month']

for x in parameters:
    plt.figure(figsize=(18,5))
    ax = sns.barplot(x=x, y="count", data=df_categorical, hue='churn', estimator=sum, palette="mako") 
    for p in ax.patches:
        ax.annotate(format(p.get_height(), '.0f'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 9), 
                   textcoords = 'offset points')
    
    plt.title('Distribution of {}'.format(x))        
    plt.show()
            
for x in parameters_dis:
    plt.figure(figsize=(18,5))
    sns.kdeplot(data=df_hist, x=x, hue="churn", hue_order=[1,0], alpha = 0.6 , palette="mako", fill=True)       
    plt.show()

Conclusion¶

The chart and the table gave us the following information:

Number of men is slightly more than women, churn rate is the same regardless gender.
Middle 50% of the customers are between 27 and 31 years old, the younger customers churn more. Starting from age 28-29 and up percentage of customers that stay is bigger than those who left.
Those who live far from the gyms churn more.
Likewise when users are employees of a partner company they chen less.
There munber of customers that were brought to the gym by their friends is almost twice smaller than the number of those who came alone. But the churn rate among those with friend is much lower.
Mostly customers tend to share the contact information with their gym; the churn rate among those without contact information is expectidly higher.
People that made no group visits churn more.
Generally the highest churn rate is among those who have 1 month left till the end of their contract. Next popular waves of churn are: 6 months, 5 months and 12 months (right after the signing of contract) prior to the end of contract.
Average level of charges is less among those who churn; people thay stay tend to spend more on gym additional gym services:cafe, athletic goods, cosmetics, massages, etc.
Middle 50% of customers spend from USD 69 to 211 during their contract period. Only 1% of customers spend over USD 400.
The higher frequency of the visits the less customers curn.

Correlation matrix ¶

back to top

previous part

#defining correlation between the features
corr = df.corr()
plt.figure(figsize=(18,18))
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, annot=True,
            cmap=sns.diverging_palette(220, 10, as_cmap=True))
plt.show()

The matrix shows that we have 2 pairs of itercorrelated features: month_to_end_contract with contract_period and avg_class_frequency_current_month with avg_class_frequency_current_total. One of the features in every couple can be omitted in the model.

Step 3. Build a model to predict user churn ¶

Dividing the data into train and validation sets ¶

#dropping extra columns of duplicated features and target variable column
X = df.drop(['churn', 'avg_class_frequency_current_month', 'contract_period'] , axis = 1)
y = df['churn']

X.columns

Index(['gender', 'near_location', 'partner', 'promo_friends', 'phone',
       'group_visits', 'age', 'avg_additional_charges_total',
       'month_to_end_contract', 'lifetime', 'avg_class_frequency_total'],
      dtype='object')

# divide the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Logistic Regression model ¶

lr_model = LogisticRegression(random_state=0)
# training the model
lr_model.fit(X_train, y_train)
# use the trained model to make predictions
lr_predictions = lr_model.predict(X_test)
lr_probabilities = lr_model.predict_proba(X_test)[:,1]

Random Forest model ¶

# defining the new model's algorithm based on the random forest algorithm
rf_model = RandomForestClassifier(n_estimators = 100, random_state=0)
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)
rf_probabilities = rf_model.predict_proba(X_test)[:,1]

def print_all_metrics(y_true, y_pred, y_proba, title = 'Classification metrics'):
    print(title)
    print('\tAccuracy: {:.2f}'.format(accuracy_score(y_true, y_pred)))
    print('\tPrecision: {:.2f}'.format(precision_score(y_true, y_pred)))
    print('\tRecall: {:.2f}'.format(recall_score(y_true, y_pred)))
    print('\tF1: {:.2f}'.format(f1_score(y_true, y_pred)))
    print('\tROC_AUC: {:.2f}'.format(roc_auc_score(y_true, y_proba)))

# printing all metrics for both models:
print_all_metrics(y_test, lr_predictions, lr_probabilities , title='Metrics for logistic regression:')
print_all_metrics(y_test, rf_predictions, rf_probabilities, title = 'Metrics for random forest:')

Metrics for logistic regression:
	Accuracy: 0.88
	Precision: 0.75
	Recall: 0.74
	F1: 0.75
	ROC_AUC: 0.94
Metrics for random forest:
	Accuracy: 0.90
	Precision: 0.80
	Recall: 0.77
	F1: 0.79
	ROC_AUC: 0.94

Now we will create a separate data base with models metrics results to visualise them and choo the better one.

#creating a function 
def metrics(y_true, y_pred, y_proba):
    all_metrics = []
    all_metrics.append(accuracy_score(y_true, y_pred))
    all_metrics.append(precision_score(y_true, y_pred))
    all_metrics.append(recall_score(y_true, y_pred))
    all_metrics.append(f1_score(y_true, y_pred))
    all_metrics.append(roc_auc_score(y_true, y_proba))
    return all_metrics

lr_metrics = metrics(y_test, lr_predictions, lr_probabilities)

rf_metrics = metrics(y_test, rf_predictions, rf_probabilities)

#Creating a db for visualisation of results:
metrics_list = ['accuracy_score', 'precision_score', 'recall_score', 'f1_score', 'roc_auc_score']
data={'metrics': metrics_list, 'LogisticRegression': lr_metrics, 'RandomForestClassifier':rf_metrics
        }
df_models = pd.DataFrame(data, columns = ['metrics', 'LogisticRegression', 'RandomForestClassifier'])

print (df_models)

           metrics  LogisticRegression  RandomForestClassifier
0   accuracy_score            0.875000                0.896250
1  precision_score            0.750000                0.801047
2     recall_score            0.742424                0.772727
3         f1_score            0.746193                0.786632
4    roc_auc_score            0.938186                0.939226

#visualizing results
x0 = df_models['LogisticRegression']
x1 = df_models['RandomForestClassifier']
labels = df_models['metrics']

x = np.arange(len(labels))
fig, ax = plt.subplots(figsize=(10,4))

x0.plot(kind="bar",color="teal",alpha=0.5, edgecolor='k').legend()
x1.plot(kind="bar", color="lightblue", alpha=0.5, edgecolor='k', linewidth=1).legend()

ax.set_ylabel('Scores')
ax.set_xlabel('Metrics', size=15)
ax.set_title('Evaluation of models results', size=20)
ax.set_xticks(x)
ax.set_xticklabels(labels, rotation=0)

plt.show()

Conclusion.¶

According to the models' metrics the random forest classifier model represents the target variable in a more accurate way as all the metrics give higher scores.

Step 4. User clusters ¶

back to top

previous part

Data Standardization ¶

# obligatory standardization of data before passing it to the algorithm
sc = StandardScaler()
X_sc = sc.fit_transform(X)

Model visualization ¶

# visualizing with dendrogram plots
linked = linkage(X_sc, method = 'ward') 

cmap = cm.rainbow(np.linspace(0, 1, 5))
hierarchy.set_link_color_palette([mpl.colors.rgb2hex(rgb[:3]) for rgb in cmap])

plt.figure(figsize=(15, 10))  
dendrogram(linked, orientation='top')
plt.axhline(linestyle='--', y=53)
plt.grid(axis='y'); plt.xlabel('leaves'); plt.ylabel('distances')
plt.title('Hierarchical clustering')
plt.show()

As the dendrograms cannot tell us how many clusters we should have, we have settled the number by ourselves (n=5). In the chart this place is marked with a discreet line. At this place the distance between clusters corresponds to the required number of clusters (5).
Generally speaking we cannot use the dendrogram as a tool for determining the number of clusters in data. But the dendrogram is most accurate at the bottom, showing which items (leaves) are very similar. Though in our case in is not readable we can get a general impession of the method.

K-Means Clustering ¶

back to top

As we have already standartized the data we are defining K-mreans right away:

km = KMeans(n_clusters = 5, random_state=0) # setting the number of clusters as 5
labels = km.fit_predict(X_sc) # applying the algorithm to the data and forming a cluster vector

# store cluster labels into the field of our dataset
df['cluster'] = labels
 
# print the statistics of the mean feature values per cluster
df_clusters = df.groupby(['cluster']).mean().reset_index()
df_clusters

The table clearly illustrates that mean values for the most of features differ between the clusters. We will plot the features distribution for every cluster to make it more obvious.

Features distribution for clusters ¶

back to top

First we will ajust the data base and then wisualise the distribution of features fgor every cluster.

#creating a separate db
df_cl_charts = df

#ajusting the db for visualisation
df_cl_charts['gender'] = df_cl_charts['gender'].replace(to_replace=0, value ="female").replace(to_replace=1, value ="male")
df_cl_charts['near_location'] = df_cl_charts['near_location'].replace(to_replace=0, value ="far").replace(to_replace=1, value ="near")
df_cl_charts['partner'] = df_cl_charts['partner'].replace(to_replace=0, value ="not_partner").replace(to_replace=1, value ="partner")
df_cl_charts['churn'] = df_cl_charts['churn'].replace(to_replace=0, value ="stayed").replace(to_replace=1, value ="left")
df_cl_charts['promo_friends'] = df_cl_charts['promo_friends'].replace(to_replace=0, value ="alone").replace(to_replace=1, value ="through_friends")
df_cl_charts['phone'] = df_cl_charts['phone'].replace(to_replace=0, value ="no_phone").replace(to_replace=1, value ="with_phone")
df_cl_charts['group_visits'] = df_cl_charts['group_visits'].replace(to_replace=0, value ="no_group_visits").replace(to_replace=1, value ="with_group_visits")
df_cl_charts.insert(0, "count", 1)

#plotting graphs

parameters_cat = ['gender', 'age', 'near_location', 'partner',
              'promo_friends', 'phone', 'group_visits', 'month_to_end_contract', 'lifetime', 'churn']
parameters_dis = ['avg_additional_charges_total', 'avg_class_frequency_total', 'avg_class_frequency_current_month']

for x in parameters_cat:
    plt.figure(figsize=(18,3))
    ax = sns.barplot(x=x, y="count", data=df_cl_charts, hue='cluster', estimator=sum, palette="mako") 
    for p in ax.patches:
        ax.annotate(format(p.get_height(), '.0f'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 9), 
                   textcoords = 'offset points')
    
    plt.title('Distribution of {} feature'.format(x))        
    plt.show()
    
for x in parameters_dis:
    plt.figure(figsize=(18,3))
    sns.kdeplot(data=df_cl_charts, x=x, hue="cluster", alpha = 0.4 , palette="mako", fill=True)
    plt.title('Distribution of {} feature'.format(x))
    plt.show()

Conclusion¶

The features have distributed among clusters the following way:

cluster 2 have only men; cluster 3 contains only women;
age is distributed evenly amonth clusters;
cluster 1 has only customers that live far; cluster 4 contains only customers that live near the gym;
cluster 4 does not include customers that are partners og the gym and got there though the friends;
cluster 0 contains customers that have not left their phone number to the gym; such customers are not included in other clusters;
group visits preferencies are olmost evenly distributed among all clusters.
lifetime and churn featurea re evenly distributed among clusters.
Cluster 0 is the smallest one and cluster 4 is the biggest.

Churn rate for clusters ¶

back to top

We are going to calculate and visualize the churn rate for every cluster.

#creating a table and calculating the rate
df_clusters_churn = df_clusters[['cluster','churn']].sort_values(by='churn', ascending=True).reset_index(drop=True)
df_clusters_churn['cluster'] = df_clusters_churn['cluster'].astype(str)
df_clusters_churn['churn'] = df_clusters_churn['churn']*100
df_clusters_churn

# plotting barh chart 
plt.figure(figsize=(9,5))
ax = sns.barplot(df_clusters_churn.cluster, df_clusters_churn.churn, palette='mako') 
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.set(xlabel="Clusters", ylabel='Churn rate')

ax.set_xticklabels(df_clusters_churn.cluster)
for item in ax.get_xticklabels(): item.set_rotation(0)
for i, v in enumerate(df_clusters_churn["churn"].iteritems()):        
    ax.text(i ,v[1], "{:.1f}%".format(v[1]), color='k', va ='bottom', rotation=0)

plt.title('Churn rate by clusters')
plt.show()

The higest churn rate is in chuster no. 1 (40% of customers churn), clusters 2 and 3 have the lowest churn rate (17.4%-17,5% of customers leave the gyms).
The RF model aggregared 5 groups but we can see that according to the churn rate only 4 have dramatic difference. Thus forming the marketing strategy we can split it into 4 general groups.

General conclusion and recommendations ¶

back to top

previous part

In course of data preprocessing we studied data with 4000 entries and 13 features.
We have:

checked the data for missing values and duplecates;
renamed the columns using clear names without upper case letters.

The age of users vary from 18 to 41 years old. Middle 50% of users are from 27 to 31 years old.
Not more than 1% of users are over 37 yeras old. Thus users elder than 37 years old are outliers for our distribution. Middle 50% of customers spend from USD 69 to 211 during their contract period. Only 1% of customers spend over USD 400. Middle 50% of customers get to the gym 1-2.5 times a week. Only 5% of customers visit gym more than 3 times a week, and only 1% - more than 4 times a week.
The following people tend to churn more: those who live far from gym, younger ones (27 yeras on average), are not emplyees of gym partners, came alone, bought short-term contracts, did not have group visits, did not buy additional services in gym, who have less than 2 months till the end of the contract, visit gym only 1-2 times a week.
We have defined correlation between the features and defined that we have 2 pairs of itercorrelated features: month_to_end_contract with contract_period and avg_class_frequency_current_month with avg_class_frequency_current_total. One of the features in every couple can be omitted in the model for ML and predictions.
We used two methods for creating ML model using unsupervised ML: Logistic Regression model and Random Forest model (RandomForestClassifier). All metrics proved that the second model is better in representing the target variable (all the metrics got higher scores).
Next we defined 5 clusters and looked into feature distribution for every cluster.
As to the churn rate as a target value, it is clearly distinguished between 4 out of 5 clusters. The higest churn rate is in chuster no. 1 (40% of customers churn), clusters 2 and 3 have the lowest churn rate (17.4%-17,5% of customers leave the gyms).

General recomendations:

Main marketring efforts should be aimed at the age group of 27 - 31 years old (they are the middle 50% of the customers). Only 5% of customers are elder than 34 years old.
Additional marketing efforts can be channaled into creating additional niche ang generating demand among thouse groups of potential customers that are not in the middle 50% of current customers:

18-27 years old,
over 34 years old.

As the highest churn rate is among those who have 1 month left till the end of their contract, additional individual offers should be made to the clients 2-1.5 months prior to the end of their contracts.

And 12 months before the end of the contract (right after the signing it) the gym can offer something to the client to help him turn gym visits into his new routine:

free individual session with instructor,
free consultation with a nutrition specialist,
composing an individual train program with instructor.

5 months and prior to the end of the contract the gym can offer some bonuses to increase client loyalty and encourage him/her maintain the sports routine:

free manual specialist session,
update of the individual training program according to the results of the previous period.

Additional efforts should be made to ecourage people to buy side gym services: cafe, athletic goods, cosmetics, massages, etc. As people get more positive experience, assosiate themselves with the gym more, with the certain lifestyle and finally churn less.
As the more frequency of the visits the less customers curn, the gym should encourage customets not to skip the training sessions: send phone reminders/calls, personnal contacts with instructors, some bonuse system for every month without skipping sessins.
As those who came with friends churn less, the marketing department should launch loyalty programms and "come with a friend" campaigns.

	gender	Near_Location	Partner	Promo_friends	Phone	Contract_period	Group_visits	Age	Avg_additional_charges_total	Month_to_end_contract	Lifetime	Avg_class_frequency_total	Avg_class_frequency_current_month
0	1	1	1	1	0	6	1	29	14.227470	5.0	3	0.020398	0.000000
1	0	1	0	0	1	12	1	31	113.202938	12.0	7	1.922936	1.910244
2	0	1	1	0	1	1	0	28	129.448479	1.0	2	1.859098	1.736502
3	0	1	1	1	1	12	1	33	62.669863	12.0	2	3.205633	3.357215
4	1	1	1	1	1	1	0	26	198.362265	1.0	3	1.113884	1.120078

	gender	Near_Location	Partner	Promo_friends	Phone	Contract_period	Group_visits	Age	Avg_additional_charges_total	Month_to_end_contract	Lifetime	Avg_class_frequency_total	Avg_class_frequency_current_month
29	0	1	1	1	1	6	1	27	159.148924	5.0	3	3.511568	3.528886
766	0	1	0	1	1	1	0	29	243.365528	1.0	7	2.431410	2.284297
1003	1	1	1	1	1	6	0	28	13.586564	6.0	1	3.316066	3.273928
2237	0	1	1	1	1	12	1	26	245.341986	12.0	5	3.342234	3.302304
2188	0	1	1	1	1	6	1	29	301.907485	6.0	5	1.025029	1.110048
129	0	1	0	0	1	12	1	28	5.865051	9.0	7	1.352800	1.316759

	gender	near_location	partner	promo_friends	phone	contract_period	group_visits	age	avg_additional_charges_total	month_to_end_contract	lifetime	avg_class_frequency_total	avg_class_frequency_current_month	churn
520	0	1	0	0	1	1	0	27	115.800883	1.0	1	1.029153	0.000000	1
3663	1	1	0	0	1	12	1	27	334.472656	7.0	0	3.684876	3.672933	0
2489	1	1	1	1	1	12	1	26	199.501236	12.0	5	0.000000	0.000000	0
2092	1	1	0	0	1	1	1	33	29.741681	1.0	1	2.407860	1.575433	1
130	1	1	1	1	1	12	0	28	243.820700	12.0	4	0.915233	0.850286	0
911	0	0	0	0	1	1	0	32	243.777010	1.0	2	0.643102	0.116537	1

	gender	near_location	partner	promo_friends	phone	contract_period	group_visits	age	avg_additional_charges_total	month_to_end_contract	lifetime	avg_class_frequency_total	avg_class_frequency_current_month	churn
count	4000.00	4000.00	4000.00	4000.00	4000.0	4000.00	4000.00	4000.00	4000.00	4000.00	4000.00	4000.00	4000.00	4000.00
mean	0.51	0.85	0.49	0.31	0.9	4.68	0.41	29.18	146.94	4.32	3.72	1.88	1.77	0.27
std	0.50	0.36	0.50	0.46	0.3	4.55	0.49	3.26	96.36	4.19	3.75	0.97	1.05	0.44
min	0.00	0.00	0.00	0.00	0.0	1.00	0.00	18.00	0.15	1.00	0.00	0.00	0.00	0.00
25%	0.00	1.00	0.00	0.00	1.0	1.00	0.00	27.00	68.87	1.00	1.00	1.18	0.96	0.00
50%	1.00	1.00	0.00	0.00	1.0	1.00	0.00	29.00	136.22	1.00	3.00	1.83	1.72	0.00
75%	1.00	1.00	1.00	1.00	1.0	6.00	1.00	31.00	210.95	6.00	5.00	2.54	2.51	1.00
max	1.00	1.00	1.00	1.00	1.0	12.00	1.00	41.00	552.59	12.00	31.00	6.02	6.15	1.00

churn	0	1
gender	0.51	0.51
near_location	0.87	0.77
partner	0.53	0.36
promo_friends	0.35	0.18
phone	0.90	0.90
contract_period	5.75	1.73
group_visits	0.46	0.27
age	29.98	26.99
avg_additional_charges_total	158.45	115.08
month_to_end_contract	5.28	1.66
lifetime	4.71	0.99
avg_class_frequency_total	2.02	1.47
avg_class_frequency_current_month	2.03	1.04

	cluster	gender	near_location	partner	promo_friends	phone	contract_period	group_visits	age	avg_additional_charges_total	month_to_end_contract	lifetime	avg_class_frequency_total	avg_class_frequency_current_month	churn
0	0	0.523316	0.862694	0.471503	0.305699	0.0	4.777202	0.427461	29.297927	144.208179	4.466321	3.940415	1.854211	1.723967	0.266839
1	1	0.499109	0.000000	0.488414	0.076649	1.0	3.032086	0.235294	28.721925	137.540009	2.853832	3.060606	1.770413	1.606619	0.399287
2	2	1.000000	0.995232	0.871275	0.641240	1.0	6.343266	0.467223	29.473182	148.877176	5.772348	4.205006	1.900802	1.816206	0.174017
3	3	0.000000	0.998800	0.912365	0.638655	1.0	6.186074	0.451381	29.302521	153.327058	5.665066	4.009604	1.941182	1.875403	0.175270
4	4	0.521361	1.000000	0.000000	0.002172	1.0	3.406951	0.422882	29.093411	146.503417	3.188993	3.470673	1.879344	1.749048	0.320058

	cluster	churn
0	2	17.401669
1	3	17.527011
2	0	26.683938
3	4	32.005793
4	1	39.928699