The aim of this project is to study the effectiveness of phone operators for the virtual telephony service CallMeMaybe.
To identify ineffective operators, we will study the following features:

  • number of missed incoming calls (internal and external),
  • waiting time for incoming calls,
  • number of outgoing calls,
  • number of internal calls between operators.

Also, we will study these features for each tariff plan the company has (A, B, C) and define ineffective operators for each plan.

We will test whether the difference in operator performance between tariff plans is statistically significant.
Finally, we will draw conclusions and develop recommendations on how to improve operator performance.


In [1]:
import pandas as pd
from IPython.display import display
import plotly.express as px
from plotly import graph_objects as go
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
from numpy import median
from scipy import stats as st
import seaborn as sns
import math 
from plotly.subplots import make_subplots

Part 1. Loading the data and preprocessing

Reading the data

In [2]:
#path adjustment for the Practicum platform:
#df=pd.read_csv('/telecom_clients_us.csv')
df_cl = pd.read_csv('./telecom_clients_us.csv')
display(df_cl.head())
user_id tariff_plan date_start
0 166713 A 2019-08-15
1 166901 A 2019-08-23
2 168527 A 2019-10-29
3 167097 A 2019-09-01
4 168193 A 2019-10-16
In [3]:
#path adjustment for the Practicum platform:
#df=pd.read_csv('/telecom_dataset_us.csv')
df_dataset = pd.read_csv('./telecom_dataset_us.csv')
display(df_dataset.head())
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration
0 166377 2019-08-04 00:00:00+03:00 in False NaN True 2 0 4
1 166377 2019-08-05 00:00:00+03:00 out True 880022.0 True 3 0 5
2 166377 2019-08-05 00:00:00+03:00 out True 880020.0 True 1 0 1
3 166377 2019-08-05 00:00:00+03:00 out True 880020.0 False 1 10 18
4 166377 2019-08-05 00:00:00+03:00 out False 880022.0 True 3 0 25

Checking and changing the data types

In [4]:
df_cl.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      732 non-null    int64 
 1   tariff_plan  732 non-null    object
 2   date_start   732 non-null    object
dtypes: int64(1), object(2)
memory usage: 58.0 KB
In [5]:
#changing data type
df_cl['date_start'] = pd.to_datetime(df_cl['date_start'])
In [6]:
df_cl.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   user_id      732 non-null    int64         
 1   tariff_plan  732 non-null    object        
 2   date_start   732 non-null    datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 35.8 KB
In [7]:
df_dataset.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53902 entries, 0 to 53901
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   user_id              53902 non-null  int64  
 1   date                 53902 non-null  object 
 2   direction            53902 non-null  object 
 3   internal             53785 non-null  object 
 4   operator_id          45730 non-null  float64
 5   is_missed_call       53902 non-null  bool   
 6   calls_count          53902 non-null  int64  
 7   call_duration        53902 non-null  int64  
 8   total_call_duration  53902 non-null  int64  
dtypes: bool(1), float64(1), int64(4), object(3)
memory usage: 7.3 MB
In [8]:
#'date' column
#changing data type (the +03:00 offset is parsed automatically)
df_dataset['date'] = pd.to_datetime(df_dataset['date'])
#getting rid of time zones
df_dataset['date']= df_dataset['date'].dt.tz_localize(None)
In [9]:
#note: astype(bool) coerces the 117 missing values in 'internal' to True
df_dataset['internal'] = df_dataset['internal'].astype(bool)
In [10]:
df_dataset.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53902 entries, 0 to 53901
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   user_id              53902 non-null  int64         
 1   date                 53902 non-null  datetime64[ns]
 2   direction            53902 non-null  object        
 3   internal             53902 non-null  bool          
 4   operator_id          45730 non-null  float64       
 5   is_missed_call       53902 non-null  bool          
 6   calls_count          53902 non-null  int64         
 7   call_duration        53902 non-null  int64         
 8   total_call_duration  53902 non-null  int64         
dtypes: bool(2), datetime64[ns](1), float64(1), int64(4), object(1)
memory usage: 4.2 MB

Changing the data type for columns (object to datetime and object to boolean) has significantly improved the memory usage: from 58.0 KB to 35.8 KB and from 7.3 MB to 4.2 MB for the two data sets. After dealing with the missing values we will also change the type of 'operator_id' column.
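The effect is easy to reproduce on a toy column: date strings stored as object dtype cost tens of bytes per element, while datetime64[ns] is a fixed 8 bytes per value. A minimal sketch on synthetic data (not our data sets):

```python
import pandas as pd

# a toy object-dtype column of date strings
dates_obj = pd.Series(['2019-08-15'] * 1000)
# the same data converted to datetime64[ns] (8 bytes per value)
dates_dt = pd.to_datetime(dates_obj)

obj_bytes = dates_obj.memory_usage(deep=True)
dt_bytes = dates_dt.memory_usage(deep=True)
# the datetime64 column is several times smaller
print(obj_bytes, dt_bytes)
```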

Checking for missing values

In [11]:
# checking for missing values in the data set
df_dataset.isnull().sum()
Out[11]:
user_id                   0
date                      0
direction                 0
internal                  0
operator_id            8172
is_missed_call            0
calls_count               0
call_duration             0
total_call_duration       0
dtype: int64
In [12]:
df_dataset['operator_id'].value_counts()
Out[12]:
901884.0    348
901880.0    337
893804.0    325
885876.0    319
891410.0    290
           ... 
926214.0      1
946468.0      1
960674.0      1
928284.0      1
891948.0      1
Name: operator_id, Length: 1092, dtype: int64
In [13]:
df_dataset['user_id'].nunique()
Out[13]:
307
In [14]:
df_cl['tariff_plan'].value_counts()
Out[14]:
C    395
B    261
A     76
Name: tariff_plan, dtype: int64
In [15]:
df_cl.isnull().sum()
Out[15]:
user_id        0
tariff_plan    0
date_start     0
dtype: int64
In [16]:
df_dataset.shape
Out[16]:
(53902, 9)
In [17]:
df_dataset.groupby(['date']).size().plot(
    kind='bar',
    grid=True,
    title='Number of data without NaNs',
    figsize=(25,5)
)
plt.show()

We can see a clear weekly pattern in the collected data: the amount of data collected on Saturdays and Sundays is significantly smaller than on the other days of the week.
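The weekend dip can also be verified numerically with `dt.dayofweek` instead of reading it off the bar chart; a sketch on a few hypothetical dates (the same groupby would apply to `df_dataset['date']`):

```python
import pandas as pd

# hypothetical call-log dates: 2019-08-05 is a Monday
dates = pd.to_datetime([
    '2019-08-05', '2019-08-06', '2019-08-07', '2019-08-08',
    '2019-08-09', '2019-08-09', '2019-08-10', '2019-08-11',
])
df = pd.DataFrame({'date': dates})

# Monday=0 ... Sunday=6
by_weekday = df.groupby(df['date'].dt.dayofweek).size()
weekend_rows = by_weekday.reindex([5, 6], fill_value=0).sum()
weekday_rows = by_weekday.reindex(range(5), fill_value=0).sum()
print(weekday_rows, weekend_rows)  # 6 2
```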

Studying missing values in 'operator_id'.

In [18]:
df_dataset.groupby(['user_id'])['operator_id'].nunique().reset_index().sort_values(by='operator_id', ascending=False) 
Out[18]:
user_id operator_id
264 168187 50
175 167626 48
161 167497 30
271 168252 28
248 168062 27
... ... ...
137 167265 0
281 168329 0
241 168024 0
236 168013 0
226 167981 0

307 rows × 2 columns

According to the table, users can have from 0 to 50 different operators.

In [19]:
users_operators = df_dataset.groupby(['user_id'])['operator_id'].nunique().reset_index()\
.groupby(['operator_id'])['user_id'].nunique().reset_index()
users_operators.columns = ['operators_cnt', 'users_cnt']
users_operators.head(3)
Out[19]:
operators_cnt users_cnt
0 0 17
1 1 107
2 2 63
In [20]:
users_operators.plot(x='operators_cnt', y='users_cnt', kind='bar', grid=True, figsize=(20,5), cmap='Set2', \
                     title='Number of users per operator count')

plt.show()

17 users have no operator assigned;
107 users have exactly 1 operator;
63 users have 2 operators;
only 1 user has as many as 50 operators.

In [21]:
#Calculating the 95th and 99th percentiles for the number of users.
np.percentile(users_operators['users_cnt'], [95, 99])
Out[21]:
array([58.65, 96.88])

Only 5% of operators have more than ~59 unique users; only 1% of operators have more than ~97 unique users.

In [22]:
#calculating percentage of missing values:
df_dataset.isnull().sum()/len(df_dataset)*100
Out[22]:
user_id                 0.000000
date                    0.000000
direction               0.000000
internal                0.000000
operator_id            15.160847
is_missed_call          0.000000
calls_count             0.000000
call_duration           0.000000
total_call_duration     0.000000
dtype: float64

The percentage of missing values in 'operator_id' is 15%. As one user can have more than 1 operator (the number of operators per user varies from 0 to 50), we cannot use the transform() method grouped by users to fill the missing values. We will have to drop this part of the data, as the missing values cannot be restored.
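A toy illustration of why a group-wise fill fails here (hypothetical user and operator IDs): when a user maps to several operators, transform() has no unambiguous value to propagate into the gap and would silently pick one:

```python
import numpy as np
import pandas as pd

# hypothetical records: user 1 works with two operators, one row lacks the ID
df = pd.DataFrame({
    'user_id':     [1, 1, 1, 2, 2],
    'operator_id': [900.0, 901.0, np.nan, 905.0, np.nan],
})

# more than one distinct operator per user makes the gap ambiguous
ops_per_user = df.groupby('user_id')['operator_id'].nunique()
print(ops_per_user.to_dict())  # {1: 2, 2: 1}

# transform('first') broadcasts each group's first non-null value,
# masking the ambiguity for user 1
filled = df.groupby('user_id')['operator_id'].transform('first')
print(filled.tolist())  # [900.0, 900.0, 900.0, 905.0, 905.0]
```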

In [23]:
df_dataset = df_dataset.dropna(axis=0, subset=['operator_id'])
In [24]:
df_dataset.isnull().sum()/len(df_dataset)*100
Out[24]:
user_id                0.0
date                   0.0
direction              0.0
internal               0.0
operator_id            0.0
is_missed_call         0.0
calls_count            0.0
call_duration          0.0
total_call_duration    0.0
dtype: float64
In [25]:
df_dataset['internal'].value_counts()
Out[25]:
False    39861
True      5869
Name: internal, dtype: int64

As we now have only a minor number of missing values in the 'internal' column (<2% of the data set), we can drop them and proceed with the data.

In [26]:
df_dataset = df_dataset.dropna(axis=0, subset=['internal'])

Now we can change the data type of the 'operator_id' column.

In [27]:
# changing type of the 'operator_id' column:
df_dataset['operator_id']= df_dataset['operator_id'].astype(int)
In [28]:
df_dataset.sample()
Out[28]:
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration
26211 167150 2019-10-16 out False 905566 False 5 296 404
In [29]:
print('There are', df_dataset['operator_id'].nunique(), 'unique operators in the amended data set.')
There are 1092 unique operators in the amended data set.
In [30]:
print('There are', df_dataset['user_id'].nunique(), 'unique users in the amended data set.')
There are 290 unique users in the amended data set.
In [31]:
#Checking data for zeros:
for i in df_dataset.columns:
    print(i, len(df_dataset[df_dataset[i]==0]))
user_id 0
date 0
direction 0
internal 39861
operator_id 0
is_missed_call 30212
calls_count 0
call_duration 15213
total_call_duration 955
In [32]:
#Checking data for zeros:
for i in df_cl.columns:
    print(i, len(df_cl[df_cl[i]==0]))
user_id 0
tariff_plan 0
date_start 0

We have a lot of zeros for internal calls and missed calls, as well as for call duration. The second data set contains no zeros at all.

Checking for duplicated data

In [33]:
#Checking for duplicated data
duplicated1=df_dataset.duplicated().sum()
print('There are', duplicated1, 'duplicates in the general data set.')
There are 4184 duplicates in the general data set.
In [34]:
#Checking for duplicated data
duplicated2=df_cl.duplicated().sum()
print('There are', duplicated2, 'duplicates in the additional data set (telecom_clients).')
There are 0 duplicates in the additional data set (telecom_clients).
In [35]:
#checking what kind of duplicates we have
duplicated_rows = df_dataset[df_dataset.duplicated()]
duplicated_rows.sample(15)
Out[35]:
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration
8637 166658 2019-09-23 out False 890420 False 11 5453 5574
23211 167078 2019-11-15 in False 935348 False 11 851 923
45850 168064 2019-11-25 in False 939762 False 24 1000 1458
11313 166691 2019-09-30 in False 889410 False 5 232 264
52809 168361 2019-11-19 out False 965540 False 26 2188 2817
49694 168187 2019-11-28 out False 937898 False 6 213 347
17634 166941 2019-09-01 in False 896382 False 3 242 280
9123 166658 2019-10-16 out False 891158 True 2 0 73
4013 166503 2019-09-03 out False 884412 True 13 0 374
32366 167466 2019-10-11 in False 921818 False 26 2411 2638
37715 167634 2019-10-07 in False 918436 False 1 37 45
198 166377 2019-09-12 out False 880028 False 16 1030 1161
31978 167445 2019-11-20 in False 920902 False 1 41 48
18404 166974 2019-10-21 out False 900354 False 19 1701 1945
8635 166658 2019-09-23 out False 890412 True 3 0 110
In [36]:
#checking the dates when the data was duplicated:
duplicated_rows['date'].dt.date.unique()
Out[36]:
array([datetime.date(2019, 8, 5), datetime.date(2019, 8, 14),
       datetime.date(2019, 8, 15), datetime.date(2019, 8, 19),
       datetime.date(2019, 8, 22), datetime.date(2019, 8, 23),
       datetime.date(2019, 8, 27), datetime.date(2019, 8, 29),
       datetime.date(2019, 9, 3), datetime.date(2019, 9, 5),
       datetime.date(2019, 9, 10), datetime.date(2019, 9, 11),
       datetime.date(2019, 9, 12), datetime.date(2019, 9, 16),
       datetime.date(2019, 9, 23), datetime.date(2019, 9, 25),
       datetime.date(2019, 9, 26), datetime.date(2019, 9, 27),
       datetime.date(2019, 10, 3), datetime.date(2019, 10, 4),
       datetime.date(2019, 10, 8), datetime.date(2019, 10, 11),
       datetime.date(2019, 10, 14), datetime.date(2019, 10, 18),
       datetime.date(2019, 10, 22), datetime.date(2019, 10, 23),
       datetime.date(2019, 10, 28), datetime.date(2019, 10, 29),
       datetime.date(2019, 10, 31), datetime.date(2019, 11, 1),
       datetime.date(2019, 11, 5), datetime.date(2019, 11, 12),
       datetime.date(2019, 11, 13), datetime.date(2019, 11, 20),
       datetime.date(2019, 11, 21), datetime.date(2019, 11, 22),
       datetime.date(2019, 11, 25), datetime.date(2019, 10, 24),
       datetime.date(2019, 11, 11), datetime.date(2019, 11, 18),
       datetime.date(2019, 11, 28), datetime.date(2019, 10, 2),
       datetime.date(2019, 8, 18), datetime.date(2019, 8, 21),
       datetime.date(2019, 8, 30), datetime.date(2019, 8, 31),
       datetime.date(2019, 9, 6), datetime.date(2019, 9, 7),
       datetime.date(2019, 9, 17), datetime.date(2019, 9, 18),
       datetime.date(2019, 9, 24), datetime.date(2019, 10, 7),
       datetime.date(2019, 10, 9), datetime.date(2019, 10, 10),
       datetime.date(2019, 10, 15), datetime.date(2019, 10, 16),
       datetime.date(2019, 10, 17), datetime.date(2019, 10, 21),
       datetime.date(2019, 10, 30), datetime.date(2019, 11, 4),
       datetime.date(2019, 11, 14), datetime.date(2019, 11, 15),
       datetime.date(2019, 11, 16), datetime.date(2019, 11, 23),
       datetime.date(2019, 11, 24), datetime.date(2019, 11, 27),
       datetime.date(2019, 8, 6), datetime.date(2019, 8, 7),
       datetime.date(2019, 8, 8), datetime.date(2019, 8, 9),
       datetime.date(2019, 8, 12), datetime.date(2019, 8, 13),
       datetime.date(2019, 9, 1), datetime.date(2019, 9, 2),
       datetime.date(2019, 9, 8), datetime.date(2019, 9, 19),
       datetime.date(2019, 9, 30), datetime.date(2019, 11, 8),
       datetime.date(2019, 11, 7), datetime.date(2019, 11, 19),
       datetime.date(2019, 10, 25), datetime.date(2019, 11, 26),
       datetime.date(2019, 8, 16), datetime.date(2019, 8, 20),
       datetime.date(2019, 8, 26), datetime.date(2019, 8, 28),
       datetime.date(2019, 9, 9), datetime.date(2019, 9, 20),
       datetime.date(2019, 10, 1), datetime.date(2019, 9, 13),
       datetime.date(2019, 11, 6), datetime.date(2019, 8, 24),
       datetime.date(2019, 8, 25), datetime.date(2019, 9, 29),
       datetime.date(2019, 10, 19), datetime.date(2019, 10, 26),
       datetime.date(2019, 11, 2), datetime.date(2019, 10, 27),
       datetime.date(2019, 9, 15), datetime.date(2019, 10, 6),
       datetime.date(2019, 8, 17), datetime.date(2019, 9, 4),
       datetime.date(2019, 9, 14), datetime.date(2019, 9, 22),
       datetime.date(2019, 10, 12), datetime.date(2019, 10, 13),
       datetime.date(2019, 11, 3), datetime.date(2019, 10, 20),
       datetime.date(2019, 9, 21), datetime.date(2019, 9, 28),
       datetime.date(2019, 10, 5), datetime.date(2019, 11, 10),
       datetime.date(2019, 11, 17), datetime.date(2019, 11, 9)],
      dtype=object)
In [37]:
print('General data represents data for the period from', df_dataset['date'].min(), 'till', df_dataset['date'].max(), \
      '.\nThe duplicated rows came across the data set from the period', duplicated_rows['date'].min(), 'till', \
      duplicated_rows['date'].max(), '.' )
General data represents data for the period from 2019-08-02 00:00:00 till 2019-11-28 00:00:00 .
The duplicated rows came across the data set from the period 2019-08-05 00:00:00 till 2019-11-28 00:00:00 .

This means there are duplicates throughout the whole period. Presumably they were produced by a recurring server problem that kept generating duplicated records.

In [38]:
#checking the operators whose data was duplicated:
duplicated_rows['operator_id'].value_counts()
Out[38]:
885876    36
893804    36
901880    35
891410    29
891414    28
          ..
919956     1
899268     1
891254     1
952656     1
895266     1
Name: operator_id, Length: 776, dtype: int64
In [39]:
#checking which columns are affected by the duplicated data:
for i in duplicated_rows:
    print(i, ":", duplicated_rows[i].nunique())
user_id : 240
date : 114
direction : 2
internal : 2
operator_id : 776
is_missed_call : 2
calls_count : 170
call_duration : 1543
total_call_duration : 1787
In [40]:
df_dataset.query('operator_id==891410')
Out[40]:
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration
4353 166511 2019-08-20 out False 891410 True 1 0 8
4354 166511 2019-08-20 out True 891410 True 1 0 0
4356 166511 2019-08-20 out False 891410 False 2 37 65
4361 166511 2019-08-21 out True 891410 False 2 21 30
4362 166511 2019-08-21 in False 891410 False 4 73 126
... ... ... ... ... ... ... ... ... ...
5189 166511 2019-11-27 out False 891410 True 12 0 45
5197 166511 2019-11-28 in False 891410 False 7 1634 1723
5199 166511 2019-11-28 out False 891410 True 8 0 114
5202 166511 2019-11-28 out False 891410 False 29 3429 3612
5204 166511 2019-11-28 in False 891410 False 7 1634 1723

290 rows × 9 columns

So far we cannot see any pattern in the duplicates: they cover different users, dates (114 days from 2019-08-05 till 2019-11-28), and operators (776 unique operators have duplicates in their data).

In [41]:
# length before dropping duplicates
length1 = len(df_dataset)

# dropping every row that has a duplicate (keep=False removes both copies)
df_dataset.drop_duplicates(keep=False, inplace=True)

# length after removing duplicates
length2 = len(df_dataset)

# printing both data frame lengths
print('Length before dropping duplicates:', length1, '\nLength after dropping duplicates:', length2)
Length before dropping duplicates: 45730 
Length after dropping duplicates: 37362
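Note that keep=False removes every occurrence of a duplicated row, the "original" together with its copy, which is why the table shrank by 8368 rows (2 × 4184) rather than by 4184. A minimal sketch of the difference, on toy rows:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 3]})

# keep='first' (the default) retains one representative of each duplicated row
kept_first = df.drop_duplicates()
# keep=False drops every row that has a duplicate
kept_none = df.drop_duplicates(keep=False)
print(len(kept_first), len(kept_none))  # 3 2
```

If the duplicates were assumed to be accidental copies of real calls, keep='first' would preserve one valid record per pair; keep=False treats both copies as suspect and discards them.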
In [42]:
print('Final data includes: \nunique users:', df_dataset['user_id'].nunique(),\
      '\nunique operators:', df_dataset['operator_id'].nunique())
Final data includes: 
unique users: 288 
unique operators: 1082

Conclusion

The data for analysis is split between 2 tables: telecom_dataset_us and telecom_clients. The first table is the main one. It contains:

  • 9 columns;
  • 53902 entries;
  • data on calls from 2019-08-02 till 2019-11-28;
  • unique users - 307;
  • unique operators - 1092;
  • and tariff_plans A, B and C.

    The data contained missing values in the following columns:
    • internal: 0.21%
    • operator_id: 15.16%

As one user can have more than 1 operator (the number of operators per user varies from 0 to 50), we could not use the transform() method grouped by users to fill the missing values. We had no option but to drop this part of the data, as the missing values cannot be restored.
As the number of missing values in the 'internal' column was minor (<2% of the data set), we dropped them without skewing the data.
We determined that there are 4184 duplicated rows in the general data set. Studying the duplicated rows, we found no pattern: duplicates occur throughout the whole period. Presumably this happened due to a server problem that continuously produced duplicated data.

All in all during preprocessing we have:

  • changed the data type for dates, which improved memory usage from 7.3 MB to 4.2 MB for the main data set;
  • dropped the rows with missing values in two columns;
  • changed the data type of the 'operator_id' column to int.

    The final data set contains:
    • 37362 entries;
    • data on calls from 2019-08-02 till 2019-11-28;
    • 288 unique users;
    • 1082 unique operators.

Part 2. Exploratory data analysis (EDA)

Statistical summary of the data

In [43]:
#statistical summary for numerical variables
df_dataset.drop(['user_id', 'operator_id'], axis = 1).describe().round(2)
Out[43]:
calls_count call_duration total_call_duration
count 37362.0 37362.00 37362.00
mean 16.9 1009.77 1320.92
std 60.6 4123.89 4850.72
min 1.0 0.00 0.00
25% 1.0 0.00 67.00
50% 4.0 105.00 286.00
75% 13.0 768.00 1098.00
max 4817.0 144395.00 166155.00
In [44]:
df_dataset['direction'].value_counts()
Out[44]:
out    25909
in     11453
Name: direction, dtype: int64

Correlation matrix

In [45]:
#defining correlation between the features
corr = df_dataset.drop(['user_id', 'operator_id'] , axis = 1).corr()
plt.figure(figsize=(10,8))
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, annot=True,
            cmap=sns.diverging_palette(220, 10, as_cmap=True))
plt.show()

The matrix shows 2 pairs of intercorrelated features: call_duration with total_call_duration, and calls_count with total_call_duration. We are going to replace the call_duration and total_call_duration features with the average waiting time. This way we will get a set of 4 uncorrelated features and will use them as criteria of effectiveness.
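The substitution can be sketched on a single toy row: waiting time is the gap between total call duration and talk time, averaged over the calls in the row. The same formula is applied to the real data later in this notebook; the numbers here are hypothetical:

```python
import pandas as pd

# hypothetical aggregate for one operator
row = pd.DataFrame({
    'calls_count':         [4],
    'call_duration':       [100],  # talk time, sec
    'total_call_duration': [148],  # talk time + waiting time, sec
})

row['waiting_time'] = row['total_call_duration'] - row['call_duration']
row['avg_waiting_time'] = row['waiting_time'] / row['calls_count']
print(row['avg_waiting_time'].iloc[0])  # 12.0
```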

missed incoming calls (internal and external)

We will define ineffective operators according to the first parameter: the number of missed incoming calls.

In [46]:
missed_calls = df_dataset.query('direction=="in"').groupby('operator_id')['is_missed_call'].sum().reset_index().sort_values(by='is_missed_call', ascending=False)
missed_calls['is_missed_call'] = missed_calls['is_missed_call'].astype(int)
missed_calls
Out[46]:
operator_id is_missed_call
284 913942 27
22 885890 20
561 940588 15
587 944226 13
21 885876 13
... ... ...
295 914848 0
296 914870 0
298 915614 0
299 915958 0
738 973286 0

739 rows × 2 columns

In [47]:
plt.figure(figsize=(15,3))
missed_calls['is_missed_call'].plot.hist(density=True, bins= 27, color='#cce1e1', log=True)

plt.title('Number of missed calls')
plt.show()
In [48]:
# box-and-whisker plot
plt.figure(figsize=(15,3))
ax = sns.boxplot(x="is_missed_call", data=missed_calls, color='#cce1e1') 
ax.set(xlabel='number of missed calls')
plt.show()
In [49]:
#Calculating the 95th and 99th percentiles for the number of missed calls (setting out outliers).
np.percentile(missed_calls['is_missed_call'], [95, 99])
Out[49]:
array([ 4., 11.])
In [50]:
missed_calls['is_missed_call'].describe()
Out[50]:
count    739.000000
mean       0.847091
std        2.220172
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max       27.000000
Name: is_missed_call, dtype: float64
In [51]:
#operators with missed incoming calls:
operators_with_missed = missed_calls.query('is_missed_call>0')['operator_id']
operators_with_missed.values.tolist()
Out[51]:
[913942,
 885890,
 940588,
 944226,
 885876,
 913938,
 919554,
 940630,
 944222,
 944216,
 944220,
 901880,
 944218,
 931458,
 951508,
 940622,
 921818,
 908960,
 937956,
 940652,
 893804,
 952948,
 925922,
 905104,
 921318,
 940634,
 899250,
 940616,
 940658,
 882686,
 930020,
 938896,
 906404,
 919204,
 919206,
 918978,
 939706,
 921306,
 906408,
 948286,
 906410,
 882684,
 939370,
 937958,
 893402,
 905862,
 906396,
 919166,
 920706,
 958460,
 906406,
 919164,
 905542,
 900354,
 882680,
 915556,
 884408,
 903318,
 912296,
 916424,
 900892,
 887276,
 890406,
 920728,
 940614,
 929424,
 911102,
 896536,
 937368,
 903312,
 944764,
 899268,
 923526,
 905538,
 947304,
 923528,
 937732,
 906392,
 937860,
 937960,
 937736,
 904016,
 904038,
 900746,
 937962,
 906070,
 945046,
 905300,
 954650,
 937780,
 906412,
 964666,
 940624,
 934534,
 890404,
 889754,
 910226,
 888534,
 888532,
 911138,
 908958,
 911140,
 934098,
 911310,
 912010,
 940442,
 913788,
 882690,
 891410,
 902510,
 907952,
 907502,
 938414,
 906680,
 904044,
 911142,
 939376,
 937864,
 919218,
 944766,
 912684,
 906400,
 917850,
 944646,
 939476,
 948288,
 902910,
 939478,
 948756,
 940430,
 948758,
 902532,
 939708,
 913886,
 920902,
 909134,
 937868,
 937902,
 920666,
 917852,
 919794,
 905564,
 938614,
 918436,
 918888,
 941844,
 937898,
 910540,
 919552,
 911136,
 905430,
 945052,
 937966,
 907982,
 937874,
 920852,
 939224,
 914440,
 921320,
 894232,
 958452,
 958434,
 927912,
 927910,
 958416,
 893420,
 894120,
 894224,
 894230,
 921814,
 954750,
 957922,
 935958,
 926872,
 926490,
 895574,
 895576,
 895776,
 936514,
 955032,
 935348,
 928382,
 958640,
 959312,
 932460,
 932798,
 933070,
 933986,
 969272,
 934074,
 934076,
 930242,
 888406,
 964510,
 888868,
 929876,
 890228,
 890582,
 890618,
 891414,
 928888,
 960620,
 928886,
 896012,
 894614,
 925826,
 937432,
 899082,
 924936,
 924930,
 900194,
 900352,
 924928,
 900492,
 924546,
 900788,
 937760,
 937762,
 900894,
 951490,
 937808,
 901884,
 924948,
 937750,
 897872,
 924960,
 924954,
 898414,
 896538,
 898422]
In [52]:
operators_with_missed.shape
Out[52]:
(230,)

In general, the 739 operators have from 0 to 27 missed calls during the period; the middle 50% of them have 0 to 1 missed calls.
Only 5% of operators have more than 4 missed calls, and only 1% have more than 11. 509 of the 739 operators have no missed incoming calls.

We have obtained a list of 230 operator IDs that had missed incoming calls during this period.

waiting time

In [53]:
waiting_calls = df_dataset.query('direction=="in"')
#an alternative filter would be: 'direction=="in" and is_missed_call==True'
waiting_calls.head()
Out[53]:
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration
26 166377 2019-08-12 in False 880028 False 1 407 411
34 166377 2019-08-13 in False 880028 False 1 88 102
37 166377 2019-08-14 in False 880026 False 2 197 218
39 166377 2019-08-14 in False 880028 False 1 33 37
49 166377 2019-08-15 in False 880028 False 1 23 27
In [54]:
waiting_calls_operators = df_dataset.query('direction=="in"').groupby('operator_id')\
.agg({'calls_count': 'sum', 'call_duration': 'sum', 'total_call_duration': 'sum'})\
.reset_index().sort_values(by='calls_count', ascending=False)

waiting_calls_operators.sample()
Out[54]:
operator_id calls_count call_duration total_call_duration
36 888538 1 566 590
In [55]:
waiting_calls_operators['waiting_time'] = waiting_calls_operators['total_call_duration'] - waiting_calls_operators['call_duration']
waiting_calls_operators['avg_waiting_time'] = waiting_calls_operators['waiting_time']/waiting_calls_operators['calls_count']
waiting_calls_operators.sample()
Out[55]:
operator_id calls_count call_duration total_call_duration waiting_time avg_waiting_time
51 890618 68 6965 7044 79 1.161765
In [56]:
waiting_calls_operators['avg_waiting_time'].describe()
Out[56]:
count    739.000000
mean      17.571811
std       12.173860
min        0.686391
25%        9.387500
50%       14.428571
75%       21.937220
max      115.500000
Name: avg_waiting_time, dtype: float64
In [57]:
plt.figure(figsize=(15,3))
waiting_calls_operators['avg_waiting_time'].plot.hist(density=True, bins= 50, color='#cce1e1', log=True)
plt.title('Distribution of average waiting time among operators')
plt.xlabel('Waiting time, sec.')
plt.show()

The distribution of average waiting time is skewed, with a long tail to the right. There are many outliers, but most of the distribution lies within 1 minute. Average waiting time varies greatly, from under 1 sec to 115 sec. The median value is 14 sec, with a standard deviation of 12. The middle 50% of operators keep callers waiting for 9 to 22 seconds on average.

Z-score

We will measure how successful each operator is relative to the mean average waiting time. To do that, we will calculate the z-score for the distribution: a numerical measurement that describes a value's relationship to the mean of a group of values. If the z-score is 0, the operator's average waiting time is identical to the mean.
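A minimal worked example of the formula used here, z = (x − mean) / std, on hypothetical waiting times for five operators:

```python
import pandas as pd

# hypothetical average waiting times, sec (mean = 16)
wait = pd.Series([10.0, 12.0, 14.0, 18.0, 26.0])

# pandas .std() uses the sample standard deviation (ddof=1), here sqrt(40)
z = (wait - wait.mean()) / wait.std()
print(z.round(2).tolist())  # [-0.95, -0.63, -0.32, 0.32, 1.58]
```

Operators slower than the mean get a positive z-score, and faster ones a negative z-score; that sign is what we use to flag the least successful operators.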

In [58]:
#calculating the z-score of average waiting time for each operator
waiting_calls_operators['waiting_time_z'] = (
    waiting_calls_operators['avg_waiting_time'] - waiting_calls_operators['avg_waiting_time'].mean()
) / waiting_calls_operators['avg_waiting_time'].std()
waiting_calls_operators['operator_id'] = waiting_calls_operators['operator_id'].astype(str)
waiting_calls_operators.head(3)
Out[58]:
operator_id calls_count call_duration total_call_duration waiting_time avg_waiting_time waiting_time_z
283 913938 4311 316681 352891 36210 8.399443 -0.753448
661 952948 2972 334958 339609 4651 1.564939 -1.314856
284 913942 2293 192529 220098 27569 12.023114 -0.455788

Now we will plot part of the distribution (operators with z-scores close to 0) to illustrate the calculation.

In [59]:
#distinguishing z-score by colors: teal for negative values (shorter waiting time), red for positive ones (longer waiting time)
waiting_calls_operators_z = waiting_calls_operators.query('-0.05 <=waiting_time_z<=0.08')\
.sort_values(by='avg_waiting_time', ascending=False).reset_index(drop=True)
waiting_calls_operators_z['colors'] = ['teal' if x<0 else 'red' for x in waiting_calls_operators_z['waiting_time_z']]
waiting_calls_operators_z['colors']
 
plt.figure(figsize=(14,10))
plt.hlines(y=waiting_calls_operators_z.operator_id, xmin=0, xmax=waiting_calls_operators_z.waiting_time_z, colors=waiting_calls_operators_z.colors, alpha=0.4, linewidth=10)
plt.title('Operators average waiting time relation to mean value of the waiting time \n(z-score)', fontsize=18)
plt.xlabel('z-score', fontsize=18)
plt.ylabel('Operators IDs', fontsize=16)

plt.show()

Now we can create a list of the least successful operator IDs, those with the longest average waiting times (positive z-scores):

In [60]:
operators_max_waiting = waiting_calls_operators.query('waiting_time_z>=0')['operator_id']
operators_max_waiting = operators_max_waiting.astype(int).values.tolist()
operators_max_waiting
Out[60]:
[919790,
 919554,
 919794,
 919552,
 908958,
 919792,
 908960,
 899082,
 906406,
 939370,
 887276,
 906404,
 951508,
 917852,
 921306,
 906070,
 911138,
 954750,
 907502,
 906410,
 944218,
 906408,
 893420,
 902532,
 944220,
 944216,
 911136,
 909134,
 944222,
 906680,
 882690,
 944226,
 896536,
 912010,
 910540,
 901886,
 934534,
 906396,
 906400,
 910018,
 923528,
 902510,
 905542,
 937732,
 904044,
 937962,
 944210,
 958460,
 938414,
 969272,
 936514,
 921320,
 971102,
 901350,
 911140,
 906398,
 888534,
 891906,
 907982,
 892532,
 905564,
 920706,
 924960,
 907952,
 900194,
 916618,
 907224,
 904016,
 920982,
 951506,
 920902,
 907994,
 935958,
 920874,
 893172,
 934428,
 962268,
 914440,
 899268,
 901894,
 891900,
 884408,
 910516,
 903512,
 918436,
 924928,
 944228,
 887282,
 924956,
 951492,
 895786,
 912684,
 900746,
 902910,
 934430,
 906076,
 908834,
 927070,
 958434,
 944644,
 958430,
 892534,
 898558,
 901492,
 905300,
 937368,
 924930,
 937374,
 894656,
 940458,
 925106,
 906394,
 919994,
 924950,
 924952,
 924954,
 911102,
 933832,
 948758,
 888868,
 898560,
 939478,
 884412,
 919214,
 892536,
 891946,
 922356,
 896538,
 909768,
 907970,
 911142,
 910958,
 958432,
 924948,
 921102,
 888532,
 894120,
 895782,
 934076,
 956484,
 908152,
 915958,
 925808,
 926490,
 896010,
 937742,
 958456,
 882688,
 944246,
 888406,
 924940,
 891646,
 892538,
 954290,
 924936,
 929340,
 947816,
 886674,
 938432,
 907174,
 951332,
 906402,
 939212,
 958478,
 896018,
 892530,
 934432,
 932500,
 921104,
 907440,
 924946,
 934098,
 952462,
 914842,
 904200,
 944474,
 956080,
 958468,
 944560,
 944648,
 904204,
 909392,
 928092,
 907504,
 917252,
 924934,
 913780,
 891744,
 901498,
 901038,
 937362,
 895140,
 951118,
 948284,
 934424,
 933996,
 950972,
 945058,
 914426,
 909624,
 908098,
 908130,
 924932,
 888540,
 964712,
 891918,
 969288,
 970254,
 882478,
 937366,
 882476,
 961552,
 899892,
 908162,
 897872,
 955164,
 904372,
 970244,
 899898,
 968368,
 905842,
 960296,
 964518,
 886892,
 888536,
 958470,
 952292,
 890226,
 973286,
 917680,
 932304,
 911152,
 911214,
 913474,
 918956,
 930510,
 914848,
 915614,
 940508,
 935244,
 935956,
 960294,
 940512,
 892798,
 920930,
 922114,
 923254,
 924662,
 972412,
 936382,
 970252,
 970240,
 969380,
 888538,
 929228,
 899912,
 907178,
 907618,
 907986,
 946468,
 947612,
 952678,
 899900,
 899906,
 908300,
 955436,
 900792,
 941628,
 905478,
 952916]
In [61]:
len(operators_max_waiting)
Out[61]:
270

This way we have received 270 operator IDs whose average waiting time is at or above the overall mean.
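The same "keep everything on one side of the mean" filter is applied three times in this section (waiting time, outgoing calls, internal calls). A reusable helper could factor the pattern out; a minimal sketch on toy data (the helper name and values are ours, not part of the project):

```python
import pandas as pd

def flag_by_zscore(series, above=True):
    """Return the index labels whose z-score is >= 0 (at or above the mean)
    or < 0 (below the mean)."""
    z = (series - series.mean()) / series.std()
    mask = z >= 0 if above else z < 0
    return series[mask].index.tolist()

# toy data: calls per operator, indexed by operator id
calls = pd.Series([10, 1, 2, 11], index=['a', 'b', 'c', 'd'])
print(flag_by_zscore(calls))               # → ['a', 'd'] (at or above the mean)
print(flag_by_zscore(calls, above=False))  # → ['b', 'c'] (below the mean)
```

Applied to the raw per-operator averages, this reproduces the `z >= 0` / `z <= 0` queries used below without recomputing the z-column by hand each time.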

outgoing calls

A small number of outgoing calls is a sign of an ineffective operator.

In [62]:
outgoing_calls = df_dataset.query('direction=="out"').groupby('operator_id')['calls_count'].sum().reset_index().sort_values(by='calls_count', ascending=False)
outgoing_calls.sample()
Out[62]:
operator_id calls_count
788 955094 2
In [63]:
plt.figure(figsize=(15,3))
outgoing_calls['calls_count'].plot.hist(density=True, bins= 50, color='#cce1e1', log=True)
plt.title('Number of outgoing calls')
plt.xlabel('number of calls')
plt.show()
In [64]:
# box-and-whisker plot
plt.figure(figsize=(15,3))
ax = sns.boxplot(x="calls_count", data=outgoing_calls, color='#cce1e1') 
ax.set(xlabel='number of outgoing calls')
plt.show()
In [65]:
outgoing_calls['calls_count'].describe()
Out[65]:
count      873.000000
mean       626.967927
std       2824.727767
min          1.000000
25%         10.000000
50%         84.000000
75%        533.000000
max      53301.000000
Name: calls_count, dtype: float64
In [66]:
#Calculating the 95th and 99th percentiles for the number of outgoing calls (to single out outliers). 
np.percentile(outgoing_calls['calls_count'], [95, 99])
Out[66]:
array([2153.  , 4732.44])

In general, 873 operators made from 1 to 53301 outgoing calls during the period. The middle 50% of them made from 10 to 533 outgoing calls, with a median value of 84 calls.
Only 5% of operators made over 2153 outgoing calls, and only 1% made over 4732.

Z-score

We will define how successful every operator is in relation to the mean number of calls he/she performed. To do that we will calculate the z-score for the distribution: a numerical measurement that describes a value's relationship to the mean of a group of values. If the z-score is 0, the operator's number of calls is identical to the mean.

In [67]:
#counting z-score for number of outgoing calls to define distribution 
outgoing_calls['calls_count_z']=(outgoing_calls['calls_count']-outgoing_calls['calls_count'].mean())/outgoing_calls['calls_count'].std()
outgoing_calls['operator_id'] = outgoing_calls['operator_id'].astype(str)
outgoing_calls.head(3)
Out[67]:
operator_id calls_count calls_count_z
24 885890 53301 18.647472
23 885876 51977 18.178754
462 929428 22936 7.897764

Now we will plot part of the distribution (operators with z-score close to 0) to illustrate the calculations.

In [68]:
#distinguishing z-score by colors: red for negative values, teal for positive ones 
outgoing_calls_z = outgoing_calls.query('-0.03 <=calls_count_z<=0.05').sort_values(by='calls_count', ascending=False).reset_index(drop=True)
outgoing_calls_z['colors'] = ['red' if x<0 else 'teal' for x in outgoing_calls_z['calls_count_z']]
outgoing_calls_z['colors']
 
plt.figure(figsize=(14,10))
plt.hlines(y=outgoing_calls_z.operator_id, xmin=0, xmax=outgoing_calls_z.calls_count_z, colors=outgoing_calls_z.colors, alpha=0.4, linewidth=10)
plt.title('Operators calls number relation to mean value of the call number \n(z-score)', fontsize=18)
plt.xlabel('z-score', fontsize=18)
plt.ylabel('Operators IDs', fontsize=16)

plt.show()

Now we can create a list of the least effective operators' IDs, those who make the fewest outgoing calls (a z-score of zero or below):

In [69]:
operators_min_calls = outgoing_calls.query('calls_count_z<=0')['operator_id']
operators_min_calls = operators_min_calls.astype(int).values.tolist()
operators_min_calls 
Out[69]:
[940622,
 919390,
 953462,
 937780,
 941384,
 882684,
 923528,
 918436,
 928886,
 965542,
 938080,
 883940,
 939738,
 938074,
 939476,
 952914,
 906404,
 899788,
 937708,
 952114,
 901992,
 939474,
 924934,
 930020,
 937854,
 947304,
 905870,
 958468,
 905432,
 921306,
 925826,
 900826,
 906394,
 939370,
 912296,
 947652,
 891154,
 938072,
 928888,
 891988,
 947644,
 900354,
 958418,
 948182,
 913942,
 937788,
 965538,
 907986,
 906400,
 906396,
 896382,
 919128,
 925828,
 905574,
 913938,
 948758,
 958452,
 899964,
 927916,
 941390,
 952116,
 954642,
 914050,
 939218,
 962902,
 912722,
 940438,
 937862,
 947598,
 940596,
 899968,
 890422,
 944222,
 891160,
 905932,
 906412,
 918452,
 930914,
 903292,
 944646,
 947640,
 960294,
 936382,
 914238,
 888534,
 909910,
 919126,
 928518,
 932836,
 947600,
 958478,
 933806,
 958416,
 940614,
 919910,
 906294,
 919218,
 962658,
 905930,
 900352,
 964716,
 958430,
 938022,
 937988,
 924950,
 907224,
 890618,
 891152,
 924356,
 915360,
 939886,
 920852,
 940442,
 937810,
 887278,
 953464,
 918390,
 915358,
 895776,
 891158,
 920666,
 934426,
 937868,
 944764,
 921320,
 904202,
 902608,
 930586,
 969272,
 896016,
 924948,
 924960,
 944220,
 900492,
 900894,
 932498,
 924930,
 904038,
 907182,
 922474,
 920720,
 937732,
 924958,
 891166,
 900594,
 899972,
 922476,
 930912,
 896008,
 912252,
 922428,
 951506,
 924940,
 888532,
 910520,
 968338,
 880022,
 891170,
 948756,
 906296,
 937762,
 907952,
 905104,
 951648,
 915558,
 907970,
 960672,
 937812,
 901584,
 952460,
 920838,
 940432,
 917850,
 935348,
 906680,
 943066,
 944218,
 896428,
 951492,
 941628,
 932050,
 937980,
 944644,
 907440,
 937678,
 947590,
 920732,
 907178,
 916424,
 951508,
 960296,
 944766,
 959312,
 958434,
 896708,
 926312,
 914630,
 898906,
 958432,
 926490,
 939684,
 964510,
 944216,
 887802,
 924954,
 923250,
 937874,
 938612,
 940434,
 940952,
 908960,
 958460,
 892538,
 920706,
 933810,
 939478,
 958480,
 937774,
 900592,
 896390,
 947636,
 891162,
 895774,
 901586,
 937650,
 947612,
 900788,
 920728,
 905570,
 916618,
 910532,
 924952,
 907994,
 932292,
 907442,
 907982,
 910524,
 944226,
 933244,
 929228,
 919190,
 911310,
 968150,
 908958,
 921116,
 962268,
 960648,
 895172,
 940460,
 892532,
 920146,
 972410,
 950444,
 896386,
 891252,
 940630,
 897030,
 898892,
 921316,
 906416,
 958440,
 917856,
 890228,
 906408,
 938842,
 929544,
 954750,
 937156,
 929876,
 940458,
 943050,
 907180,
 972412,
 898414,
 924546,
 936324,
 907964,
 953458,
 940588,
 945728,
 930590,
 924956,
 919552,
 906070,
 950672,
 903312,
 969386,
 930594,
 919792,
 960620,
 958444,
 930186,
 901588,
 888868,
 890582,
 900790,
 904204,
 914338,
 915614,
 937604,
 886146,
 887992,
 900746,
 893402,
 937750,
 967310,
 888536,
 880240,
 906392,
 897872,
 925134,
 935264,
 958672,
 908078,
 920726,
 917680,
 952466,
 944210,
 924928,
 938414,
 921814,
 937864,
 930908,
 906398,
 937984,
 900194,
 939708,
 892798,
 965232,
 914240,
 944648,
 952656,
 937422,
 924946,
 966158,
 915504,
 901038,
 894224,
 932304,
 937430,
 955032,
 972460,
 935216,
 940456,
 937352,
 926184,
 891824,
 916596,
 880020,
 891946,
 891746,
 947818,
 900458,
 930524,
 938614,
 947816,
 910594,
 900614,
 965328,
 940430,
 890232,
 937898,
 901996,
 896384,
 935244,
 894662,
 945046,
 928228,
 907504,
 910926,
 884408,
 906420,
 929884,
 937856,
 955090,
 958454,
 882688,
 917846,
 940486,
 924932,
 930662,
 894226,
 958472,
 937782,
 956268,
 893420,
 898902,
 940474,
 946582,
 919794,
 919790,
 901178,
 951408,
 951290,
 908082,
 881278,
 919166,
 939736,
 929626,
 905934,
 904044,
 897894,
 894934,
 943818,
 930660,
 906410,
 918956,
 906402,
 932750,
 939716,
 894232,
 908002,
 924614,
 947610,
 894230,
 957022,
 915958,
 917446,
 944476,
 914636,
 909308,
 928926,
 917378,
 920928,
 940440,
 900742,
 955082,
 958470,
 920902,
 921102,
 946454,
 937870,
 951046,
 924544,
 937432,
 910902,
 956298,
 957922,
 922698,
 932088,
 937786,
 924370,
 891156,
 901034,
 944474,
 959118,
 958640,
 933996,
 888540,
 940788,
 892534,
 930910,
 913868,
 969294,
 898422,
 945052,
 896430,
 914848,
 939718,
 933986,
 970242,
 914168,
 934476,
 898434,
 917252,
 951650,
 953362,
 922114,
 941844,
 895370,
 969600,
 892536,
 919204,
 970254,
 891970,
 923254,
 919164,
 948150,
 905472,
 955820,
 941838,
 921104,
 959596,
 941736,
 960950,
 937368,
 892530,
 904200,
 907972,
 941740,
 919554,
 956276,
 887280,
 898920,
 909452,
 969286,
 937428,
 954378,
 918958,
 932798,
 926872,
 928382,
 914816,
 891908,
 970484,
 951186,
 939212,
 917682,
 952916,
 896018,
 896014,
 940568,
 905840,
 964712,
 900792,
 892800,
 942174,
 970486,
 888406,
 930692,
 940436,
 961552,
 940802,
 940782,
 914266,
 939236,
 947480,
 946988,
 969268,
 944228,
 904344,
 951332,
 913474,
 953460,
 954376,
 888538,
 972408,
 954380,
 884402,
 909134,
 932856,
 914870,
 937362,
 933452,
 921062,
 906418,
 930690,
 936296,
 893700,
 929622,
 909892,
 937900,
 932676,
 914426,
 914170,
 941842,
 890226,
 885682,
 937370,
 960670,
 937742,
 922486,
 918888,
 948170,
 955164,
 914348,
 905470,
 905308,
 908984,
 973120,
 901492,
 970252,
 893172,
 919536,
 932754,
 945898,
 896010,
 945074,
 891744,
 908718,
 927142,
 930814,
 967204,
 893400,
 928282,
 928284,
 907998,
 903978,
 970244,
 929542,
 930510,
 969288,
 962904,
 883018,
 934188,
 934000,
 960674,
 891250,
 917876,
 891976,
 955068,
 899076,
 895598,
 940842,
 940848,
 941810,
 919418,
 940814,
 955094,
 950806,
 956292,
 944560,
 945058,
 924572,
 966686,
 944246,
 932246,
 914626,
 914440,
 969262,
 941826,
 930816,
 890234,
 952196,
 969284,
 945274,
 954086,
 930326,
 895266,
 970240,
 946072,
 970250,
 947596,
 970258,
 891918,
 908080,
 947820,
 949420,
 891192,
 882478,
 902238,
 935246,
 937770,
 914272,
 937956,
 939698,
 940610,
 937872,
 905480,
 958458,
 900192,
 914036,
 905862,
 906076,
 913984,
 937778,
 937772,
 955086,
 935548,
 937720,
 918988,
 937716,
 937710,
 919214,
 940828,
 937366,
 941816,
 883898,
 954318,
 963716,
 919206,
 914256,
 935956,
 954284]
In [70]:
len(operators_min_calls)
Out[70]:
674

This way we have received 674 operator IDs whose number of outgoing calls is at or below the average.

internal calls

The fewer internal calls an operator makes, the more effective he/she is considered to be.

In [71]:
internal_calls = df_dataset.query('internal==True and direction=="out"').groupby('operator_id')['calls_count']\
.sum().reset_index().sort_values(by='calls_count', ascending=False)
internal_calls.head()
Out[71]:
operator_id calls_count
163 907986 393
135 906394 366
128 905574 307
136 906396 299
126 905566 290
In [72]:
plt.figure(figsize=(15,3))
internal_calls['calls_count'].plot.hist(density=True, bins= 50, color='#cce1e1', log=True)
plt.title('Number of internal calls')
plt.show()
In [73]:
# box-and-whisker plot
plt.figure(figsize=(15,3))
ax = sns.boxplot(x="calls_count", data=internal_calls, color='#cce1e1') 
ax.set(xlabel='number of internal calls')
plt.show()
In [74]:
internal_calls['calls_count'].describe()
Out[74]:
count    486.000000
mean      24.411523
std       52.832700
min        1.000000
25%        2.000000
50%        5.000000
75%       18.750000
max      393.000000
Name: calls_count, dtype: float64
In [75]:
#Calculating the 95th and 99th percentiles for the number of internal calls (to single out outliers). 
np.percentile(internal_calls['calls_count'], [95, 99])
Out[75]:
array([135. , 279.8])

In general, 486 operators made from 1 to 393 internal outgoing calls during the period. The middle 50% of them made from 2 to 18 internal calls, with a median value of 5 calls.
Only 5% of operators made over 135 internal calls, and only 1% made over 280 internal calls.

Z-score

We will define how effective every operator is in relation to the number of his/her internal calls. To do that we will calculate the z-score for the distribution: a numerical measurement that describes a value's relationship to the mean of a group of values. If the z-score is 0, the operator's number of internal calls is identical to the mean.

In [76]:
#counting z-score for the number of internal calls to define distribution 
internal_calls['internal_calls_z']=(internal_calls['calls_count']-internal_calls['calls_count'].mean())/internal_calls['calls_count'].std()
internal_calls['operator_id'] = internal_calls['operator_id'].astype(str)
internal_calls.head(3)
Out[76]:
operator_id calls_count internal_calls_z
163 907986 393 6.976522
135 906394 366 6.465475
128 905574 307 5.348742

Now we will plot part of the distribution (operators with z-score close to 0) to illustrate the calculations.

In [77]:
#distinguishing z-score by colors: teal for negative values (fewer internal calls), red for positive ones (more internal calls)
internal_calls_z = internal_calls.query('-0.15 <=internal_calls_z<=0.2')\
.sort_values(by='internal_calls_z', ascending=False).reset_index(drop=True)
internal_calls_z['colors'] = ['teal' if x<0 else 'red' for x in internal_calls_z['internal_calls_z']]
internal_calls_z['colors']
 
plt.figure(figsize=(14,10))
plt.hlines(y=internal_calls_z.operator_id, xmin=0, xmax=internal_calls_z.internal_calls_z, colors=internal_calls_z.colors, alpha=0.4, linewidth=10)
plt.title('Operators internal calls relation to mean value of the internal calls \n(z-score)', fontsize=18)
plt.xlabel('z-score', fontsize=18)
plt.ylabel('Operators IDs', fontsize=16)

plt.show()

Now we can create a list of the least effective operators' IDs, those with the largest numbers of internal outgoing calls (a z-score of zero or above):

In [78]:
operators_max_internal = internal_calls.query('internal_calls_z>=0')['operator_id']
operators_max_internal = operators_max_internal.astype(int).values.tolist()
operators_max_internal
Out[78]:
[907986,
 906394,
 905574,
 906396,
 905566,
 906406,
 887276,
 906412,
 905538,
 915358,
 906400,
 887278,
 924356,
 896008,
 905564,
 907182,
 918452,
 924934,
 906294,
 888532,
 951506,
 907952,
 907970,
 915360,
 906296,
 912722,
 951508,
 891414,
 907440,
 924950,
 905542,
 901884,
 958430,
 905570,
 906404,
 907994,
 888534,
 909910,
 951492,
 907982,
 924960,
 958434,
 952468,
 958440,
 891410,
 917850,
 896016,
 907224,
 906416,
 905104,
 932836,
 958460,
 906408,
 924958,
 879896,
 922476,
 907964,
 903318,
 959312,
 924930,
 958444,
 907178,
 922474,
 962658,
 916424,
 888536,
 891416,
 907442,
 921316,
 906392,
 958432,
 924954,
 900180,
 958672,
 924940,
 924948,
 924952,
 958452,
 902744,
 907180,
 901880,
 894224,
 962902,
 940616,
 953462,
 937430,
 921116,
 953464,
 906398,
 924928,
 937422,
 920146,
 893804,
 952948,
 944648,
 919218,
 906680,
 952458,
 906420]
In [79]:
len(operators_max_internal)
Out[79]:
99

This way we have received 99 operator IDs whose number of internal calls is at or above the average.

All in all we have 4 lists of operator IDs that perform ineffectively in 4 spheres:

  • miss calls,
  • keep callers waiting longer than average,
  • do not make enough outgoing calls,
  • call their colleagues too often.

In the next step we will group these IDs into 3 groups.

Ineffective Operators

1. The least effective operators:

Defining operators that performed poorly in all 4 spheres (their IDs appear in every one of the 4 lists)

In [80]:
#Operators that are in all 4 lists:
worst_operators = set(operators_with_missed) & set(operators_max_waiting)\
& set(operators_min_calls) &set(operators_max_internal)
print('The number of least effective operators:', len(worst_operators))
The number of least effective operators: 17
In [81]:
#IDs
display(worst_operators)
{888532,
 888534,
 906396,
 906400,
 906404,
 906408,
 906680,
 907952,
 907982,
 924928,
 924930,
 924948,
 924954,
 924960,
 951508,
 958434,
 958460}

There are 17 least effective operators: they have missed calls, their waiting time is above average, their number of outgoing calls is below average and, finally, they call their colleagues a lot.
Generally such performance might be caused by a lack of experience and qualification.
These operators probably need some additional training if they have already worked for the company long enough. If they are newcomers, some supervision and professional guidance might help improve their performance.

2. Operators of lower effectiveness:

Defining operators that performed poorly in 2 out of 4 spheres (their IDs appear in 2 lists). We think that the most important criteria of effectiveness are the number of outgoing calls and the waiting time.
We suppose that missed calls do not define the effectiveness of an operator much, because operators hardly miss calls at all: the middle 50% of them have from 0 to 1 missed call and only 5% of operators have over 4 missed calls. This cannot be crucial for the current analysis.
As for internal calls, they can also be omitted at this stage. There are only 99 operators who make them often, and most of them could be newcomers.
We will define operators that appear in both of these two lists of ineffectiveness (number of outgoing calls and waiting time):

In [82]:
#Operators that are in 2 lists:
bad_operators = set(operators_max_waiting) & set(operators_min_calls)
print('The number of operators of lower effectiveness:', len(bad_operators))
The number of operators of lower effectiveness: 155

3. Operators of average effectiveness:

Defining operators that performed poorly in at least 1 of the 4 spheres (their IDs appear in at least one of the 4 lists)

In [83]:
#Operators that are at least in one of the lists:
#joining the 4 lists 

avg_effective_operators = list(set(operators_with_missed) | set(operators_max_waiting) | set(operators_min_calls) |set(operators_max_internal))

len(avg_effective_operators)
Out[83]:
857
In [84]:
display(avg_effective_operators)
[937984,
 919552,
 882688,
 919554,
 958468,
 937988,
 958470,
 897030,
 958472,
 954376,
 954378,
 882690,
 954380,
 899076,
 958478,
 899082,
 958480,
 935956,
 935958,
 966686,
 888868,
 938022,
 915504,
 901178,
 964666,
 948284,
 948286,
 948288,
 913474,
 956484,
 909392,
 944210,
 905300,
 929876,
 938072,
 944216,
 938074,
 944218,
 929884,
 905308,
 942174,
 944220,
 938080,
 944222,
 933986,
 962658,
 944226,
 915556,
 944228,
 915558,
 964712,
 933996,
 960620,
 886892,
 964716,
 925808,
 934000,
 944246,
 903292,
 925826,
 925828,
 960648,
 952458,
 909452,
 952460,
 952462,
 903312,
 952466,
 952468,
 903318,
 915614,
 960670,
 960672,
 960674,
 895140,
 927910,
 927912,
 950444,
 927916,
 958640,
 917680,
 917682,
 907440,
 899250,
 907442,
 934074,
 934076,
 899268,
 895172,
 958672,
 934098,
 932050,
 905430,
 921814,
 905432,
 921818,
 925922,
 930020,
 901350,
 907502,
 919790,
 919792,
 907504,
 919794,
 893172,
 932088,
 905470,
 905472,
 905478,
 905480,
 891152,
 954642,
 891154,
 891156,
 946454,
 891158,
 891160,
 954650,
 891162,
 891166,
 891170,
 895266,
 946468,
 934188,
 909624,
 891192,
 973120,
 905538,
 905542,
 883018,
 952656,
 962902,
 917846,
 962904,
 903512,
 917850,
 944474,
 917852,
 905564,
 971102,
 928092,
 917856,
 944476,
 905570,
 907618,
 905566,
 905574,
 952678,
 919910,
 936296,
 891250,
 901492,
 913780,
 917876,
 891252,
 901498,
 913788,
 954750,
 936324,
 930186,
 895370,
 940430,
 950672,
 940432,
 940434,
 940436,
 946582,
 940438,
 932246,
 940440,
 940442,
 940456,
 940458,
 940460,
 938414,
 944560,
 960950,
 919994,
 940474,
 936382,
 938432,
 930242,
 932292,
 940486,
 909768,
 913868,
 901584,
 932304,
 901586,
 901588,
 893400,
 893402,
 940508,
 913886,
 940512,
 928228,
 973286,
 926184,
 893420,
 887276,
 887278,
 887280,
 887282,
 915958,
 922114,
 944644,
 944646,
 944648,
 913938,
 891410,
 948756,
 913942,
 950806,
 934424,
 930326,
 934426,
 948758,
 934428,
 891414,
 934430,
 940568,
 934432,
 928282,
 928284,
 967204,
 940588,
 969262,
 940596,
 969268,
 969272,
 913984,
 936514,
 940610,
 909892,
 969284,
 940614,
 969286,
 940616,
 969288,
 934476,
 940622,
 969294,
 940624,
 952914,
 920146,
 952916,
 909910,
 940630,
 895574,
 895576,
 940634,
 957022,
 926312,
 940652,
 932460,
 895598,
 905840,
 965232,
 905842,
 940658,
 938612,
 914036,
 938614,
 952948,
 944764,
 881278,
 944766,
 928382,
 914050,
 905862,
 934534,
 912010,
 905870,
 967310,
 959118,
 932498,
 932500,
 955032,
 969380,
 969386,
 907952,
 950972,
 955068,
 907964,
 910018,
 907970,
 924356,
 907972,
 905930,
 955082,
 905932,
 899788,
 930510,
 905934,
 955086,
 907982,
 955090,
 965328,
 907986,
 924370,
 955094,
 907994,
 930524,
 907998,
 908002,
 940782,
 922356,
 940788,
 901880,
 914168,
 914170,
 901884,
 901886,
 891646,
 940802,
 893700,
 901894,
 951046,
 928518,
 940814,
 926490,
 930586,
 955164,
 940828,
 930590,
 895774,
 895776,
 930594,
 895782,
 895786,
 903978,
 946988,
 940842,
 908078,
 940848,
 908080,
 908082,
 899892,
 899898,
 922428,
 899900,
 914238,
 914240,
 908098,
 899906,
 932676,
 899912,
 951118,
 959312,
 897872,
 914256,
 904016,
 906070,
 938842,
 914266,
 906076,
 914272,
 891744,
 908130,
 891746,
 930660,
 930662,
 897894,
 901992,
 904038,
 922474,
 893804,
 922476,
 901996,
 904044,
 918390,
 922486,
 908152,
 899964,
 912252,
 899968,
 924544,
 930690,
 924546,
 930692,
 908162,
 969600,
 899972,
 932750,
 938896,
 910226,
 951186,
 932754,
 945046,
 940952,
 889754,
 945052,
 924572,
 914338,
 945058,
 918436,
 965538,
 965542,
 912296,
 914348,
 882684,
 891824,
 885682,
 945074,
 918452,
 932798,
 924614,
 916424,
 943050,
 943066,
 932836,
 924662,
 932856,
 914426,
 951290,
 887802,
 891900,
 930814,
 930816,
 891906,
 891908,
 914440,
 904200,
 896008,
 896010,
 904202,
 896012,
 904204,
 891918,
 908300,
 961552,
 896014,
 896018,
 896016,
 953362,
 951332,
 891946,
 955436,
 906294,
 906296,
 891970,
 891976,
 922698,
 891988,
 900180,
 920666,
 930908,
 930910,
 902238,
 930912,
 900192,
 930914,
 900194,
 947304,
 959596,
 951408,
 953458,
 885876,
 916596,
 953460,
 928886,
 953462,
 928888,
 953464,
 945274,
 885890,
 920706,
 963716,
 916618,
 920720,
 920726,
 906392,
 926872,
 906394,
 920728,
 906396,
 904344,
 906398,
 920732,
 906400,
 928926,
 906402,
 906404,
 906406,
 906408,
 894120,
 906410,
 906412,
 949420,
 906416,
 906418,
 910516,
 906420,
 904372,
 910520,
 887992,
 883898,
 910524,
 951490,
 914626,
 951492,
 910532,
 914630,
 937156,
 910540,
 939212,
 914636,
 933070,
 951506,
 939218,
 951508,
 939224,
 958456,
 939236,
 883940,
 958460,
 900352,
 924928,
 910594,
 900354,
 924930,
 924932,
 924934,
 920838,
 924936,
 924940,
 894224,
 894226,
 924946,
 920852,
 924948,
 894230,
 924950,
 894232,
 947480,
 924952,
 924954,
 924956,
 879896,
 924958,
 924960,
 920874,
 912684,
 935216,
 920902,
 941384,
 935244,
 935246,
 941390,
 912722,
 927070,
 951648,
 935264,
 951650,
 920928,
 920930,
 918888,
 939370,
 900458,
 902510,
 898414,
 939376,
 890226,
 890228,
 898422,
 890232,
 890234,
 933244,
 896382,
 914816,
 896384,
 886146,
 898434,
 902532,
 896386,
 947590,
 896390,
 937352,
 947596,
 900492,
 947598,
 947600,
 937362,
 880020,
 920982,
 937366,
 937368,
 880022,
 914842,
 947610,
 947612,
 937370,
 937374,
 914848,
 927142,
 918956,
 955820,
 918958,
 896428,
 896430,
 908718,
 925106,
 935348,
 947636,
 914870,
 906680,
 947640,
 947644,
 918978,
 947652,
 918988,
 929228,
 925134,
 937422,
 902608,
 939474,
 939476,
 937428,
 939478,
 968150,
 937430,
 937432,
 957922,
 921062,
 900592,
 900594,
 898558,
 898560,
 970240,
 970242,
 970244,
 900614,
 970250,
 970252,
 921102,
 970254,
 921104,
 966158,
 970258,
 896536,
 896538,
 921116,
 908834,
 890404,
 890406,
 890422,
 910902,
 929340,
 941628,
 945728,
 933452,
 910926,
 919126,
 888406,
 919128,
 902744,
 947816,
 947818,
 947820,
 910958,
 880240,
 923250,
 892530,
 892532,
 923254,
 892534,
 972408,
 892536,
 972410,
 892538,
 919164,
 972412,
 919166,
 935548,
 931458,
 937604,
 900742,
 900746,
 929424,
 968338,
 894614,
 919190,
 908958,
 908960,
 919204,
 939684,
 919206,
 941736,
 972460,
 941740,
 919214,
 956080,
 968368,
 919218,
 939698,
 937650,
 884402,
 900788,
 900790,
 884408,
 900792,
 939706,
 908984,
 939708,
 884412,
 894656,
 939716,
 896708,
 894662,
 939718,
 943818,
 937678,
 888532,
 890582,
 888534,
 939736,
 888536,
 939738,
 921306,
 888538,
 962268,
 888540,
 900826,
 921316,
 921318,
 954086,
 921320,
 945898,
 937708,
 909308,
 937710,
 941810,
 970484,
 937716,
 970486,
 937720,
 941816,
 890618,
 911102,
 902910,
 941826,
 937732,
 917252,
 929542,
 937736,
 929544,
 937742,
 941838,
 941842,
 941844,
 937750,
 900892,
 900894,
 911136,
 937760,
 911138,
 937762,
 911140,
 911142,
 960294,
 960296,
 937770,
 882476,
 937772,
 882478,
 937774,
 911152,
 952114,
 937778,
 952116,
 937780,
 937782,
 937786,
 937788,
 891416,
 898892,
 909134,
 937808,
 937810,
 937812,
 898902,
 929622,
 898906,
 929626,
 919390,
 898920,
 956268,
 911214,
 939886,
 956276,
 919418,
 892798,
 937854,
 937856,
 892800,
 917378,
 956292,
 952196,
 923526,
 937860,
 923528,
 937862,
 956298,
 937864,
 937868,
 937870,
 905104,
 937872,
 886674,
 937874,
 946072,
 964510,
 915358,
 915360,
 907174,
 964518,
 907178,
 937898,
 907180,
 901034,
 907182,
 937900,
 954284,
 937902,
 901038,
 933806,
 954290,
 933810,
 948150,
 917446,
 933832,
 948170,
 911310,
 954318,
 958416,
 958418,
 948182,
 894934,
 907224,
 958430,
 958432,
 958434,
 952292,
 937956,
 937958,
 937960,
 958440,
 937962,
 958444,
 937966,
 919536,
 958452,
 958454,
 882680,
 958458,
 937980,
 882686]
In [86]:
#list of all operators:
all_operators = df_dataset['operator_id'].unique()
all_operators = all_operators.tolist()
In [87]:
#operators that never got to any of the lists:
good_operators = list(set(all_operators) - set(avg_effective_operators))
good_operators
Out[87]:
[930818,
 930820,
 949252,
 926214,
 962062,
 923666,
 896020,
 936984,
 922142,
 890402,
 961064,
 968232,
 922154,
 890410,
 891948,
 890412,
 945710,
 890416,
 890420,
 929332,
 928312,
 948282,
 956480,
 902720,
 889410,
 909894,
 884294,
 952392,
 909896,
 931914,
 953940,
 903254,
 938070,
 922710,
 902742,
 895578,
 971354,
 904284,
 902746,
 938078,
 902748,
 923744,
 953952,
 901734,
 945308,
 906866,
 905844,
 898676,
 902774,
 887416,
 902776,
 952954,
 919162,
 892028,
 902778,
 945278,
 902782,
 945280,
 962168,
 945282,
 914052,
 945284,
 942214,
 945286,
 952968,
 941192,
 945288,
 945290,
 945294,
 945296,
 945298,
 929426,
 928404,
 919188,
 932502,
 952982,
 945304,
 945302,
 929428,
 919192,
 919194,
 919196,
 945310,
 919198,
 945312,
 919200,
 945314,
 919202,
 927908,
 945316,
 945318,
 944808,
 945320,
 945322,
 945324,
 936110,
 944310,
 960698,
 909502,
 895170,
 910530,
 907974,
 888520,
 899790,
 934608,
 884946,
 910038,
 939222,
 941786,
 910044,
 903390,
 946916,
 940772,
 883942,
 951526,
 945894,
 893674,
 961770,
 945900,
 945902,
 945904,
 917234,
 939762,
 956664,
 954616,
 902906,
 884478,
 911104,
 919302,
 919306,
 944908,
 919310,
 919314,
 926486,
 930582,
 919318,
 937752,
 934170,
 879898,
 889638,
 903974,
 923944,
 889640,
 903976,
 948524,
 884524,
 965942,
 951614,
 952126,
 932672,
 919362,
 919364,
 919370,
 919372,
 904014,
 919374,
 919376,
 919378,
 919382,
 919896,
 952666,
 971100,
 899932,
 920414,
 920416,
 908640,
 917858,
 919906,
 935268,
 946020,
 950632,
 950634,
 950640,
 946032,
 904058,
 947592,
 928142,
 910224,
 919956,
 908180,
 919958,
 947604,
 937880,
 880026,
 880028,
 937372,
 952734,
 947614,
 947616,
 919456,
 947618,
 937888,
 965540,
 947620,
 940968,
 965544,
 919464,
 933804,
 925104,
 944564,
 919476,
 947638,
 954810,
 947642,
 919482,
 958394,
 947646,
 935870,
 947648,
 947650,
 919490,
 947654,
 928202,
 918986,
 947658,
 939470,
 919504,
 930264,
 940514,
 935394,
 921574,
 921584,
 921592,
 921594,
 921596]
In [88]:
len(good_operators)
Out[88]:
225

There are 225 operators that never appeared in any of the lists; they can be considered the most effective ones.
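The subtraction `set(all_operators) - set(avg_effective_operators)` implies that the "good" and "flagged" groups are disjoint and together cover every operator. A tiny sanity check of that property on toy IDs (values are illustrative, not real operator IDs):

```python
# toy IDs standing in for the real operator lists
all_operators = {1, 2, 3, 4, 5, 6}
flagged = {2, 4, 5}              # appeared in at least one "ineffective" list
good = all_operators - flagged   # never flagged

assert good & flagged == set()          # the groups are disjoint
assert good | flagged == all_operators  # together they cover everyone
print(sorted(good))  # → [1, 3, 6]
```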

In [89]:
#Creating a DataFrame to visualise the results:
data_op=[{'avg_effective_operators': len(avg_effective_operators), 'bad_operators':len(bad_operators), \
          'worst_operators': len(worst_operators)
         }]
operators_funnel = pd.DataFrame(data_op).T.reset_index()
operators_funnel.columns=['operators_groups', 'operators_number']

operators_funnel
Out[89]:
operators_groups operators_number
0 avg_effective_operators 857
1 bad_operators 155
2 worst_operators 17
In [90]:
#funnel chart with the 3 groups:

fig = go.Figure(go.Funnel(y = operators_funnel.operators_groups, x = operators_funnel.operators_number, 
                          opacity = 0.85, marker = {"color": ['#333160', '#5b5895', '#9467bd', '#3ac7c3' , '#72B7B2', '#c6f7e7', '#C9FBE5']}))
                                                 
fig.show()

Combining the 4 lists of operators with the least effective performance in at least one of the spheres, we have received 3 groups:

  1. The least effective operators - worst_operators - list of 17 IDs
  2. Operators of lower effectiveness - bad_operators - list of 155 IDs
  3. Operators of average effectiveness - avg_effective_operators - list of 857 IDs.
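Note that the three groups are nested: every ID in worst_operators is also in bad_operators, and every ID in bad_operators is also in avg_effective_operators (an intersection of lists is always a subset of their union). If disjoint buckets are ever needed for reporting, they can be obtained by subtraction; a sketch on toy IDs (values are illustrative):

```python
# nested toy groups: worst ⊆ bad ⊆ avg
avg   = {1, 2, 3, 4, 5, 6, 7}  # flagged on at least one criterion
bad   = {2, 3, 5}              # flagged on both key criteria
worst = {3}                    # flagged on all four criteria

only_avg = avg - bad           # flagged, but not on both key criteria
only_bad = bad - worst         # both key criteria, but not all four
print(len(only_avg), len(only_bad), len(worst))  # → 4 2 1
```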

Studying the features for different tariff plans

Merging the tables

In [91]:
#general db (merged)
df = df_dataset.merge(df_cl, on='user_id')
df.head(3)
Out[91]:
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration tariff_plan date_start
0 166377 2019-08-05 out True 880022 True 3 0 5 B 2019-08-01
1 166377 2019-08-05 out True 880020 True 1 0 1 B 2019-08-01
2 166377 2019-08-05 out True 880020 False 1 10 18 B 2019-08-01
In [92]:
#Share of missed calls per tariff 
tariffs_missed_calls = df.groupby('tariff_plan')['is_missed_call']\
.mean().reset_index().sort_values(by='is_missed_call', ascending=False)\
.rename({'is_missed_call': 'missed_calls_share'}, axis=1)
tariffs_missed_calls
Out[92]:
tariff_plan missed_calls_share
0 A 0.353561
1 B 0.345068
2 C 0.323746
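The share here comes straight from calling `mean()` on the boolean `is_missed_call` column: pandas treats True as 1 and False as 0, so the mean of the column equals the share of missed calls. A one-line illustration:

```python
import pandas as pd

# three calls, two of them missed -> share of missed calls is 2/3
missed = pd.Series([True, True, False])
print(round(missed.mean(), 3))  # → 0.667
```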
In [93]:
#Share of internal calls per tariff 
tariffs_internal_calls = df.groupby('tariff_plan')['internal'].mean().reset_index() 
tariffs_internal_calls
Out[93]:
tariff_plan internal
0 A 0.161791
1 B 0.108376
2 C 0.128192
In [94]:
#DataFrame copy for graphical interpretation of the feature distributions
df_graphs = df.copy()  # .copy() so that the replacements below do not modify df
In [95]:
#adjusting the copy for graphical interpretation
df_graphs['internal'] = df_graphs['internal'].replace(to_replace=True, value ="internal").replace(to_replace=False, value ="external")
df_graphs['is_missed_call'] = df_graphs['is_missed_call'].replace(to_replace=True, value ="missed_call").replace(to_replace=False, value ="answered")
df_graphs.insert(0, "count", 1)
df_graphs.head()
Out[95]:
count user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration tariff_plan date_start
0 1 166377 2019-08-05 out internal 880022 missed_call 3 0 5 B 2019-08-01
1 1 166377 2019-08-05 out internal 880020 missed_call 1 0 1 B 2019-08-01
2 1 166377 2019-08-05 out internal 880020 answered 1 10 18 B 2019-08-01
3 1 166377 2019-08-05 out external 880022 missed_call 3 0 25 B 2019-08-01
4 1 166377 2019-08-05 out external 880020 answered 2 3 29 B 2019-08-01
In [96]:
#plotting features distribution
parameters_hist = ['tariff_plan', 'direction', 'internal', 'is_missed_call']
parameters_dis = ['date_start']

for x in parameters_hist:
    plt.figure(figsize=(18,5))
    ax = sns.barplot(x=x, y="count", data=df_graphs, hue='tariff_plan', estimator=sum, palette="mako") 
    for p in ax.patches:
        ax.annotate(format(p.get_height(), '.0f'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 9), 
                   textcoords = 'offset points')
    
    plt.title('Distribution of {}'.format(x))        
    plt.show()
            
for x in parameters_dis:
    plt.figure(figsize=(18,5))
    sns.kdeplot(data=df_graphs, x=x, hue="tariff_plan", alpha = 0.6 , palette="mako", fill=True)       
    plt.show()

Distribution of the parameter 'calls_count'

In [97]:
df_graphs
Out[97]:
count user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration tariff_plan date_start
0 1 166377 2019-08-05 out internal 880022 missed_call 3 0 5 B 2019-08-01
1 1 166377 2019-08-05 out internal 880020 missed_call 1 0 1 B 2019-08-01
2 1 166377 2019-08-05 out internal 880020 answered 1 10 18 B 2019-08-01
3 1 166377 2019-08-05 out external 880022 missed_call 3 0 25 B 2019-08-01
4 1 166377 2019-08-05 out external 880020 answered 2 3 29 B 2019-08-01
... ... ... ... ... ... ... ... ... ... ... ... ...
37357 1 168606 2019-11-08 in external 957922 answered 2 686 705 C 2019-10-31
37358 1 168606 2019-11-09 out external 957922 answered 4 551 593 C 2019-10-31
37359 1 168606 2019-11-10 out internal 957922 answered 1 0 25 C 2019-10-31
37360 1 168606 2019-11-10 out internal 957922 missed_call 1 0 38 C 2019-10-31
37361 1 168606 2019-11-11 out internal 957922 answered 2 479 501 C 2019-10-31

37362 rows × 12 columns

In [98]:
#distribution of calls_count
def logx_kde(x, y, xmin, xmax, ymin, ymax): #helper that log-transforms the x-data before a 2D KDE (defined once here; the seaborn plots below use a log axis instead)
    x = np.log10(x)

    # Perform the kernel density estimate
    xx, yy = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
    positions = np.vstack([xx.ravel(), yy.ravel()])
    values = np.vstack([x, y])
    kernel = st.gaussian_kde(values)
    f = np.reshape(kernel(positions).T, xx.shape)
    return np.power(10, xx), yy, f
plt.figure(figsize=(18,5))
ax=sns.kdeplot(data=df_graphs, x='calls_count', hue="tariff_plan", hue_order=["A", "C", "B"], alpha = 0.6 , palette="mako", fill=True)
ax.set_xscale('log')
In [99]:
# box-and-whisker plot
plt.figure(figsize=(18,3))
ax = sns.boxplot(x="calls_count", y='tariff_plan', data=df_graphs, color='#cce1e1') 
ax.set(xlabel='number of calls')
plt.show()

Distribution of the parameter 'call_duration'

In [100]:
#distribution of call_duration
plt.figure(figsize=(18,5))
ax=sns.kdeplot(data=df_graphs, x='call_duration', hue="tariff_plan", hue_order=["A", "C", "B"], alpha = 0.6 , palette="mako", fill=True)
ax.set_xscale('log')
In [101]:
# box-and-whisker plot
plt.figure(figsize=(18,3))
ax = sns.boxplot(x="call_duration", y='tariff_plan', data=df_graphs, color='#cce1e1') 
ax.set(xlabel='call duration, sec.')
plt.show()

Distribution of the parameter 'total_call_duration'

In [102]:
#distribution of total_call_duration
plt.figure(figsize=(18,5))
ax=sns.kdeplot(data=df_graphs, x='total_call_duration', hue="tariff_plan", hue_order=["A", "C", "B"], alpha = 0.6 , palette="mako", fill=True)
ax.set_xscale('log')
In [103]:
# box-and-whisker plot
plt.figure(figsize=(18,3))
ax = sns.boxplot(x="total_call_duration", y='tariff_plan', data=df_graphs, color='#cce1e1') 
ax.set(xlabel='total call duration, sec.')
plt.show()

Description of the features distribution.

There are 3 tariff plans. The largest number of calls was made on plan C (13293 calls), slightly fewer on plan B (13024), and the smallest number on plan A (10995).
All 3 plans have been in use since the beginning of August. The largest number of users registered for plan B between 01/08/2019 and 18/08/2019; the peak of registrations for plan C was 15/08/2019-01/09/2019;
the peak for plan A was the latest and the shortest: from 15/10/2019 till 20/11/2019.
The largest number of outgoing calls is made on tariff B; the smallest on plan A. The largest number and the largest share of incoming calls are made on plan C.
Operators on plan B have the lowest share of internal calls (11%), while operators on plan A have the highest (16%). This may indicate higher professionalism of those working with plan B and, conversely, a lower professional level of operators on plan A.

Plan C has the largest number of answered calls, plan A the lowest. Plan A has the highest share of missed calls (35%), plan C the lowest (32%).
For all tariff plans the share of missed calls is quite high: every third call goes unanswered. The phone company might consider employing additional operators to get more calls answered and reduce waiting time.
Another option may be outsourcing operators during peak periods, e.g. when the company runs marketing campaigns and expects a large number of registrations and newcomers.
Calls count. Operators on plan B make the largest number of calls, those on plan C the fewest. Plans A and C have a long right tail and many outliers.
Call_duration. Call duration looks standard for all tariffs, but tariff plan A has a very long right tail with a large number of outliers.
Total_call_duration. The distribution looks similar to the call_duration distribution, but the average duration for tariff A is higher.

Parameters of efficiency by tariff plans

In [104]:
#creating a separate db (copy() keeps df_tariffs independent of df_graphs)
df_tariffs = df_graphs.copy()
#substituting the text labels by numbers for calculations
df_tariffs['is_missed_call'] = np.where(df_tariffs['is_missed_call'] == "missed_call", 1, 0).astype('int', copy=False)
#adding parameter of waiting time:
df_tariffs['avg_waiting_time'] = round(((df_tariffs['total_call_duration'] - df_tariffs['call_duration']) /df_tariffs['calls_count']), 2)
df_tariffs.head()
Out[104]:
count user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration tariff_plan date_start avg_waiting_time
0 1 166377 2019-08-05 out internal 880022 1 3 0 5 B 2019-08-01 1.67
1 1 166377 2019-08-05 out internal 880020 1 1 0 1 B 2019-08-01 1.00
2 1 166377 2019-08-05 out internal 880020 0 1 10 18 B 2019-08-01 8.00
3 1 166377 2019-08-05 out external 880022 1 3 0 25 B 2019-08-01 8.33
4 1 166377 2019-08-05 out external 880020 0 2 3 29 B 2019-08-01 13.00
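The waiting-time formula used above can be sanity-checked by hand on a single row; the hypothetical numbers below reproduce row 4 of the table:

```python
# avg waiting time = (total ring time - talk time) / number of calls
total_call_duration = 29  # seconds the line was engaged in total
call_duration = 3         # seconds actually spent talking
calls_count = 2           # number of calls in this record

avg_waiting_time = round((total_call_duration - call_duration) / calls_count, 2)
# (29 - 3) / 2 = 13.0 seconds per call
```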
In [105]:
df_A = df_tariffs.query('tariff_plan == "A"')
df_B = df_tariffs.query('tariff_plan == "B"')
df_C = df_tariffs.query('tariff_plan == "C"')
In [106]:
df_list = [df_A, df_B, df_C]
In [107]:
df_A.name = 'df_A'
df_B.name = 'df_B'
df_C.name = 'df_C'
In [108]:
for i in df_list:
    print(i.name, i.shape[0])
df_A 11008
df_B 13038
df_C 13316

1. missed calls by tariff

In [109]:
#List of operators with missed calls for tariff A
missed_calls_A = df_A.query('direction=="in"').groupby('operator_id')['is_missed_call'].sum().reset_index()
missed_calls_A['is_missed_call'] = missed_calls_A['is_missed_call'].astype(int)
operators_with_missed_A = missed_calls_A.query('is_missed_call>0')['operator_id']
operators_with_missed_A.values.tolist()
print('Number of operators with missed calls for tariff A:', len(operators_with_missed_A))
Number of operators with missed calls for tariff A: 71
In [110]:
#List of operators with missed calls for tariff B
missed_calls_B = df_B.query('direction=="in"').groupby('operator_id')['is_missed_call'].sum().reset_index().sort_values(by='is_missed_call', ascending=False)
missed_calls_B['is_missed_call'] = missed_calls_B['is_missed_call'].astype(int)
operators_with_missed_B = missed_calls_B.query('is_missed_call>0')['operator_id'].values.tolist()
print('Number of operators with missed calls for tariff B:', len(operators_with_missed_B))
Number of operators with missed calls for tariff B: 71
In [111]:
#List of operators with missed calls for tariff C
missed_calls_C = df_C.query('direction=="in"').groupby('operator_id')['is_missed_call'].sum().reset_index().sort_values(by='is_missed_call', ascending=False)
missed_calls_C['is_missed_call'] = missed_calls_C['is_missed_call'].astype(int)
operators_with_missed_C = missed_calls_C.query('is_missed_call>0')['operator_id']
operators_with_missed_C.values.tolist()
print('Number of operators with missed calls for tariff C:', len(operators_with_missed_C))
Number of operators with missed calls for tariff C: 88
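The three near-identical cells above could be collapsed into a single loop over tariff plans; a minimal sketch on a hypothetical toy frame whose column names mirror the real dataset:

```python
import pandas as pd

# Hypothetical toy data standing in for df_A / df_B / df_C combined.
toy = pd.DataFrame({
    "tariff_plan":    ["A", "A", "B", "B"],
    "direction":      ["in", "in", "in", "in"],
    "operator_id":    [1, 2, 3, 3],
    "is_missed_call": [1, 0, 1, 1],
})

# One pass: per plan, list operators whose missed-call total is positive.
operators_with_missed = {}
for plan, group in toy.query('direction == "in"').groupby("tariff_plan"):
    missed = group.groupby("operator_id")["is_missed_call"].sum()
    operators_with_missed[plan] = missed[missed > 0].index.tolist()
# operators_with_missed == {"A": [1], "B": [3]}
```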

2. waiting time by tariff

In [112]:
#A
waiting_calls_operators_A = df_A.query('direction=="in"').groupby('operator_id').agg({'avg_waiting_time': 'sum'}).reset_index()
#calculating z-score
waiting_calls_operators_A['waiting_time_z']=(waiting_calls_operators_A['avg_waiting_time']-waiting_calls_operators_A['avg_waiting_time'].mean())/waiting_calls_operators_A['avg_waiting_time'].std()
#picking up the operators
operators_max_waiting_A = waiting_calls_operators_A.query('waiting_time_z>=0')['operator_id'].values.tolist()
print('Number of operators with long waiting time for tariff A:', len(operators_max_waiting_A))
Number of operators with long waiting time for tariff A: 55
In [113]:
#B
waiting_calls_operators_B = df_B.query('direction=="in"').groupby('operator_id').agg({'avg_waiting_time': 'sum'}).reset_index()
#calculating z-score
waiting_calls_operators_B['waiting_time_z']=(waiting_calls_operators_B['avg_waiting_time']\
                                             -waiting_calls_operators_B['avg_waiting_time'].mean())\
/waiting_calls_operators_B['avg_waiting_time'].std()
#picking up the operators
operators_max_waiting_B = waiting_calls_operators_B.query('waiting_time_z>=0')['operator_id'].values.tolist()
print('Number of operators with long waiting time for tariff B:', len(operators_max_waiting_B))
Number of operators with long waiting time for tariff B: 75
In [114]:
#C
waiting_calls_operators_C = df_C.query('direction=="in"').groupby('operator_id').agg({'avg_waiting_time': 'sum'}).reset_index()
#calculating z-score
waiting_calls_operators_C['waiting_time_z']=(waiting_calls_operators_C['avg_waiting_time']\
                                             -waiting_calls_operators_C['avg_waiting_time'].mean())\
/waiting_calls_operators_C['avg_waiting_time'].std()
#picking up the operators
operators_max_waiting_C = waiting_calls_operators_C.query('waiting_time_z>=0')['operator_id'].values.tolist()
print('Number of operators with long waiting time for tariff C:', len(operators_max_waiting_C))
Number of operators with long waiting time for tariff C: 88
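The z-score computation repeated in the three cells above can be factored into one helper; a minimal sketch on hypothetical waiting-time totals (operators at or above the mean get z >= 0 and are flagged):

```python
import pandas as pd

def z_scores(series: pd.Series) -> pd.Series:
    """Standardize a series: (x - mean) / sample std (pandas uses ddof=1)."""
    return (series - series.mean()) / series.std()

# Hypothetical waiting-time totals for five operators.
waits = pd.Series([10.0, 20.0, 30.0, 40.0, 100.0])
above_avg = waits[z_scores(waits) >= 0]  # the mean is 40.0
```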

3. outgoing calls by tariff

In [115]:
#A
outgoing_calls_A = df_A.query('direction=="out"').groupby('operator_id')['calls_count'].sum().reset_index()
#z-score
outgoing_calls_A['calls_count_z']=(outgoing_calls_A['calls_count']-outgoing_calls_A['calls_count'].mean())\
/outgoing_calls_A['calls_count'].std()
operators_min_calls_A = outgoing_calls_A.query('calls_count_z<=0')['operator_id'].values.tolist()
print('Number of operators with small number of outgoing calls for tariff A:', len(operators_min_calls_A))
Number of operators with small number of outgoing calls for tariff A: 210
In [116]:
#B
outgoing_calls_B = df_B.query('direction=="out"').groupby('operator_id')['calls_count'].sum().reset_index()
#z-score
outgoing_calls_B['calls_count_z']=(outgoing_calls_B['calls_count']-outgoing_calls_B['calls_count'].mean())\
/outgoing_calls_B['calls_count'].std()
operators_min_calls_B = outgoing_calls_B.query('calls_count_z<=0')['operator_id'].values.tolist()
print('Number of operators with small number of outgoing calls for tariff B:', len(operators_min_calls_B))
Number of operators with small number of outgoing calls for tariff B: 248
In [117]:
#C
outgoing_calls_C = df_C.query('direction=="out"').groupby('operator_id')['calls_count'].sum().reset_index()
#z-score
outgoing_calls_C['calls_count_z']=(outgoing_calls_C['calls_count']-outgoing_calls_C['calls_count'].mean())\
/outgoing_calls_C['calls_count'].std()
operators_min_calls_C = outgoing_calls_C.query('calls_count_z<=0')['operator_id'].values.tolist()
print('Number of operators with small number of outgoing calls for tariff C:', len(operators_min_calls_C))
Number of operators with small number of outgoing calls for tariff C: 211

4. internal calls by tariff

In [118]:
#A
internal_calls_A = df_A.query('internal=="internal" and direction=="out"').groupby('operator_id')['calls_count'].sum().reset_index()
#z-score
internal_calls_A['internal_calls_z']=(internal_calls_A['calls_count']-internal_calls_A['calls_count'].mean())\
/internal_calls_A['calls_count'].std()
operators_max_internal_A = internal_calls_A.query('internal_calls_z>=0')['operator_id'].values.tolist()
print('Number of operators with big number of outgoing internal calls for tariff A:', len(operators_max_internal_A))
Number of operators with big number of outgoing internal calls for tariff A: 33
In [119]:
#B
internal_calls_B = df_B.query('internal=="internal" and direction=="out"').groupby('operator_id')['calls_count'].sum().reset_index()
#z-score
internal_calls_B['internal_calls_z']=(internal_calls_B['calls_count']-internal_calls_B['calls_count'].mean())\
/internal_calls_B['calls_count'].std()
operators_max_internal_B = internal_calls_B.query('internal_calls_z>0')['operator_id'].values.tolist()
print('Number of operators with big number of outgoing internal calls for tariff B:', len(operators_max_internal_B))
Number of operators with big number of outgoing internal calls for tariff B: 30
In [120]:
#C
internal_calls_C = df_C.query('internal=="internal" and direction=="out"').groupby('operator_id')['calls_count'].sum().reset_index()
#z-score
internal_calls_C['internal_calls_z']=(internal_calls_C['calls_count']-internal_calls_C['calls_count'].mean())\
/internal_calls_C['calls_count'].std()
operators_max_internal_C = internal_calls_C.query('internal_calls_z>0')['operator_id'].values.tolist()
print('Number of operators with big number of outgoing internal calls for tariff C:', len(operators_max_internal_C))
Number of operators with big number of outgoing internal calls for tariff C: 40

List of ineffective operators by tariff plan

for tariff A

In [121]:
#Operators that are in all 4 lists:
worst_operators_A = set(operators_with_missed_A) & set(operators_max_waiting_A)\
& set(operators_min_calls_A) &set(operators_max_internal_A)
print('The number of least effective operators for tariff A:', len(worst_operators_A))
The number of least effective operators for tariff A: 10
In [122]:
#Operators that are in 2 lists:
bad_operators_A = set(operators_max_waiting_A) & set(operators_min_calls_A)
print('The number of operators of lower effectiveness:', len(bad_operators_A))
The number of operators of lower effectiveness: 37
In [123]:
#Operators that are at least in one of the lists:

avg_effective_operators_A = list(set(operators_with_missed_A) | set(operators_max_waiting_A) \
                                 | set(operators_min_calls_A) |set(operators_max_internal_A))

print('The number of not very effective operators for tariff A:', len(avg_effective_operators_A))
The number of not very effective operators for tariff A: 234

for tariff B

In [124]:
#Operators that are in all 4 lists:
worst_operators_B = set(operators_with_missed_B) & set(operators_max_waiting_B)\
& set(operators_min_calls_B) &set(operators_max_internal_B)
print('The number of least effective operators for tariff B:', len(worst_operators_B))
The number of least effective operators for tariff B: 6
In [125]:
#Operators that are in 2 lists:
bad_operators_B = set(operators_max_waiting_B) & set(operators_min_calls_B)
print('The number of operators of lower effectiveness:', len(bad_operators_B))
The number of operators of lower effectiveness: 39
In [126]:
#Operators that are at least in one of the lists:

avg_effective_operators_B = list(set(operators_with_missed_B) | set(operators_max_waiting_B) \
                                 | set(operators_min_calls_B) |set(operators_max_internal_B))

print('The number of not very effective operators for tariff B:', len(avg_effective_operators_B))
The number of not very effective operators for tariff B: 297

for tariff C

In [127]:
#Operators that are in all 4 lists:
worst_operators_C = set(operators_with_missed_C) & set(operators_max_waiting_C)\
& set(operators_min_calls_C) &set(operators_max_internal_C)
print('The number of least effective operators for tariff C:', len(worst_operators_C))
The number of least effective operators for tariff C: 2
In [128]:
#Operators that are in 2 lists:
bad_operators_C = set(operators_max_waiting_C) & set(operators_min_calls_C)
print('The number of operators of lower effectiveness:', len(bad_operators_C))
The number of operators of lower effectiveness: 29
In [129]:
#Operators that are at least in one of the lists:

avg_effective_operators_C = list(set(operators_with_missed_C) | set(operators_max_waiting_C) \
                                 | set(operators_min_calls_C) |set(operators_max_internal_C))

print('The number of not very effective operators for tariff C:', len(avg_effective_operators_C))
The number of not very effective operators for tariff C: 287
In [130]:
#Creating a db for visualisation of results:
#A
data_op_tariff_A=[{'avg_effective_operators': len(avg_effective_operators_A), 'bad_operators':len(bad_operators_A), \
          'worst_operators': len(worst_operators_A)
                }]
operators_funnel_tariff_A = pd.DataFrame(data_op_tariff_A).T.reset_index()
operators_funnel_tariff_A.columns=['operators_groups', 'number']
operators_funnel_tariff_A.insert(0, "tariff", "A")

#B
data_op_tariff_B=[{'avg_effective_operators': len(avg_effective_operators_B), 'bad_operators':len(bad_operators_B), \
          'worst_operators': len(worst_operators_B)
                }]
operators_funnel_tariff_B = pd.DataFrame(data_op_tariff_B).T.reset_index()
operators_funnel_tariff_B.columns=['operators_groups', 'number']

operators_funnel_tariff_B.insert(0, "tariff", "B")

#C
data_op_tariff_C=[{'avg_effective_operators': len(avg_effective_operators_C), 'bad_operators':len(bad_operators_C), \
          'worst_operators': len(worst_operators_C)
                }]
operators_funnel_tariff_C = pd.DataFrame(data_op_tariff_C).T.reset_index()
operators_funnel_tariff_C.columns=['operators_groups', 'number']

operators_funnel_tariff_C.insert(0, "tariff", "C")
In [131]:
frames = [operators_funnel_tariff_A, operators_funnel_tariff_B, operators_funnel_tariff_C]

operators_funnel_tariffs = pd.concat(frames)
operators_funnel_tariffs
Out[131]:
tariff operators_groups number
0 A avg_effective_operators 234
1 A bad_operators 37
2 A worst_operators 10
0 B avg_effective_operators 297
1 B bad_operators 39
2 B worst_operators 6
0 C avg_effective_operators 287
1 C bad_operators 29
2 C worst_operators 2
In [132]:
#chart with 4 groups :

fig = go.Figure(go.Funnel(y = operators_funnel_tariffs.operators_groups, x = operators_funnel_tariffs.number,
                          opacity = 0.85, marker = {"color": ['#333160', '#5b5895', '#9467bd', '#3ac7c3' , '#72B7B2', '#c6f7e7', '#C9FBE5', 'teal', 'mediumturquoise', 'steelblue']}))
                      
fig.show()

Conclusion

In the course of the work we have defined 3 groups of operators with different degrees of inefficiency, overall and for every tariff plan:

  • the least effective operators - performed ineffectively in all 4 areas;
  • operators of lower effectiveness - performed ineffectively in 2 areas (longer waiting time and fewer phone calls);
  • operators of average effectiveness - performed ineffectively in at least one of the 4 areas.

The biggest share of ineffective operators (those who appear on all 4 lists) belongs to tariff plan A, and the smallest to tariff plan C.
On average, operators working with tariff plan A demonstrate less professional work than their colleagues. The data shows a large number of new users of tariff plan A joining the company in the middle of October. The company probably had to employ new operators for tariff A, or did not arrange enough training for the existing ones.
We would recommend the following:

  • monitor the performance of operators in the 'operators of average effectiveness' group;
  • arrange additional training for those in the 'operators of lower effectiveness' group;
  • provide additional training and supervised work for those in the 'least effective operators' group.

In the next part of the work we will test the hypothesis of whether the difference in operator performance between tariff plans is statistically significant or due to chance.

Part 3. Hypotheses testing

We are going to test the following hypothesis: the number of calls (incoming and outgoing) depends on the tariff plan.

Stating the hypotheses:
H0: the average number of calls per operator is equal for all tariff plans
H1: the average number of calls per operator differs between tariff plans

In [133]:
#Getting rid of outliers

#defining outliers with the 3-sigma method for all tariff plans
#as we are calculating std for a sample, ddof is set to 1

std_score_A = np.std(df_A['calls_count'], ddof=1)
three_sigma_score_A_lower = round((df_A['calls_count'].mean() - std_score_A*3),2)
#we do not use the lower fence as it would give negative call counts
three_sigma_score_A_upper = round((df_A['calls_count'].mean() + std_score_A*3),2)



std_score_B = np.std(df_B['calls_count'], ddof=1)
three_sigma_score_B_lower = round((df_B['calls_count'].mean() - std_score_B*3),2)
#we do not use the lower fence as it would give negative call counts
three_sigma_score_B_upper = round((df_B['calls_count'].mean() + std_score_B*3),2)


std_score_C = np.std(df_C['calls_count'], ddof=1)
three_sigma_score_C_lower = round((df_C['calls_count'].mean() - std_score_C*3),2)
#we do not use the lower fence as it would give negative call counts
three_sigma_score_C_upper = round((df_C['calls_count'].mean() + std_score_C*3),2)

print('99.7% of calls A are from 0 to ', three_sigma_score_A_upper,\
      '. \n99.7% of calls B are from 0 to ', three_sigma_score_B_upper,'.'\
     '\n99.7% of calls C are from 0 to ', three_sigma_score_C_upper,'.')
99.7% of calls A are from 0 to  289.84 . 
99.7% of calls B are from 0 to  59.98 .
99.7% of calls C are from 0 to  193.36 .
In [134]:
#setting df without outliers

df_A_no_outliers = df_A.query('calls_count<=@three_sigma_score_A_upper')

df_B_no_outliers = df_B.query('calls_count<=@three_sigma_score_B_upper')

df_C_no_outliers = df_C.query('calls_count<=@three_sigma_score_C_upper')
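The 3-sigma trimming used above can be wrapped in a small helper; a sketch on hypothetical call counts. Note the rule only flags a point when the sample is large enough: with very few observations, even an extreme value stays within three sample standard deviations of the mean.

```python
import pandas as pd

def three_sigma_upper(series: pd.Series) -> float:
    """Upper fence of the 3-sigma rule (sample std, ddof=1)."""
    return series.mean() + 3 * series.std()

# Hypothetical call counts: 29 ordinary records plus one extreme value.
counts = pd.Series([2] * 29 + [200])
fence = three_sigma_upper(counts)       # roughly 117 here
trimmed = counts[counts <= fence]       # the 200 is dropped
```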
In [135]:
#defining variance for the 3 tariffs to decide whether we can consider them equal for the t-test
variance_A = np.var(df_A_no_outliers['calls_count'])
variance_B = np.var(df_B_no_outliers['calls_count'])
variance_C = np.var(df_C_no_outliers['calls_count'])

print('Variance for A calls sample is ', variance_A,\
     '\nVariance for B calls sample is ', variance_B,\
     '\nVariance for C calls sample is ', variance_C,)
Variance for A calls sample is  1220.7079030525474 
Variance for B calls sample is  97.28893838599174 
Variance for C calls sample is  261.7183951236518

Variances vary greatly.

Testing whether the distribution is normal for the 3 groups

In [136]:
sample_A = df_A_no_outliers['calls_count'].values.tolist()
sample_B = df_B_no_outliers['calls_count'].values.tolist()
sample_C = df_C_no_outliers['calls_count'].values.tolist()
In [137]:
samples_list = [sample_A, sample_B, sample_C]
In [138]:
def check_normality(sample, alpha):
    results = st.shapiro(sample)
    p_value = results[1] # the second value in the array of results (with index 1) - the p-value
    print('p-value: ', p_value)
    if (p_value < alpha):
        print("Null hypothesis rejected: the distribution is not normal")
    else:
        print("Failed to reject the null hypothesis: the distribution seems to be normal") 
In [139]:
for i in samples_list:
    check_normality(i, alpha=0.05)
p-value:  0.0
Null hypothesis rejected: the distribution is not normal
p-value:  0.0
Null hypothesis rejected: the distribution is not normal
p-value:  0.0
Null hypothesis rejected: the distribution is not normal
C:\Users\HP\anaconda3\lib\site-packages\scipy\stats\morestats.py:1681: UserWarning:

p-value may not be accurate for N > 5000.

The calculations showed that the distribution is not normal for any of the samples.
Consequently we should choose a non-parametric test for testing the hypothesis (one that doesn't assume the data fits a specific distribution type). Non-parametric tests include the Wilcoxon signed-rank test, the Mann-Whitney U test and the Kruskal-Wallis test.
We are going to assess the statistical significance of the difference in the number of calls between the groups using the Mann-Whitney U test (the non-parametric counterpart of Student's t-test), because we have independent samples of quantitative variables and the data does not follow a normal distribution.
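As an illustration of the test just described, here is a self-contained toy run of SciPy's `mannwhitneyu` on two hypothetical right-skewed samples. `alternative='two-sided'` is passed explicitly, since older SciPy versions defaulted to a one-sided test:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)
# Two hypothetical right-skewed (exponential) samples with different scales.
sample_1 = rng.exponential(scale=1.0, size=200)
sample_2 = rng.exponential(scale=3.0, size=200)

# Non-parametric comparison: no normality assumption is needed.
stat, p_value = mannwhitneyu(sample_1, sample_2, alternative="two-sided")
# With this large a difference in scale, p_value is far below 0.05.
```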

As we are going to run 3 comparisons (A/B, B/C, A/C), we will adjust the significance level accordingly, using the Šidák method. Alpha is initially set to 0.05, a commonly accepted level in the industry.
Calculating alpha with the Šidák method (guarantees a family-wise error rate < ɑ).
The required significance level is found as follows:

In [140]:
alpha_sidak = 1 - (1 - 0.05)**(1/3) #we have 3 groups, which gives 3 pairwise comparisons (A/B, A/C, B/C)
round(alpha_sidak,4)
Out[140]:
0.017
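The one-off calculation above generalizes to any number of comparisons; a small sketch (the helper name `sidak_alpha` is ours, not from any library):

```python
def sidak_alpha(family_alpha: float, m: int) -> float:
    """Per-comparison significance level under the Sidak correction,
    keeping the family-wise error rate below family_alpha for m tests."""
    return 1 - (1 - family_alpha) ** (1 / m)

alpha_3 = sidak_alpha(0.05, 3)  # ~0.0170, matching the cell above
```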

Mann-Whitney U Test

In [141]:
from scipy.stats import mannwhitneyu
In [142]:
#forming 3 groups:
callsByOperators_A = df_A_no_outliers.groupby('operator_id', as_index=False).agg({'calls_count' : 'sum'})

callsByOperators_B = df_B_no_outliers.groupby('operator_id', as_index=False).agg({'calls_count' : 'sum'})

callsByOperators_C = df_C_no_outliers.groupby('operator_id', as_index=False).agg({'calls_count' : 'sum'})
In [143]:
outliersA = int(np.percentile(callsByOperators_A['calls_count'], [99]))
outliersA
Out[143]:
4893
In [144]:
outliersB = int(np.percentile(callsByOperators_B['calls_count'], [99]))
outliersB
Out[144]:
2174
In [145]:
outliersC = int(np.percentile(callsByOperators_C['calls_count'], [99]))
outliersC
Out[145]:
3804
In [146]:
#arranging the data in the right order for the Mann-Whitney U test
sample_op_A = callsByOperators_A.query('calls_count < @outliersA')['calls_count'].values.tolist() #values for A sample
sample_op_B = callsByOperators_B.query('calls_count < @outliersB')['calls_count'].values.tolist() #values for B sample
sample_op_C= callsByOperators_C.query('calls_count < @outliersC')['calls_count'].values.tolist() #values for C sample
In [147]:
#function for testing hypothesis
def check_hypothesis(group1,group2, alpha):
    p_value =  st.mannwhitneyu(group1, group2)[1]
    
    print('p-value: {0:.3f}'.format(p_value))

    if (p_value < alpha):
        print('We reject the null hypothesis for the two groups')
    else:
        print("We can't reject the null hypothesis for the two groups")  
In [148]:
#Comparison of number of calls for operators, tariff plans A and B 
check_hypothesis(sample_op_A,sample_op_B, alpha_sidak)
p-value: 0.000
We reject the null hypothesis for the two groups
In [149]:
#Comparison of number of calls for operators, tariff plans A and C 
check_hypothesis(sample_op_A,sample_op_C, alpha_sidak)
p-value: 0.000
We reject the null hypothesis for the two groups
In [150]:
#Comparison of number of calls for operators, tariff plans B and C 
check_hypothesis(sample_op_B,sample_op_C, alpha_sidak)
p-value: 0.310
We can't reject the null hypothesis for the two groups

The data provides sufficient evidence, given the significance level we selected (alpha_sidak = 1.7%), to reject the null hypothesis for groups A and B and for groups A and C. Therefore we can conclude that the average number of calls made and received by operators on tariff A differs from tariff B (μA != μB) and from tariff C (μA != μC).
At the same time, the data does not provide sufficient evidence, at the 1.7% significance level, to reject the null hypothesis for groups B and C. Therefore we cannot conclude that the average numbers of calls made and received by operators on tariffs B and C differ (μB = μC cannot be rejected).

We will plot the charts to visualize the calculations.

In [151]:
fig = plt.figure(figsize= (10, 10))
ax = fig.add_subplot(111)

ax.set_title("Box Plot of calls by tariff plans", fontsize= 20)

data = [callsByOperators_A['calls_count'],
        callsByOperators_B['calls_count'],
        callsByOperators_C['calls_count']]

ax.boxplot(data,
           labels= ['A', 'B', 'C'],
           showmeans= True)

plt.xlabel("Tariff plan")
plt.ylabel("Number of calls by operators")

plt.show()

The boxplots support the statistical findings:
the distributions of the number of calls for tariff A differ noticeably from those for tariffs B and C,
while there is no statistically significant difference between groups B and C.

T-test

The t-test is unreliable for small samples from non-normal distributions, but it is valid for large samples from non-normal distributions. Thus we are going to run the t-test on our samples, as they are big enough (>50 observations each).
We will test the difference for every group overall to cross-check the previous calculations with the t-test.

In [152]:
#forming a table
samplesA_B_C = df_tariffs.pivot_table(index='date', values='calls_count', columns='tariff_plan', aggfunc='sum').reset_index().fillna(0)
samplesA_B_C.columns = ['date', 'A', 'B', 'C']

samplesA_B_C.head()
Out[152]:
date A B C
0 2019-08-02 0.0 19.0 0.0
1 2019-08-03 0.0 27.0 0.0
2 2019-08-04 0.0 4.0 0.0
3 2019-08-05 0.0 175.0 0.0
4 2019-08-06 0.0 148.0 1.0

Defining outliers (the 99th-percentile threshold of the number of calls)

In [153]:
outliersA_t = int(np.percentile(samplesA_B_C['A'], [99]))
outliersA_t
Out[153]:
7301
In [154]:
outliersB_t = int(np.percentile(samplesA_B_C['B'], [99]))
outliersB_t
Out[154]:
2890
In [155]:
outliersC_t = int(np.percentile(samplesA_B_C['C'], [99]))
outliersC_t
Out[155]:
4486

Forming samples

In [156]:
sample_A_t = samplesA_B_C.query('A < @outliersA_t')['A'].values.tolist()
In [157]:
sample_B_t = samplesA_B_C.query('B < @outliersB_t')['B'].values.tolist()
In [158]:
sample_C_t = samplesA_B_C.query('C < @outliersC_t')['C'].values.tolist()

Checking variance

In [159]:
#defining variance for the 3 tariffs to decide whether we can consider them equal for the t-test
variance_A_t = np.var(sample_A_t)
variance_B_t = np.var(sample_B_t)
variance_C_t = np.var(sample_C_t)

print('Variance for A sample is ', variance_A_t.round(1),\
     '\nVariance for B sample is ', variance_B_t.round(1),\
     '\nVariance for C sample is ', variance_C_t.round(1))
Variance for A sample is  4003966.7 
Variance for B sample is  825107.2 
Variance for C sample is  1507358.7

Variances vary greatly.

Testing hypotheses

In [160]:
results_A_B = st.ttest_ind(
    sample_A_t, 
    sample_B_t, equal_var = False)

results_A_C = st.ttest_ind(
    sample_A_t, 
    sample_C_t, equal_var = False)

results_C_B = st.ttest_ind(
    sample_C_t, 
    sample_B_t, equal_var = False)


#we are going to use alpha_sidak here as there were 3 comparisons made

print('p-value to compare the groups A and B: ', results_A_B.pvalue)
print('p-value to compare the groups A and C: ', results_A_C.pvalue)
print('p-value to compare the groups C and B: ', results_C_B.pvalue)



if (results_A_B.pvalue < alpha_sidak):
    print("Null hypothesis rejected for groups A and B")
else:
    print("Null hypothesis not rejected for groups A and B")
    
if (results_A_C.pvalue < alpha_sidak):
    print("Null hypothesis rejected for groups A and C")
else:
    print("Null hypothesis not rejected for groups A and C")
    
if (results_C_B.pvalue < alpha_sidak):
    print("Null hypothesis rejected for groups C and B")
else:
    print("Null hypothesis not rejected for groups C and B")
p-value to compare the groups A and B:  4.440736641361894e-15
p-value to compare the groups A and C:  1.6935552051637212e-10
p-value to compare the groups C and B:  0.03714718957914091
Null hypothesis rejected for groups A and B
Null hypothesis rejected for groups A and C
Null hypothesis not rejected for groups C and B
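The comparisons above rely on `alpha_sidak`, defined earlier in the notebook. As a sanity check, here is a minimal sketch of how the Šidák correction for three pairwise comparisons is computed; a 5% family-wise error rate is assumed:

```python
# Šidák correction: with n pairwise comparisons, the per-comparison
# alpha that keeps the family-wise error rate at alpha_family is
# 1 - (1 - alpha_family)^(1/n)
alpha_family = 0.05   # assumed family-wise error rate
n_comparisons = 3     # A/B, A/C, C/B

alpha_sidak = 1 - (1 - alpha_family) ** (1 / n_comparisons)
print(round(alpha_sidak, 4))  # -> 0.017, i.e. the ~1.7% significance level
```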
In [161]:
fig = plt.figure(figsize= (8, 8))
ax = fig.add_subplot(111)

ax.set_title("Box Plot of calls by tariff plans", fontsize= 20)

data = [samplesA_B_C['A'],
        samplesA_B_C['B'],
        samplesA_B_C['C']]

ax.boxplot(data,
           labels= ['A', 'B', 'C'],
           showmeans= True)

plt.xlabel("Tariff plan")
plt.ylabel("Number of calls")

plt.show()

The test showed that we cannot reject the null hypothesis for groups B and C; therefore, we conclude that the average number of calls for tariff plans B and C does not differ significantly (μB = μC).
However, we can reject it for the A/B and A/C comparisons: the data provides sufficient evidence, at the chosen significance level (1.7%), to reject the null hypothesis.
Therefore, we conclude that the average number of calls for tariff plan A differs from those for tariff plans B and C (μA != μB and μA != μC).
Consequently, the difference in operators' performance is statistically significant rather than due to chance: the work of operators on tariff A differs from the work of operators on the other tariffs.

Part 4. Overall conclusion

In the course of this work we studied data on the call company's clients and operators:

  • data on calls from 2019-08-02 to 2019-11-28;
  • 307 unique users;
  • 1092 unique operators;
  • tariff plans A, B and C.

We preprocessed the data:

  • dropped missing values,
  • dropped duplicated rows,
  • converted data types and reduced memory usage.

Having studied the given features (number of missed incoming calls, waiting time for incoming calls (average speed of answer), number of outgoing calls, number of internal calls), we generated 4 lists of operators that performed worse than their colleagues on average, using the z-score method to define the thresholds. We then combined the 4 lists into 3 groups of operators:

  • the least effective operators, who performed ineffectively in all 4 areas;
  • operators of lower effectiveness, who performed ineffectively in 2 areas (longer speed of answer and fewer phone calls were taken as the main criteria);
  • operators of average effectiveness, who performed ineffectively in one of the 4 areas.
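The z-score thresholding mentioned above can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the toy DataFrame, the column names (`operator_id`, `missed_calls`) and the threshold of 1 standard deviation are assumptions for the example.

```python
import pandas as pd

# toy per-operator metric; the real notebook aggregates metrics per operator first
ops = pd.DataFrame({
    'operator_id': [880020, 880021, 880022, 880023, 880024],
    'missed_calls': [2, 3, 1, 25, 4],
})

# standardize the metric: how many standard deviations each operator
# is away from the mean
mean = ops['missed_calls'].mean()
std = ops['missed_calls'].std()
ops['z_score'] = (ops['missed_calls'] - mean) / std

# flag operators more than 1 standard deviation above the mean
ineffective = ops.query('z_score > 1')['operator_id'].tolist()
print(ineffective)  # -> [880023]
```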

Likewise, we created such groups for every tariff plan. Analyzing operator performance across tariff plans, we found that the share of underperforming operators is highest for tariff A and lowest for tariff C. Tariff A has the smallest number of outgoing calls, the smallest number of answered calls and the largest share of internal calls. In general, operators working on plan A demonstrate a lower professional level than their colleagues working on the other tariff plans.

We tested the hypothesis that the number of calls depends on the tariff plan. The calculations show that, at the chosen significance level (1.7%), there is a statistically significant difference in the number of calls for tariff A, while there is no statistically significant difference between tariffs B and C. Presumably, because of the large number of new tariff A users who joined the company in mid-October, the company did not allocate enough operators or provide enough training.

The results of the study can be used further for labeling, supervised learning and developing a model that detects ineffective operators. This would automate the process of identifying ineffective operators in the future.

General recommendations:

- additional training for operators working on tariff A (more insight into tariff A specifics);
- additional motivation for operators working on tariff A (bonus system, incentives, etc.);
- monitoring the performance of operators in the 'operators of average effectiveness' group;
- additional training for those in the 'operators of lower effectiveness' group;
- additional training and supervised work for those in the 'least effective operators' group;
- rewards and incentives for those in the 'good operators' group.

To conclude, we would recommend shifting away from defining operators' effectiveness through goals such as handling a certain number of calls within a designated period of time. This puts a lot of pressure on operators and does not lead to client satisfaction in the long run. Today's effectiveness is better measured by the issue-resolution rate and the retention rate. Resolving issues and satisfying customers can be a better motivation for operators than reaching a large number of calls.

Tableau dashboard

Presentation

In [ ]: