The aim of the project is to study the effectiveness of phone operators of the virtual telephony service CallMeMaybe.
To define ineffective operators we will study the following features:
We will also study these features for every tariff plan the company has (A, B, C) and define ineffective operators for each of these plans.
We will test whether the difference in operator performance between tariff plans is statistically significant.
Finally, we will draw basic conclusions and develop recommendations on how to improve operator performance.
Part 1. Data preprocessing
Reading the data base
Checking and changing the data types
Checking for missing values
Checking for duplicated data
Conclusion
Part 2. Exploratory data analysis (EDA)
Correlation matrix
- missed incoming calls
- waiting time
- outgoing calls
- internal calls
Ineffective Operators
Studying the features for different tariff plans
Parameters of efficiency by tariff plans
List of ineffective operators by tariff plan
Conclusion
Part 3. Hypotheses testing
Mann-Whitney U Test
T-test
Part 4. Overall conclusion
Tableau dashboard
Presentation
Sources
import pandas as pd
from IPython.display import display
import plotly.express as px
from plotly import graph_objects as go
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
from numpy import median
from scipy import stats as st
import seaborn as sns
import math
from plotly.subplots import make_subplots
#adjustments of the path for the Practicum platform:
#df=pd.read_csv('/telecom_clients_us.csv')
df_cl = pd.read_csv('./telecom_clients_us.csv')
display(df_cl.head())
#adjustments of the path for the Practicum platform:
#df=pd.read_csv('/telecom_dataset_us.csv')
df_dataset = pd.read_csv('./telecom_dataset_us.csv')
display(df_dataset.head())
df_cl.info(memory_usage='deep')
#changing data type
df_cl['date_start'] = pd.to_datetime(df_cl['date_start'])
df_cl.info(memory_usage='deep')
df_dataset.info(memory_usage='deep')
#'date' column
#changing data type
df_dataset['date'] = pd.to_datetime(df_dataset['date'], format=('%Y-%m-%d %H:%M:%S'))
#getting rid of time zones
df_dataset['date']= df_dataset['date'].dt.tz_localize(None)
df_dataset['internal'] = df_dataset['internal'].astype(bool)
df_dataset.info(memory_usage='deep')
Changing the data type of the columns (object to datetime and object to boolean) has significantly reduced memory usage: from 58.0 KB to 35.8 KB and from 7.3 MB to 4.2 MB for the two data sets. After dealing with the missing values we will also change the type of the 'operator_id' column.
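The saving comes from replacing Python-object columns with fixed-width dtypes. A minimal sketch on toy data (hypothetical values, not the real dataset) illustrating the effect:

```python
import pandas as pd

# Toy frame mimicking the structure of df_dataset (hypothetical values)
toy = pd.DataFrame({
    'date': ['2019-08-01 00:00:00'] * 1000,   # stored as object (strings)
    'internal': ['True', 'False'] * 500,      # stored as object (strings)
})
before = toy.memory_usage(deep=True).sum()

toy['date'] = pd.to_datetime(toy['date'])     # object -> datetime64[ns], 8 bytes per value
toy['internal'] = toy['internal'] == 'True'   # object -> bool, 1 byte per value

after = toy.memory_usage(deep=True).sum()
print(after < before)  # True: fixed-width dtypes take far less memory than object columns
```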
# checking for missing values in the data set
df_dataset.isnull().sum()
df_dataset['operator_id'].value_counts()
df_dataset['user_id'].nunique()
df_cl['tariff_plan'].value_counts()
df_cl.isnull().sum()
df_dataset.shape
df_dataset.groupby(['date']).size().plot(
    kind='bar',
    grid=True,
    title='Number of records per day',
    figsize=(25,5)
)
plt.show()
We can see a clear weekly pattern: the amount of data collected on Saturdays and Sundays is significantly smaller than on other days of the week.
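The weekly pattern can be made explicit by aggregating by day of week rather than by date. A minimal sketch on toy data (hypothetical dates and counts, not the real dataset):

```python
import pandas as pd

# Hypothetical mini-sample: more records on a weekday (Thursday) than on the weekend
sample = pd.DataFrame({'date': pd.to_datetime(
    ['2019-08-01'] * 8 +   # Thursday
    ['2019-08-03'] * 3     # Saturday
)})

# Aggregating by day name makes the weekday/weekend gap explicit
by_weekday = sample.groupby(sample['date'].dt.day_name()).size()
print(by_weekday['Thursday'] > by_weekday['Saturday'])  # True
```

The same groupby on the real df_dataset would confirm the Saturday/Sunday dip numerically.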
Studying missing values in 'operator_id'.
df_dataset.groupby(['user_id'])['operator_id'].nunique().reset_index().sort_values(by='operator_id', ascending=False)
According to the table, users can have from 0 to 50 different operators.
users_operators = df_dataset.groupby(['user_id'])['operator_id'].nunique().reset_index()\
.groupby(['operator_id'])['user_id'].nunique().reset_index()
users_operators.columns = ['operators_cnt', 'users_cnt']
users_operators.head(3)
users_operators.plot(x='operators_cnt', y='users_cnt', kind='bar', grid=True, figsize=(20,5), cmap='Set2', \
title='Number of unique users per operators')
plt.show()
17 unique users are assigned to no operator;
1 operator has 107 unique users;
2 operators have 63 unique users;
50 operators have 1 unique user.
#Calculating the 95th and 99th percentiles for the number of users.
np.percentile(users_operators['users_cnt'], [95, 99])
Only 5% of operators have more than 60 unique users; only 1% of operators have more than 97 unique users.
#calculating percentage of missing values:
df_dataset.isnull().sum()/len(df_dataset)*100
The percentage of missing values in 'operator_id' is 15%. As one user can have more than 1 operator (the number of operators per user varies from 1 to 50), we cannot use the transform() method grouping by users to fill the missing values. We will have to drop this part of the data, as the missing values cannot be restored.
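The ambiguity can be seen on a toy example (hypothetical IDs, not from the real dataset): when a user's calls are handled by several operators, a group-wise fill has no single correct value to choose.

```python
import pandas as pd
import numpy as np

# Hypothetical mini-sample: one user handled by two operators, plus a row with a missing ID
calls = pd.DataFrame({
    'user_id': [1, 1, 1],
    'operator_id': [900001.0, 900002.0, np.nan],
})

# A transform-based fill would have to pick one of the group's operators arbitrarily:
candidates = calls.groupby('user_id')['operator_id'].nunique().iloc[0]
print(candidates)  # 2 -> no unambiguous value to restore, so such rows are dropped
```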
df_dataset = df_dataset.dropna(axis=0, subset=['operator_id'])
df_dataset.isnull().sum()/len(df_dataset)*100
df_dataset['internal'].value_counts()
As we now have only a minor share of missing values in the 'internal' column (<2% of the data set), we can drop them and proceed with the data.
df_dataset = df_dataset.dropna(axis=0, subset=['internal'])
Now we can change the data type
# changing type of the 'operator_id' column:
df_dataset['operator_id']= df_dataset['operator_id'].astype(int)
df_dataset.sample()
print('There are', df_dataset['operator_id'].nunique(), 'unique operators in the amended data set.')
print('There are', df_dataset['user_id'].nunique(), 'unique users in the amended data set.')
#Checking data for zeros:
for i in df_dataset.columns:
    print(i, len(df_dataset[df_dataset[i]==0]))
#Checking data for zeros:
for i in df_cl.columns:
    print(i, len(df_cl[df_cl[i]==0]))
We have a lot of zeros for internal calls and missed calls, as well as for call duration. In the second data set there are no zeros at all.
#Checking for duplicated data
duplicated1=df_dataset.duplicated().sum()
print('There are', duplicated1, 'duplicates in the general data set.')
#Checking for duplicated data
duplicated2=df_cl.duplicated().sum()
print('There are', duplicated2, 'duplicates in the additional data set (telecom_clients).')
#checking what kind of duplicates we have
duplicated_rows = df_dataset[df_dataset.duplicated()]
duplicated_rows.sample(15)
#checking the dates when the data was duplicated:
duplicated_rows['date'].dt.date.unique()
print('General data represents data for the period from', df_dataset['date'].min(), 'till', df_dataset['date'].max(), \
'.\nThe duplicated rows came across the data set from the period', duplicated_rows['date'].min(), 'till', \
duplicated_rows['date'].max(), '.' )
It means that there are duplicates throughout the whole period. Presumably that might have happened due to some server problems that continuously produced duplicated data.
#checking the operators whose data was duplicated:
duplicated_rows['operator_id'].value_counts()
#checking the scope of the duplicated data across columns:
for i in duplicated_rows:
    print(i, ":", duplicated_rows[i].nunique())
df_dataset.query('operator_id==891410')
So far we cannot see any pattern in the duplicates. They cover different events, different dates (114 days from 05.08.2019 till 21.09.2019), and different operators (775 unique operators have duplicates in their data).
# length before dropping duplicates
length1= len(df_dataset)
# dropping duplicate values
df_dataset.drop_duplicates(keep=False,inplace=True)
# length after removing duplicates
length2=len(df_dataset)
# printing all data frame lengths
print('Length before dropping duplicates:', length1, '\nLength after dropping duplicates:', length2)
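Note that drop_duplicates(keep=False) removes every row that occurs more than once, including the first occurrence, while keep='first' retains one copy of each distinct row. A minimal illustration:

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 1, 2]})
print(len(toy.drop_duplicates(keep=False)))    # 1 -- both copies of the duplicated row are dropped
print(len(toy.drop_duplicates(keep='first')))  # 2 -- one copy of each distinct row is kept
```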
print('Final data includes: \nunique users:', df_dataset['user_id'].nunique(),\
'\nunique operators:', df_dataset['operator_id'].nunique())
The data for analysis is split between 2 tables: telecom_dataset_us and telecom_clients. The first table is the main one. It contains:
As one user can have more than 1 operator (the number of operators per user varies from 1 to 50), we could not use the transform() method grouping by users to fill the missing values. We had no other option but to drop this part of the data, as the missing values could not be restored.
As only a minor share of values in the 'internal' column was missing (<2% of the data set), we could drop those rows without skewing the data.
We have determined that there are 4179 duplicates in the general data set. Studying the duplicated rows, we have not found any pattern: there are duplicates throughout the whole period. Presumably this might have happened due to server problems that continuously produced duplicated data.
All in all during preprocessing we have:
#statistical summary for numerical variables
df_dataset.drop(['user_id', 'operator_id'], axis = 1).describe().round(2)
df_dataset['direction'].value_counts()
#defining correlation between the features
corr = df_dataset.drop(['user_id', 'operator_id'] , axis = 1).corr()
plt.figure(figsize=(10,8))
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, annot=True,
cmap=sns.diverging_palette(220, 10, as_cmap=True))
plt.show()
The matrix shows that we have 2 pairs of intercorrelated features: call_duration with total_call_duration and calls_count with total_call_duration. We are going to replace the call_duration and total_call_duration features with the average waiting time. This way we will get a set of 4 uncorrelated features and will set them as the criteria of effectiveness.
We will define ineffective operators according to the 1st parameter: the number of missed calls.
missed_calls = df_dataset.query('direction=="in"').groupby('operator_id')['is_missed_call'].sum().reset_index().sort_values(by='is_missed_call', ascending=False)
missed_calls['is_missed_call'] = missed_calls['is_missed_call'].astype(int)
missed_calls
plt.figure(figsize=(15,3))
missed_calls['is_missed_call'].plot.hist(density=True, bins= 27, color='#cce1e1', log=True)
plt.title('Number of missed calls')
plt.show()
# box-and-whisker plot
plt.figure(figsize=(15,3))
ax = sns.boxplot(x="is_missed_call", data=missed_calls, color='#cce1e1')
ax.set(xlabel='number of missed calls')
plt.show()
#Calculating the 95th and 99th percentiles for the number of missed calls (to identify outliers).
np.percentile(missed_calls['is_missed_call'], [95, 99])
missed_calls['is_missed_call'].describe()
#operators with missed incoming calls:
operators_with_missed = missed_calls.query('is_missed_call>0')['operator_id']
operators_with_missed.values.tolist()
operators_with_missed.shape
In general, 739 operators have from 0 to 27 missed calls during the period. The middle 50% of them have from 0 to 1 missed call.
Only 5% of operators have over 4 missed calls, and only 1% of operators have over 11 missed calls.
509 out of the 739 operators have no missed incoming calls.
We have received a list of 230 operator IDs that had missed calls during this period.
waiting_calls = df_dataset.query('direction=="in"')
#and is_missed_call==True')
waiting_calls.head()
waiting_calls_operators = df_dataset.query('direction=="in"').groupby('operator_id')\
.agg({'calls_count': 'sum', 'call_duration': 'sum', 'total_call_duration': 'sum'})\
.reset_index().sort_values(by='calls_count', ascending=False)
waiting_calls_operators.sample()
waiting_calls_operators['waiting_time'] = waiting_calls_operators['total_call_duration'] - waiting_calls_operators['call_duration']
waiting_calls_operators['avg_waiting_time'] = waiting_calls_operators['waiting_time']/waiting_calls_operators['calls_count']
waiting_calls_operators.sample()
waiting_calls_operators['avg_waiting_time'].describe()
plt.figure(figsize=(15,3))
waiting_calls_operators['avg_waiting_time'].plot.hist(density=True, bins= 50, color='#cce1e1', log=True)
plt.title('Distribution of average waiting time among operators')
plt.xlabel('Waiting time, sec.')
plt.show()
The distribution of average waiting time is skewed, with a long tail to the right. There are lots of outliers, but most of the distribution lies within 1 minute. Average waiting time varies greatly, from 1 sec to 115 sec. The median value is 14 sec with a standard deviation of 12. The middle 50% of operators keep callers waiting, on average, from 9 to 22 seconds.
We will define how successful every operator is in relation to the mean of his/her average waiting time. To do that we will calculate the z-score for the distribution, a numerical measurement that describes a value's relationship to the mean of a group of values. A z-score of 0 indicates that an operator's average waiting time is identical to the mean.
#calculating z-score for average waiting time to examine the distribution
waiting_calls_operators['waiting_time_z']=(waiting_calls_operators['avg_waiting_time']-waiting_calls_operators['avg_waiting_time'].mean())/waiting_calls_operators['avg_waiting_time'].std()
waiting_calls_operators['operator_id'] = waiting_calls_operators['operator_id'].astype(str)
waiting_calls_operators.head(3)
Now we will plot part of the distribution (operators with z-score close to 0) to illustrate the calculations.
#distinguishing z-score by colors: green for negative values (shorter waiting time), red for positive ones (longer waiting time)
waiting_calls_operators_z = waiting_calls_operators.query('-0.05 <=waiting_time_z<=0.08')\
.sort_values(by='avg_waiting_time', ascending=False).reset_index(drop=True)
waiting_calls_operators_z['colors'] = ['teal' if x<0 else 'red' for x in waiting_calls_operators_z['waiting_time_z']]
waiting_calls_operators_z['colors']
plt.figure(figsize=(14,10))
plt.hlines(y=waiting_calls_operators_z.operator_id, xmin=0, xmax=waiting_calls_operators_z.waiting_time_z, colors=waiting_calls_operators_z.colors, alpha=0.4, linewidth=10)
plt.title('Operators average waiting time relation to mean value of the waiting time \n(z-score)', fontsize=18)
plt.xlabel('z-score', fontsize=18)
plt.ylabel('Operators IDs', fontsize=16)
plt.show()
Now we can create a list of the least successful operator IDs, those with the longest average waiting time (positive z-score):
operators_max_waiting = waiting_calls_operators.query('waiting_time_z>=0')['operator_id']
operators_max_waiting = operators_max_waiting.astype(int).values.tolist()
operators_max_waiting
len(operators_max_waiting)
This way we have received 270 operator IDs with a maximum average waiting time (greater than or equal to the mean).
A small number of outgoing calls is a sign of an ineffective operator.
outgoing_calls = df_dataset.query('direction=="out"').groupby('operator_id')['calls_count'].sum().reset_index().sort_values(by='calls_count', ascending=False)
outgoing_calls.sample()
plt.figure(figsize=(15,3))
outgoing_calls['calls_count'].plot.hist(density=True, bins= 50, color='#cce1e1', log=True)
plt.title('Number of outgoing calls')
plt.xlabel('number of calls')
plt.show()
# box-and-whisker plot
plt.figure(figsize=(15,3))
ax = sns.boxplot(x="calls_count", data=outgoing_calls, color='#cce1e1')
ax.set(xlabel='number of outgoing calls')
plt.show()
outgoing_calls['calls_count'].describe()
#Calculating the 95th and 99th percentiles for the number of outgoing calls (to identify outliers).
np.percentile(outgoing_calls['calls_count'], [95, 99])
In general, 873 operators have from 1 to 53301 outgoing calls during the period. The middle 50% of them have from 10 to 533 outgoing calls, with a median value of 84 calls.
Only 5% of operators have over 2153 outgoing calls, and only 1% of operators have over 4732 outgoing calls.
We will define how successful every operator is in relation to the mean number of calls performed. To do that we will calculate the z-score for the distribution, a numerical measurement that describes a value's relationship to the mean of a group of values. A z-score of 0 indicates that an operator's number of calls is identical to the mean.
#calculating z-score for the number of outgoing calls to examine the distribution
outgoing_calls['calls_count_z']=(outgoing_calls['calls_count']-outgoing_calls['calls_count'].mean())/outgoing_calls['calls_count'].std()
outgoing_calls['operator_id'] = outgoing_calls['operator_id'].astype(str)
outgoing_calls.head(3)
Now we will plot part of the distribution (operators with z-score close to 0) to illustrate the calculations.
#distinguishing z-score by colors: red for negative values, green for positive ones
outgoing_calls_z = outgoing_calls.query('-0.03 <=calls_count_z<=0.05').sort_values(by='calls_count', ascending=False).reset_index(drop=True)
outgoing_calls_z['colors'] = ['red' if x<0 else 'teal' for x in outgoing_calls_z['calls_count_z']]
outgoing_calls_z['colors']
plt.figure(figsize=(14,10))
plt.hlines(y=outgoing_calls_z.operator_id, xmin=0, xmax=outgoing_calls_z.calls_count_z, colors=outgoing_calls_z.colors, alpha=0.4, linewidth=10)
plt.title('Operators calls number relation to mean value of the call number \n(z-score)', fontsize=18)
plt.xlabel('z-score', fontsize=18)
plt.ylabel('Operators IDs', fontsize=16)
plt.show()
Now we can create a list of the least effective operator IDs, those that make the smallest number of outgoing calls (negative z-score):
operators_min_calls = outgoing_calls.query('calls_count_z<=0')['operator_id']
operators_min_calls = operators_min_calls.astype(int).values.tolist()
operators_min_calls
len(operators_min_calls)
This way we have received 674 operator IDs that make fewer outgoing calls than the average.
The fewer internal calls an operator makes, the more effective he or she is considered to be.
internal_calls = df_dataset.query('internal==True and direction=="out"').groupby('operator_id')['calls_count']\
.sum().reset_index().sort_values(by='calls_count', ascending=False)
internal_calls.head()
plt.figure(figsize=(15,3))
internal_calls['calls_count'].plot.hist(density=True, bins= 50, color='#cce1e1', log=True)
plt.title('Number of internal calls')
plt.show()
# box-and-whisker plot
plt.figure(figsize=(15,3))
ax = sns.boxplot(x="calls_count", data=internal_calls, color='#cce1e1')
ax.set(xlabel='number of internal calls')
plt.show()
internal_calls['calls_count'].describe()
#Calculating the 95th and 99th percentiles for the number of internal calls (to identify outliers).
np.percentile(internal_calls['calls_count'], [95, 99])
In general, 486 operators had from 1 to 393 internal outgoing calls during the period. The middle 50% of them had from 2 to 18 internal calls, with a median of 5 calls.
Only 5% of operators had over 135 internal calls, and 1% of operators had over 280 internal calls.
We will define how effective every operator is in relation to his/her number of internal calls. To do that we will calculate the z-score for the distribution, a numerical measurement that describes a value's relationship to the mean of a group of values. A z-score of 0 indicates that an operator's number of internal calls is identical to the mean.
#calculating z-score for the number of internal calls to examine the distribution
internal_calls['internal_calls_z']=(internal_calls['calls_count']-internal_calls['calls_count'].mean())/internal_calls['calls_count'].std()
internal_calls['operator_id'] = internal_calls['operator_id'].astype(str)
internal_calls.head(3)
Now we will plot part of the distribution (operators with z-score close to 0) to illustrate the calculations.
#distinguishing z-score by colors: green for negative values (fewer internal calls), red for positive ones (more internal calls)
internal_calls_z = internal_calls.query('-0.15 <=internal_calls_z<=0.2')\
.sort_values(by='internal_calls_z', ascending=False).reset_index(drop=True)
internal_calls_z['colors'] = ['teal' if x<0 else 'red' for x in internal_calls_z['internal_calls_z']]
internal_calls_z['colors']
plt.figure(figsize=(14,10))
plt.hlines(y=internal_calls_z.operator_id, xmin=0, xmax=internal_calls_z.internal_calls_z, colors=internal_calls_z.colors, alpha=0.4, linewidth=10)
plt.title('Operators internal calls relation to mean value of the internal calls \n(z-score)', fontsize=18)
plt.xlabel('z-score', fontsize=18)
plt.ylabel('Operators IDs', fontsize=16)
plt.show()
Now we can create a list of the least effective operator IDs, those with the biggest number of internal outgoing calls (positive z-score):
operators_max_internal = internal_calls.query('internal_calls_z>=0')['operator_id']
operators_max_internal = operators_max_internal.astype(int).values.tolist()
operators_max_internal
len(operators_max_internal)
This way we have received 99 operator IDs with a maximum number of internal calls (greater than or equal to the mean).
All in all, we have 4 lists of operator IDs that perform ineffectively in 4 spheres:
In the next step we will group these IDs into 3 groups.
Defining operators that performed poorly in all 4 spheres (their IDs got into each of the 4 lists)
#Operators that are in all 4 lists:
worst_operators = set(operators_with_missed) & set(operators_max_waiting)\
& set(operators_min_calls) &set(operators_max_internal)
print('The number of least effective operators:', len(worst_operators))
#IDs
display(worst_operators)
There are 17 least effective operators: they have missed calls, their waiting time is above average, their number of outgoing calls is below average, and they make a lot of calls to their colleagues.
Generally, such performance might be caused by a lack of experience and qualification.
These operators probably need additional training if they have already worked for the company long enough. If they are newcomers, supervision and professional guidance might help improve their performance.
Defining operators that performed poorly in 2 out of 4 spheres (their IDs got into 2 lists). We think that the most important criteria of effectiveness are the number of outgoing calls and the waiting time.
We suppose that missed calls do not define the effectiveness of an operator much, because operators tend not to miss calls at all: the middle 50% of them have from 0 to 1 missed call, and only 5% have over 4 missed calls. This cannot be crucial for the current analysis.
As for internal calls, they can also be omitted at this stage. There are only 99 operators who make them often, and most of them could be newcomers.
We will define operators that got into these two lists of ineffectiveness (number of outgoing calls and waiting time):
#Operators that are in 2 lists:
bad_operators = set(operators_max_waiting) & set(operators_min_calls)
print('The number of operators of lower effectiveness:', len(bad_operators))
Defining operators that performed poorly in 1 of 4 spheres (their IDs got into 1 of 4 lists)
#Operators that are at least in one of the lists:
#joining the 4 lists
avg_effective_operators = list(set(operators_with_missed) | set(operators_max_waiting) | set(operators_min_calls) |set(operators_max_internal))
len(avg_effective_operators)
display(avg_effective_operators)
#all_operators= df_dataset.groupby('operator_id')['is_missed_call'].sum().reset_index().sort_values(by='is_missed_call', ascending=False)
#missed_calls['is_missed_call'] = missed_calls['is_missed_call'].astype(int)
#missed_calls
#list of all operators:
all_operators = df_dataset['operator_id'].unique()
all_operators = all_operators.tolist()
#operators that never got to any of the lists:
good_operators = list(set(all_operators) - set(avg_effective_operators))
good_operators
len(good_operators)
There are 225 operators that did not get into any of the lists; they can be considered the most effective ones.
#Creating a db for visualisation of results:
data_op=[{'avg_effective_operators': len(avg_effective_operators), 'bad_operators':len(bad_operators), \
'worst_operators': len(worst_operators)
}]
operators_funnel = pd.DataFrame(data_op).T.reset_index()
operators_funnel.columns=['operators_groups', 'operators_number']
operators_funnel
#funnel chart of the operator groups:
fig = go.Figure(go.Funnel(y = operators_funnel.operators_groups, x = operators_funnel.operators_number,
opacity = 0.85, marker = {"color": ['#333160', '#5b5895', '#9467bd', '#3ac7c3' , '#72B7B2', '#c6f7e7', '#C9FBE5']}))
fig.show()
Combining the 4 lists of operators with the least effective performance in one of the spheres, we have received 3 groups:
#general db (merged)
df = df_dataset.merge(df_cl, on='user_id')
df.head(3)
#Share of missed calls per tariff
tariffs_missed_calls = df.groupby('tariff_plan')['is_missed_call']\
.mean().reset_index().sort_values(by='is_missed_call', ascending=False)\
.rename({'is_missed_call': 'missed_calls_share'}, axis=1)
tariffs_missed_calls
#Share of internal calls per tariff
tariffs_internal_calls = df.groupby('tariff_plan')['internal'].mean().reset_index()
tariffs_internal_calls
#db for graphical interpretation of the features distribution
df_graphs = df
#db ajustment for graphical interpretation
df_graphs['internal'] = df_graphs['internal'].replace(to_replace=True, value ="internal").replace(to_replace=False, value ="external")
df_graphs['is_missed_call'] = df_graphs['is_missed_call'].replace(to_replace=True, value ="missed_call").replace(to_replace=False, value ="answered")
df_graphs.insert(0, "count", 1)
df_graphs.head()
#plotting features distribution
parameters_hist = ['tariff_plan', 'direction', 'internal', 'is_missed_call']
parameters_dis = ['date_start']
for x in parameters_hist:
    plt.figure(figsize=(18,5))
    ax = sns.barplot(x=x, y="count", data=df_graphs, hue='tariff_plan', estimator=sum, palette="mako")
    for p in ax.patches:
        ax.annotate(format(p.get_height(), '.0f'),
                    (p.get_x() + p.get_width() / 2., p.get_height()),
                    ha='center', va='center',
                    xytext=(0, 9),
                    textcoords='offset points')
    plt.title('Distribution of {}'.format(x))
    plt.show()
for x in parameters_dis:
    plt.figure(figsize=(18,5))
    sns.kdeplot(data=df_graphs, x=x, hue="tariff_plan", alpha=0.6, palette="mako", fill=True)
    plt.show()
df_graphs
#distribution of calls_count
def logx_kde(x, y, xmin, xmax, ymin, ymax): #a function that applies the logarithm to the x-data
    x = np.log10(x)
    # Perform the kernel density estimate
    xx, yy = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
    positions = np.vstack([xx.ravel(), yy.ravel()])
    values = np.vstack([x, y])
    kernel = st.gaussian_kde(values)
    f = np.reshape(kernel(positions).T, xx.shape)
    return np.power(10, xx), yy, f
plt.figure(figsize=(18,5))
ax=sns.kdeplot(data=df_graphs, x='calls_count', hue="tariff_plan", hue_order=["A", "C", "B"], alpha = 0.6 , palette="mako", fill=True)
ax.set_xscale('log')
# box-and-whisker plot
plt.figure(figsize=(18,3))
ax = sns.boxplot(x="calls_count", y='tariff_plan', data=df_graphs, color='#cce1e1')
ax.set(xlabel='number of calls')
plt.show()
#distribution of call_duration
plt.figure(figsize=(18,5))
ax=sns.kdeplot(data=df_graphs, x='call_duration', hue="tariff_plan", hue_order=["A", "C", "B"], alpha = 0.6 , palette="mako", fill=True)
ax.set_xscale('log')
# box-and-whisker plot
plt.figure(figsize=(18,3))
ax = sns.boxplot(x="call_duration", y='tariff_plan', data=df_graphs, color='#cce1e1')
ax.set(xlabel='call duration, sec.')
plt.show()
#distribution of total_call_duration
plt.figure(figsize=(18,5))
ax=sns.kdeplot(data=df_graphs, x='total_call_duration', hue="tariff_plan", hue_order=["A", "C", "B"], alpha = 0.6 , palette="mako", fill=True)
ax.set_xscale('log')
# box-and-whisker plot
plt.figure(figsize=(18,3))
ax = sns.boxplot(x="total_call_duration", y='tariff_plan', data=df_graphs, color='#cce1e1')
ax.set(xlabel='total call duration, sec.')
plt.show()
There are 3 types of tariff plan. The biggest number of calls was made on tariff plan C (13293 calls), slightly fewer on plan B (13024), and the smallest number on plan A (10995).
Users have used all 3 plans since the beginning of August. The biggest number of users registered for plan B from 01/08/2019 till 18/08/2019; the peak of registrations for plan C was during 15/08/2019-01/09/2019;
the peak of registrations for plan A was the shortest: from 15/10/2019 till 20/11/2019.
The biggest number of outgoing calls is made on tariff B; the smallest is on plan A. The biggest number and the biggest share of incoming calls are made on plan C.
The operators on plan B have the lowest share of internal calls (11%) and operators on plan A have the highest share (16%). This might be a sign of higher professionalism of those who work with plan B and, vice versa, a lower professional level of operators on plan A.
Plan C has the biggest number of answered calls, plan A the lowest. Plan A has the highest share of missed calls (35%), plan C the lowest (32%).
For all tariff plans the share of missed calls is rather high: every third call gets no answer. The phone company might consider employing additional operators to get more calls answered and lower the waiting time.
Another option may be outsourcing operators during peak periods, e.g. when the company runs marketing campaigns and expects a large number of registrations and newcomers.
Calls count. Operators on plan B make the biggest number of calls, those on plan C the fewest. A and C have long tails to the right and lots of outliers.
Call duration. Call duration looks standard for all tariffs, but tariff plan A has a very long tail to the right with a great number of outliers.
Total call duration. The distribution of this feature looks similar to the call_duration distribution, but the average duration for tariff A is higher.
#creating a separate db
df_tariffs = df
#substituting words by numbers for chart
df_tariffs['is_missed_call'] = np.where(df_tariffs['is_missed_call'] == "missed_call", 1, 0).astype('int', copy=False)
#adding parameter of waiting time:
df_tariffs['avg_waiting_time'] = round(((df_tariffs['total_call_duration'] - df_tariffs['call_duration']) /df_tariffs['calls_count']), 2)
df_tariffs.head()
df_A = df_tariffs.query('tariff_plan == "A"')
df_B = df_tariffs.query('tariff_plan == "B"')
df_C = df_tariffs.query('tariff_plan == "C"')
df_list = [df_A, df_B, df_C]
df_A.name = 'df_A'
df_B.name = 'df_B'
df_C.name = 'df_C'
for i in df_list:
    print(i.name, i.shape[0])
#List of operators with missed calls for tariff A
missed_calls_A = df_A.query('direction=="in"').groupby('operator_id')['is_missed_call'].sum().reset_index()
missed_calls_A['is_missed_call'] = missed_calls_A['is_missed_call'].astype(int)
operators_with_missed_A = missed_calls_A.query('is_missed_call>0')['operator_id']
operators_with_missed_A.values.tolist()
print('Number of operators with missed calls for tariff A:', len(operators_with_missed_A))
#List of operators with missed calls for tariff B
missed_calls_B = df_B.query('direction=="in"').groupby('operator_id')['is_missed_call'].sum().reset_index().sort_values(by='is_missed_call', ascending=False)
missed_calls_B['is_missed_call'] = missed_calls_B['is_missed_call'].astype(int)
operators_with_missed_B = missed_calls_B.query('is_missed_call>0')['operator_id'].values.tolist()
print('Number of operators with missed calls for tariff B:', len(operators_with_missed_B))
#List of operators with missed calls for tariff C
missed_calls_C = df_C.query('direction=="in"').groupby('operator_id')['is_missed_call'].sum().reset_index().sort_values(by='is_missed_call', ascending=False)
missed_calls_C['is_missed_call'] = missed_calls_C['is_missed_call'].astype(int)
operators_with_missed_C = missed_calls_C.query('is_missed_call>0')['operator_id']
operators_with_missed_C.values.tolist()
print('Number of operators with missed calls for tariff C:', len(operators_with_missed_C))
#A
waiting_calls_operators_A = df_A.query('direction=="in"').groupby('operator_id').agg({'avg_waiting_time': 'sum'}).reset_index()
#calculating z-score
waiting_calls_operators_A['waiting_time_z']=(waiting_calls_operators_A['avg_waiting_time']-waiting_calls_operators_A['avg_waiting_time'].mean())/waiting_calls_operators_A['avg_waiting_time'].std()
#picking up the operators
operators_max_waiting_A = waiting_calls_operators_A.query('waiting_time_z>=0')['operator_id'].values.tolist()
print('Number of operators with long waiting time for tariff A:', len(operators_max_waiting_A))
#B
waiting_calls_operators_B = df_B.query('direction=="in"').groupby('operator_id').agg({'avg_waiting_time': 'sum'}).reset_index()
#calculating z-score
waiting_calls_operators_B['waiting_time_z']=(waiting_calls_operators_B['avg_waiting_time']\
-waiting_calls_operators_B['avg_waiting_time'].mean())\
/waiting_calls_operators_B['avg_waiting_time'].std()
#picking up the operators
operators_max_waiting_B = waiting_calls_operators_B.query('waiting_time_z>=0')['operator_id'].values.tolist()
print('Number of operators with long waiting time for tariff B:', len(operators_max_waiting_B))
#C
waiting_calls_operators_C = df_C.query('direction=="in"').groupby('operator_id').agg({'avg_waiting_time': 'sum'}).reset_index()
#calculating z-score
waiting_calls_operators_C['waiting_time_z']=(waiting_calls_operators_C['avg_waiting_time']\
-waiting_calls_operators_C['avg_waiting_time'].mean())\
/waiting_calls_operators_C['avg_waiting_time'].std()
#picking up the operators
operators_max_waiting_C = waiting_calls_operators_C.query('waiting_time_z>=0')['operator_id'].values.tolist()
print('Number of operators with long waiting time for tariff C:', len(operators_max_waiting_C))
#A
outgoing_calls_A = df_A.query('direction=="out"').groupby('operator_id')['calls_count'].sum().reset_index()
#z-score
outgoing_calls_A['calls_count_z']=(outgoing_calls_A['calls_count']-outgoing_calls_A['calls_count'].mean())\
/outgoing_calls_A['calls_count'].std()
operators_min_calls_A = outgoing_calls_A.query('calls_count_z<=0')['operator_id'].values.tolist()
print('Number of operators with small number of outgoing calls for tariff A:', len(operators_min_calls_A))
#B
outgoing_calls_B = df_B.query('direction=="out"').groupby('operator_id')['calls_count'].sum().reset_index()
#z-score
outgoing_calls_B['calls_count_z']=(outgoing_calls_B['calls_count']-outgoing_calls_B['calls_count'].mean())\
/outgoing_calls_B['calls_count'].std()
operators_min_calls_B = outgoing_calls_B.query('calls_count_z<=0')['operator_id'].values.tolist()
print('Number of operators with small number of outgoing calls for tariff B:', len(operators_min_calls_B))
#C
outgoing_calls_C = df_C.query('direction=="out"').groupby('operator_id')['calls_count'].sum().reset_index()
#z-score
outgoing_calls_C['calls_count_z']=(outgoing_calls_C['calls_count']-outgoing_calls_C['calls_count'].mean())\
/outgoing_calls_C['calls_count'].std()
operators_min_calls_C = outgoing_calls_C.query('calls_count_z<=0')['operator_id'].values.tolist()
print('Number of operators with small number of outgoing calls for tariff C:', len(operators_min_calls_C))
#A
internal_calls_A = df_A.query('internal=="internal" and direction=="out"').groupby('operator_id')['calls_count'].sum().reset_index()
#z-score
internal_calls_A['internal_calls_z']=(internal_calls_A['calls_count']-internal_calls_A['calls_count'].mean())\
/internal_calls_A['calls_count'].std()
operators_max_internal_A = internal_calls_A.query('internal_calls_z>=0')['operator_id'].values.tolist()
print('Number of operators with big number of outgoing internal calls for tariff A:', len(operators_max_internal_A))
#B
internal_calls_B = df_B.query('internal=="internal" and direction=="out"').groupby('operator_id')['calls_count'].sum().reset_index()
#z-score
internal_calls_B['internal_calls_z']=(internal_calls_B['calls_count']-internal_calls_B['calls_count'].mean())\
/internal_calls_B['calls_count'].std()
operators_max_internal_B = internal_calls_B.query('internal_calls_z>=0')['operator_id'].values.tolist()
print('Number of operators with big number of outgoing internal calls for tariff B:', len(operators_max_internal_B))
#C
internal_calls_C = df_C.query('internal=="internal" and direction=="out"').groupby('operator_id')['calls_count'].sum().reset_index()
#z-score
internal_calls_C['internal_calls_z']=(internal_calls_C['calls_count']-internal_calls_C['calls_count'].mean())\
/internal_calls_C['calls_count'].std()
operators_max_internal_C = internal_calls_C.query('internal_calls_z>=0')['operator_id'].values.tolist()
print('Number of operators with big number of outgoing internal calls for tariff C:', len(operators_max_internal_C))
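The six near-identical cells above all follow one pattern: aggregate a metric per operator, compute z-scores, and flag the operators on one side of the mean. A minimal sketch of a reusable helper that could replace them (the function name and the demo frame are ours, not from the project):

```python
import pandas as pd

def flag_by_zscore(df, filter_expr, value_col, above=True):
    """Aggregate value_col per operator for rows matching filter_expr,
    compute z-scores, and return operators with z >= 0 (above=True)
    or z <= 0 (above=False)."""
    grouped = (df.query(filter_expr)
                 .groupby('operator_id')[value_col]
                 .sum()
                 .reset_index())
    z = (grouped[value_col] - grouped[value_col].mean()) / grouped[value_col].std()
    mask = z >= 0 if above else z <= 0
    return grouped.loc[mask, 'operator_id'].tolist()

# tiny demo frame (hypothetical data)
demo = pd.DataFrame({'operator_id': [1, 2, 3],
                     'direction': ['out', 'out', 'out'],
                     'calls_count': [10, 20, 60]})
print(flag_by_zscore(demo, 'direction=="out"', 'calls_count', above=False))  # [1, 2]
```

With the project's frames, `flag_by_zscore(df_A, 'direction=="out"', 'calls_count', above=False)` would reproduce `operators_min_calls_A`, and `above=True` with the internal-calls filter would reproduce the internal-calls lists.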
#Operators that are in all 4 lists:
worst_operators_A = set(operators_with_missed_A) & set(operators_max_waiting_A)\
& set(operators_min_calls_A) &set(operators_max_internal_A)
print('The number of least effective operators for tariff A:', len(worst_operators_A))
#Operators that are in 2 lists:
bad_operators_A = set(operators_max_waiting_A) & set(operators_min_calls_A)
print('The number of operators of lower effectiveness:', len(bad_operators_A))
#Operators that are at least in one of the lists:
avg_effective_operators_A = list(set(operators_with_missed_A) | set(operators_max_waiting_A) \
| set(operators_min_calls_A) |set(operators_max_internal_A))
print('The number of not very effective operators for tariff A:', len(avg_effective_operators_A))
#Operators that are in all 4 lists:
worst_operators_B = set(operators_with_missed_B) & set(operators_max_waiting_B)\
& set(operators_min_calls_B) &set(operators_max_internal_B)
print('The number of least effective operators for tariff B:', len(worst_operators_B))
#Operators that are in 2 lists:
bad_operators_B = set(operators_max_waiting_B) & set(operators_min_calls_B)
print('The number of operators of lower effectiveness:', len(bad_operators_B))
#Operators that are at least in one of the lists:
avg_effective_operators_B = list(set(operators_with_missed_B) | set(operators_max_waiting_B) \
| set(operators_min_calls_B) |set(operators_max_internal_B))
print('The number of not very effective operators for tariff B:', len(avg_effective_operators_B))
#Operators that are in all 4 lists:
worst_operators_C = set(operators_with_missed_C) & set(operators_max_waiting_C)\
& set(operators_min_calls_C) &set(operators_max_internal_C)
print('The number of least effective operators for tariff C:', len(worst_operators_C))
#Operators that are in 2 lists:
bad_operators_C = set(operators_max_waiting_C) & set(operators_min_calls_C)
print('The number of operators of lower effectiveness:', len(bad_operators_C))
#Operators that are at least in one of the lists:
avg_effective_operators_C = list(set(operators_with_missed_C) | set(operators_max_waiting_C) \
| set(operators_min_calls_C) |set(operators_max_internal_C))
print('The number of not very effective operators for tariff C:', len(avg_effective_operators_C))
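The three combination cells above can be folded into one helper; a sketch (the function name is ours) that takes the four flag lists and returns the three severity groups used in this section:

```python
def classify_operators(missed, waiting, min_calls, internal):
    """Combine the four flag lists into the three groups used above."""
    sets = [set(missed), set(waiting), set(min_calls), set(internal)]
    worst = set.intersection(*sets)   # present in all four lists
    bad = sets[1] & sets[2]           # long waits and few outgoing calls
    flagged = set.union(*sets)        # present in at least one list
    return {'worst': worst, 'bad': bad, 'flagged': flagged}

# demo with hypothetical operator ids
groups = classify_operators([1, 2], [2, 3], [2, 4], [2, 5])
print(groups['worst'])  # {2}
```

Calling it once per tariff with the corresponding lists would replace the three near-duplicate cells.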
#Creating a DataFrame for visualisation of the results:
#A
data_op_tariff_A=[{'avg_effective_operators': len(avg_effective_operators_A), 'bad_operators':len(bad_operators_A), \
'worst_operators': len(worst_operators_A)
}]
operators_funnel_tariff_A = pd.DataFrame(data_op_tariff_A).T.reset_index()
operators_funnel_tariff_A.columns=['operators_groups', 'number']
operators_funnel_tariff_A.insert(0, "tariff", "A")
#B
data_op_tariff_B=[{'avg_effective_operators': len(avg_effective_operators_B), 'bad_operators':len(bad_operators_B), \
'worst_operators': len(worst_operators_B)
}]
operators_funnel_tariff_B = pd.DataFrame(data_op_tariff_B).T.reset_index()
operators_funnel_tariff_B.columns=['operators_groups', 'number']
operators_funnel_tariff_B.insert(0, "tariff", "B")
#C
data_op_tariff_C=[{'avg_effective_operators': len(avg_effective_operators_C), 'bad_operators':len(bad_operators_C), \
'worst_operators': len(worst_operators_C)
}]
operators_funnel_tariff_C = pd.DataFrame(data_op_tariff_C).T.reset_index()
operators_funnel_tariff_C.columns=['operators_groups', 'number']
operators_funnel_tariff_C.insert(0, "tariff", "C")
frames = [operators_funnel_tariff_A, operators_funnel_tariff_B, operators_funnel_tariff_C]
operators_funnel_tariffs = pd.concat(frames)
operators_funnel_tariffs
#funnel chart of the 3 operator groups by tariff:
fig = go.Figure(go.Funnel(y = operators_funnel_tariffs.operators_groups, x = operators_funnel_tariffs.number,
opacity = 0.85, marker = {"color": ['#333160', '#5b5895', '#9467bd', '#3ac7c3' , '#72B7B2', '#c6f7e7', '#C9FBE5', 'teal', 'mediumturquoise', 'steelblue']}))
fig.show()
In the course of this work we have defined 3 groups of operators with different degrees of inefficiency, both overall and for every tariff plan:
The biggest share of ineffective operators (those who appear in all 4 lists) belongs to tariff plan A, and the smallest to tariff plan C.
On average, operators working with tariff plan A demonstrate less professional work than their colleagues. The data shows a large number of new users of tariff plan A joining the company in the middle of October. Probably the company had to employ new operators for tariff A or has not arranged enough training for the existing ones.
We would recommend the following:
We are going to test the following hypothesis: the number of calls (incoming and outgoing) depends on the tariff plan of the call.
Stating the hypotheses:
H0: the average number of calls per operator is equal for all tariff plans
H1: the average number of calls per operator differs between tariff plans
#Getting rid of outliers
#defining outliers with 3-sigma method for all tariff plans
#as we are calculating std for a sample, ddof is set to 1
std_score_A = np.std(df_A['calls_count'], ddof=1)
three_sigma_score_A_lower = round((df_A['calls_count'].mean() - std_score_A*3),2)
#we do not use the lower fence as it would give a negative number of calls.
three_sigma_score_A_upper = round((df_A['calls_count'].mean() + std_score_A*3),2)
std_score_B = np.std(df_B['calls_count'], ddof=1)
three_sigma_score_B_lower = round((df_B['calls_count'].mean() - std_score_B*3),2)
#we do not use the lower fence as it would give a negative number of calls.
three_sigma_score_B_upper = round((df_B['calls_count'].mean() + std_score_B*3),2)
std_score_C = np.std(df_C['calls_count'], ddof=1)
three_sigma_score_C_lower = round((df_C['calls_count'].mean() - std_score_C*3),2)
#we do not use the lower fence as it would give a negative number of calls.
three_sigma_score_C_upper = round((df_C['calls_count'].mean() + std_score_C*3),2)
print('99.7% of calls A are from 0 to ', three_sigma_score_A_upper,\
'. \n99.7% of calls B are from 0 to ', three_sigma_score_B_upper,'.'\
'\n99.7% of calls C are from 0 to ', three_sigma_score_C_upper,'.')
#setting df without outliers
df_A_no_outliers = df_A.query('calls_count<=@three_sigma_score_A_upper')
df_B_no_outliers = df_B.query('calls_count<=@three_sigma_score_B_upper')
df_C_no_outliers = df_C.query('calls_count<=@three_sigma_score_C_upper')
#defining the variance for the 3 tariffs to decide whether we can consider them equal for the t-test.
variance_A = np.var(df_A_no_outliers['calls_count'])
variance_B = np.var(df_B_no_outliers['calls_count'])
variance_C = np.var(df_C_no_outliers['calls_count'])
print('Variance for A calls sample is ', variance_A,\
'\nVariance for B calls sample is ', variance_B,\
'\nVariance for C calls sample is ', variance_C,)
Variances vary greatly.
Testing whether the distribution is normal for the 3 groups
sample_A = df_A_no_outliers['calls_count'].values.tolist()
sample_B = df_B_no_outliers['calls_count'].values.tolist()
sample_C = df_C_no_outliers['calls_count'].values.tolist()
samples_list = [sample_A, sample_B, sample_C]
def check_normality(sample, alpha):
    results = st.shapiro(sample)
    p_value = results[1]  # the second value in the results (index 1) is the p-value
    print('p-value: ', p_value)
    if p_value < alpha:
        print("Null hypothesis rejected: the distribution is not normal")
    else:
        print("Failed to reject the null hypothesis: the distribution seems to be normal")
for i in samples_list:
    check_normality(i, alpha=0.05)
The calculations show that the distribution is not normal for any of the samples.
Consequently we should choose a non-parametric test for testing the hypothesis (one that doesn't assume the data fits a specific distribution). Non-parametric tests include the Wilcoxon signed-rank test, the Mann-Whitney U test, and the Kruskal-Wallis test.
We are going to assess the statistical significance of the difference between the groups using the Mann-Whitney U test (the non-parametric counterpart of the Student's t-test), because we have independent samples with quantitative variables and the data doesn't follow a normal distribution.
As we are going to run 3 comparisons (A/B, A/C, B/C), we will adjust the significance level accordingly, using the Šidák method. Normally alpha is set to 0.05, a commonly accepted level in industry.
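Since there are three groups, the Kruskal-Wallis test mentioned above could also serve as an omnibus check before the pairwise comparisons; a sketch on synthetic stand-ins for the three call-count samples (the data here is ours, not the project's):

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(0)
# synthetic stand-ins: one shifted group, two similar groups
a = rng.poisson(30, 300)
b = rng.poisson(20, 300)
c = rng.poisson(20, 300)

h_stat, p = st.kruskal(a, b, c)
print(f'Kruskal-Wallis p-value: {p:.4f}')
# a small p only tells us that at least one group differs;
# pairwise tests are still needed to locate which pairs differ
```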
Calculating alpha with The Šidák method (guarantees family-wise error rate < ɑ).
The required significance level will be found as follows:
alpha_sidak = 1 - (1 - 0.05)**(1/3) #we have 3 groups, which gives us 3 pairwise comparisons (A/B, A/C, B/C)
round(alpha_sidak,4)
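As a cross-check (not part of the original analysis), the simpler Bonferroni correction gives almost the same threshold, which suggests the choice of correction method barely matters here:

```python
alpha = 0.05
m = 3  # pairwise comparisons: A/B, A/C, B/C
alpha_sidak = 1 - (1 - alpha) ** (1 / m)   # Šidák: exact family-wise control
alpha_bonferroni = alpha / m               # Bonferroni: slightly more conservative
print(round(alpha_sidak, 4))       # 0.017
print(round(alpha_bonferroni, 4))  # 0.0167
```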
from scipy.stats import mannwhitneyu
#forming 3 groups:
callsByOperators_A = df_A_no_outliers.groupby('operator_id', as_index=False).agg({'calls_count' : 'sum'})
callsByOperators_B = df_B_no_outliers.groupby('operator_id', as_index=False).agg({'calls_count' : 'sum'})
callsByOperators_C = df_C_no_outliers.groupby('operator_id', as_index=False).agg({'calls_count' : 'sum'})
outliersA = int(np.percentile(callsByOperators_A['calls_count'], [99]))
outliersA
outliersB = int(np.percentile(callsByOperators_B['calls_count'], [99]))
outliersB
outliersC = int(np.percentile(callsByOperators_C['calls_count'], [99]))
outliersC
#arranging the data in the right order for the Mann-Whitney U test
sample_op_A = callsByOperators_A.query('calls_count < @outliersA')['calls_count'].values.tolist() #values for A sample
sample_op_B = callsByOperators_B.query('calls_count < @outliersB')['calls_count'].values.tolist() #values for B sample
sample_op_C= callsByOperators_C.query('calls_count < @outliersC')['calls_count'].values.tolist() #values for C sample
#function for testing hypothesis
def check_hypothesis(group1, group2, alpha):
    p_value = st.mannwhitneyu(group1, group2)[1]
    print('p-value: {0:.3f}'.format(p_value))
    if p_value < alpha:
        print('We reject the null hypothesis for the two groups')
    else:
        print("We can't reject the null hypothesis for the two groups")
#Comparison of number of calls for operators, tariff plans A and B
check_hypothesis(sample_op_A,sample_op_B, alpha_sidak)
#Comparison of number of calls for operators, tariff plans A and C
check_hypothesis(sample_op_A,sample_op_C, alpha_sidak)
#Comparison of number of calls for operators, tariff plans B and C
check_hypothesis(sample_op_B,sample_op_C, alpha_sidak)
The data provides sufficient evidence, given the significance level we selected (alpha_sidak ≈ 1.7%), to reject the null hypothesis for groups A and B, and for groups A and C. Therefore we can conclude that the average number of calls made and received by operators on tariff A differs from tariff B (μA != μB) and from tariff C (μA != μC).
At the same time, the data does not provide sufficient evidence, given the significance level of 1.7%, to reject the null hypothesis for groups B and C. Therefore we cannot conclude that the average numbers of calls for tariffs B and C differ (μB = μC).
We will plot the charts to visualize the calculations.
fig = plt.figure(figsize= (10, 10))
ax = fig.add_subplot(111)
ax.set_title("Box Plot of calls by tariff plans", fontsize= 20)
data = [callsByOperators_A['calls_count'],
callsByOperators_B['calls_count'],
callsByOperators_C['calls_count']]
ax.boxplot(data,
labels= ['A', 'B', 'C'],
showmeans= True)
plt.xlabel("Tariff plan")
plt.ylabel("Number of calls by operators")
plt.show()
The graphical check of variance homogeneity supports the statistical findings:
the difference in number of calls is much larger for the pairs A/B and A/C;
there is no statistically significant difference between groups B and C.
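A more formal check of variance homogeneity than eyeballing the box plots could be Levene's test, which is robust to non-normality; a sketch with synthetic stand-ins for two samples with clearly different spreads (the data here is ours):

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(42)
# synthetic stand-ins with deliberately different spreads
group1 = rng.normal(100, 10, 200)
group2 = rng.normal(100, 30, 200)

stat, p = st.levene(group1, group2)
print(f'Levene p-value: {p:.4f}')
if p < 0.05:
    print('Variances differ significantly')
```

With the project's data this would be `st.levene(sample_op_A, sample_op_B, sample_op_C)`.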
The t-test is invalid for small samples from non-normal distributions, but it is valid for large samples from non-normal distributions. Thus we are going to run the t-test on our samples, as they are large enough (n > 50).
We will test the difference for every group in general to prove the previous calculations with t-test.
#forming a table
samplesA_B_C = df_tariffs.pivot_table(index='date', values='calls_count', columns='tariff_plan', aggfunc='sum').reset_index().fillna(0)
samplesA_B_C.columns = ['date', 'A', 'B', 'C']
samplesA_B_C.head()
Defining the outlier thresholds (the number of calls covering 99% of each sample)
outliersA_t = int(np.percentile(samplesA_B_C['A'], [99]))
outliersA_t
outliersB_t = int(np.percentile(samplesA_B_C['B'], [99]))
outliersB_t
outliersC_t = int(np.percentile(samplesA_B_C['C'], [99]))
outliersC_t
Forming samples
sample_A_t = samplesA_B_C.query('A < @outliersA_t')['A'].values.tolist()
sample_B_t = samplesA_B_C.query('B < @outliersB_t')['B'].values.tolist()
sample_C_t = samplesA_B_C.query('C < @outliersC_t')['C'].values.tolist()
Checking variance
#defining the variance for the 3 tariffs to decide whether we can consider them equal for the t-test.
variance_A_t = np.var(sample_A_t)
variance_B_t = np.var(sample_B_t)
variance_C_t = np.var(sample_C_t)
print('Variance for A sample is ', variance_A_t.round(1),\
'\nVariance for B sample is ', variance_B_t.round(1),\
'\nVariance for C sample is ', variance_C_t.round(1))
Variances vary greatly.
Testing hypotheses
results_A_B = st.ttest_ind(
sample_A_t,
sample_B_t, equal_var = False)
results_A_C = st.ttest_ind(
sample_A_t,
sample_C_t, equal_var = False)
results_C_B = st.ttest_ind(
sample_C_t,
sample_B_t, equal_var = False)
#we are going to use alpha_sidak here as there were 3 comparisons made
print('p-value to compare the groups A and B: ', results_A_B.pvalue)
print('p-value to compare the groups A and C: ', results_A_C.pvalue)
print('p-value to compare the groups C and B: ', results_C_B.pvalue)
if results_A_B.pvalue < alpha_sidak:
    print("Null hypothesis rejected for groups A and B")
else:
    print("Null hypothesis not rejected for groups A and B")
if results_A_C.pvalue < alpha_sidak:
    print("Null hypothesis rejected for groups A and C")
else:
    print("Null hypothesis not rejected for groups A and C")
if results_C_B.pvalue < alpha_sidak:
    print("Null hypothesis rejected for groups C and B")
else:
    print("Null hypothesis not rejected for groups C and B")
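The three if/else blocks above could be folded into a loop; a minimal sketch (the function name is ours):

```python
def report_pairwise(pvalues, labels, alpha):
    """Print a rejected / not-rejected verdict for each pairwise comparison."""
    verdicts = {}
    for p, label in zip(pvalues, labels):
        verdicts[label] = p < alpha
        action = 'rejected' if p < alpha else 'not rejected'
        print(f'Null hypothesis {action} for groups {label} (p={p:.4f})')
    return verdicts

# demo with hypothetical p-values
print(report_pairwise([0.001, 0.5], ['A/B', 'B/C'], alpha=0.017))
```

With the project's results this would be called as `report_pairwise([results_A_B.pvalue, results_A_C.pvalue, results_C_B.pvalue], ['A/B', 'A/C', 'C/B'], alpha_sidak)`.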
fig = plt.figure(figsize= (8, 8))
ax = fig.add_subplot(111)
ax.set_title("Box Plot of calls by tariff plans", fontsize= 20)
data = [samplesA_B_C['A'],
samplesA_B_C['B'],
samplesA_B_C['C']]
ax.boxplot(data,
labels= ['A', 'B', 'C'],
showmeans= True)
plt.xlabel("Tariff plan")
plt.ylabel("Number of calls")
plt.show()
Both tests showed that we cannot reject the null hypothesis for groups B and C. Therefore we can conclude that the average numbers of calls for tariff plans B and C do not differ (μB = μC).
But we can reject it for the comparisons A/B and A/C: the data provides sufficient evidence, given the significance level (1.7%), to reject the null hypothesis.
Therefore we can conclude that the average number of calls for tariff plan A differs from tariff plans B and C (μA != μB and μA != μC).
Consequently, the difference in operators' performance is not due to chance and is statistically significant. The work of operators on tariff A differs from the work of operators on the other tariffs.
In the course of this work we have studied data on the call company's clients and operators:
Having studied the given features (number of missed incoming calls, waiting time for incoming calls (average speed of answer), number of outgoing calls, number of internal calls) we generated 4 lists of operators that performed worse than their colleagues on average, using the z-score method to define the threshold. Next we created 3 groups of operators by combining the 4 lists:
Likewise, we created such groups for every tariff plan. Analyzing operator performance across tariff plans, we found that the share of operators that performed badly is the highest for tariff A and the lowest for tariff C. Tariff A has the smallest number of outgoing calls, the smallest number of answered calls and the biggest share of internal calls. Generally, operators working on plan A demonstrate a lower professional level than their colleagues working on the other tariff plans.
We have tested the hypothesis that the number of calls depends on the tariff plan of the call. The calculations show that, at the given significance level (1.7%), there is a statistically significant difference in the number of calls for tariff A, and there is no statistical difference between the numbers of calls for tariffs B and C. Supposedly, due to the large number of new users of tariff A that joined the company in the middle of October, the company has not allocated enough operators and has not provided enough training.
The results of the study can be used further for labeling, supervised learning and developing a model for detecting ineffective operators. This would automate the process of identifying ineffective operators in the future.
General recommendations:
- additional training for operators working on tariff A (more insights into tariff A specifics);
- additional motivation for operators working on tariff A (bonus system, incentives, etc.);
- monitoring the performance of operators in the 'operators of average effectiveness' group;
- additional training for those in the 'operators of lower effectiveness' group;
- additional training and supervised work for those in the 'least effective operators' group;
- rewards and incentives for those in the 'good operators' group.
To conclude, we would recommend shifting away from defining operators' effectiveness through goals such as handling a certain number of calls within a designated period of time. This puts a lot of pressure on operators and does not lead to client satisfaction in the long run. Today's effectiveness might rather be measured by issue-resolution rate and retention rate. Resolving issues and satisfying customers can be better motivation for operators than reaching a large number of calls.