The aim of the project is to investigate user behavior for the company's app (a startup that sells food products).
First we will preprocess the data and study the sales funnel:

  • find out how users reach the purchase stage
  • How many users actually make it to this stage
  • how many get stuck at previous stages

Next we will check the results of an A/A/B test: if change of the fonts for the entire app have any impact on users. We will pick up the equal number of users for all 3 groups and run a statistical test to check if there is a statistically significant difference in number of users for any of the events. </br>

Table of contents:

In [1]:
import pandas as pd
from IPython.display import display
from datetime import datetime
import plotly.express as px
from plotly import graph_objects as go
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats as st
import seaborn as sns
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")
import math 

Part 1. Data preprocessing

Reading the data base

In [2]:
#adjustments of the path for the Practicum platform:
df = pd.read_csv('./logs_exp_us.csv', delim_whitespace=True)
display(df.head())
EventName DeviceIDHash EventTimestamp ExpId
0 MainScreenAppear 4575588528974610257 1564029816 246
1 MainScreenAppear 7416695313311560658 1564053102 246
2 PaymentScreenSuccessful 3518123091307005509 1564054127 248
3 CartScreenAppear 3518123091307005509 1564054127 248
4 PaymentScreenSuccessful 6217807653094995999 1564055322 248
In [3]:
display(df.sample(6))
EventName DeviceIDHash EventTimestamp ExpId
62700 PaymentScreenSuccessful 4089267023360117805 1564758931 248
165178 MainScreenAppear 8544371504846351459 1565019278 246
215708 OffersScreenAppear 8273551061341945191 1565154626 248
136265 MainScreenAppear 6820216004207280160 1564942506 247
185221 MainScreenAppear 5476114506218237691 1565078994 247
65114 CartScreenAppear 6996377124129943223 1564762399 247

Checking and changing the data types

In [4]:
df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244126 entries, 0 to 244125
Data columns (total 4 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   EventName       244126 non-null  object
 1   DeviceIDHash    244126 non-null  int64 
 2   EventTimestamp  244126 non-null  int64 
 3   ExpId           244126 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 16.4 MB

In [5]:
#checking the object-type column if it can be turned to a category one. 
df['EventName'].value_counts()
Out[5]:
MainScreenAppear           119205
OffersScreenAppear          46825
CartScreenAppear            42731
PaymentScreenSuccessful     34313
Tutorial                     1052
Name: EventName, dtype: int64
In [6]:
df['EventName'] = df['EventName'].astype('category')
In [7]:
#timestamp column needs to be converted, getting rid of unix time format
df['EventTimestamp'] = pd.to_datetime(df['EventTimestamp'],unit='s')
display(df['EventTimestamp'])
0        2019-07-25 04:43:36
1        2019-07-25 11:11:42
2        2019-07-25 11:28:47
3        2019-07-25 11:28:47
4        2019-07-25 11:48:42
                 ...        
244121   2019-08-07 21:12:25
244122   2019-08-07 21:13:59
244123   2019-08-07 21:14:43
244124   2019-08-07 21:14:58
244125   2019-08-07 21:15:17
Name: EventTimestamp, Length: 244126, dtype: datetime64[ns]
In [8]:
#re-cheking memory usage
df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244126 entries, 0 to 244125
Data columns (total 4 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   EventName       244126 non-null  category      
 1   DeviceIDHash    244126 non-null  int64         
 2   EventTimestamp  244126 non-null  datetime64[ns]
 3   ExpId           244126 non-null  int64         
dtypes: category(1), datetime64[ns](1), int64(2)
memory usage: 5.8 MB

The memory usage has dropped significatly: from 16.4 MB to 5.8 MB.

Adding separate columns with time and date

In [9]:
df['event_date']  = df['EventTimestamp'].dt.date
df['event_time'] = df['EventTimestamp'].dt.time
df.sample(6)
Out[9]:
EventName DeviceIDHash EventTimestamp ExpId event_date event_time
189000 MainScreenAppear 1888138255084738292 2019-08-06 10:05:08 246 2019-08-06 10:05:08
200395 MainScreenAppear 5335957243538234701 2019-08-06 15:19:28 247 2019-08-06 15:19:28
236719 MainScreenAppear 4233725776879391824 2019-08-07 16:26:30 246 2019-08-07 16:26:30
151394 CartScreenAppear 3689505530499390250 2019-08-05 10:00:39 246 2019-08-05 10:00:39
38050 CartScreenAppear 4623191541214045580 2019-08-01 21:10:26 247 2019-08-01 21:10:26
223144 MainScreenAppear 4901615437367572701 2019-08-07 10:04:56 248 2019-08-07 10:04:56
In [10]:
df['event_date'] = pd.to_datetime(df['event_date'], format='%Y-%m-%d')
#df['event_time'] = pd.to_datetime(df['event_time'], format='%H:%M:%S')

Renaming columns

In [11]:
#renaming columns
df.columns = ['event_name' ,'user_id', 'timestamp', 'experiment_number', 'event_date', 'event_time']
df.sample(6)
Out[11]:
event_name user_id timestamp experiment_number event_date event_time
102439 MainScreenAppear 4523373475875031235 2019-08-03 17:36:11 246 2019-08-03 17:36:11
91984 CartScreenAppear 4623191541214045580 2019-08-03 12:58:12 247 2019-08-03 12:58:12
15980 PaymentScreenSuccessful 5726813107141430693 2019-08-01 11:18:56 247 2019-08-01 11:18:56
195223 OffersScreenAppear 3436171726922480780 2019-08-06 13:20:48 248 2019-08-06 13:20:48
66929 OffersScreenAppear 7773551664554459042 2019-08-02 17:02:33 246 2019-08-02 17:02:33
239743 MainScreenAppear 955444249670387244 2019-08-07 17:55:10 248 2019-08-07 17:55:10

Checking for duplicated data

In [12]:
#Checking for duplicated data
df.duplicated().sum()
Out[12]:
413
In [13]:
#checking what kind of duplicates we have
duplicated_rows = df[df.duplicated()]
duplicated_rows.sample(15)
Out[13]:
event_name user_id timestamp experiment_number event_date event_time
86150 CartScreenAppear 3447945165857834216 2019-08-03 10:36:16 247 2019-08-03 10:36:16
130558 MainScreenAppear 33176906322804559 2019-08-04 15:37:55 248 2019-08-04 15:37:55
124800 PaymentScreenSuccessful 7180477252978892964 2019-08-04 12:57:14 247 2019-08-04 12:57:14
137612 MainScreenAppear 8153398568531477841 2019-08-04 19:02:22 248 2019-08-04 19:02:22
117195 PaymentScreenSuccessful 6744475526048766609 2019-08-04 09:48:14 247 2019-08-04 09:48:14
60524 MainScreenAppear 3267675483921543552 2019-08-02 14:24:45 247 2019-08-02 14:24:45
97581 PaymentScreenSuccessful 3610536745613892312 2019-08-03 15:21:03 248 2019-08-03 15:21:03
173339 MainScreenAppear 3315376538884741797 2019-08-05 19:02:43 247 2019-08-05 19:02:43
193890 PaymentScreenSuccessful 6932517045703054087 2019-08-06 12:41:18 248 2019-08-06 12:41:18
81499 PaymentScreenSuccessful 1211886626193804462 2019-08-03 08:30:29 246 2019-08-03 08:30:29
172741 CartScreenAppear 7586870289028771156 2019-08-05 18:37:07 247 2019-08-05 18:37:07
12454 PaymentScreenSuccessful 5152160705477623487 2019-08-01 09:35:39 247 2019-08-01 09:35:39
190193 MainScreenAppear 2965031923404247543 2019-08-06 10:43:55 246 2019-08-06 10:43:55
21010 PaymentScreenSuccessful 4980045449118619005 2019-08-01 13:20:40 247 2019-08-01 13:20:40
220106 MainScreenAppear 7110398937204962158 2019-08-07 08:26:05 248 2019-08-07 08:26:05
In [14]:
#checking the dates when the data was duplicated:
duplicated_rows['timestamp'].dt.date.unique()
Out[14]:
array([datetime.date(2019, 7, 30), datetime.date(2019, 7, 31),
       datetime.date(2019, 8, 1), datetime.date(2019, 8, 2),
       datetime.date(2019, 8, 3), datetime.date(2019, 8, 4),
       datetime.date(2019, 8, 5), datetime.date(2019, 8, 6),
       datetime.date(2019, 8, 7)], dtype=object)

So far we cannot see any pattern in duplicates. There are data on different events, dates (from 30.07.2019 till 7.08.2019), all experiment groups.

In [15]:
#checking area of the problem caused duplicated data:
for i in duplicated_rows:
    print(i, ":", duplicated_rows[i].nunique())
event_name : 5
user_id : 237
timestamp : 352
experiment_number : 3
event_date : 9
event_time : 352

The server had some problem duplicated data during all 5 events on 237 unique users with 352 timestamps in all 3 experiment groups.

Duplicated data is only 0.17% of the whole data base (413/244126) thus we can drop it without skewing the analysis.

In [16]:
#getting rid of duplicated data
df = df.drop_duplicates().reset_index(drop = True)

Checking for missing values

In [17]:
#Checking data for zeros:
for i in df.columns:
    print(i, len(df[df[i]==0]))
event_name 0
user_id 0
timestamp 0
experiment_number 0
event_date 0
event_time 0
In [18]:
# checking the number of experiments in every group: 
df['experiment_number'].value_counts()
Out[18]:
248    85582
246    80181
247    77950
Name: experiment_number, dtype: int64
In [19]:
#statistical summary for categorical variables
df.describe(include=['category'])
Out[19]:
event_name
count 243713
unique 5
top MainScreenAppear
freq 119101
In [20]:
#Identifing the number of unique users:
df['user_id'].value_counts().count()
Out[20]:
7551
In [21]:
df['user_id'].nunique()
Out[21]:
7551
In [22]:
df.shape
Out[22]:
(243713, 6)

Conclusion

The data for analysis contains 4 rows with 244126 entries; data on 5 event types for 7551 unique users.
There are 5 unique EventName originally marked as object type. For the sake of memory usage we have converted it to a category type. "MainScreenAppear " is the most frequently appearing category (119205 ot of 244126 ).

Another type conversion has been proceeded with EventTimestamp. We have converted unix time format to datetime and created two separate columns with date of event and time of event.
Changing data type has allowed us to to improve the memory usage ( memory usage has dropped significatly: from 16.4 MB to 5.8 MB).
There were 413 duplicates (0.17%) in the data set that have been successfully dropped.
No missing values have been found in data set.
We have renamed the columns using clear names without upper case letters.
Resulting data set has 6 columns with 243713 rows.

Part 2. Data analysis

How many events are there in the logs

As we have just figured it out in the previous part see here, there are 5 types of events in the data/funnel. Below is the graphical interpretation of the data.

In [23]:
#creating a table 
df_events = df.groupby(['event_name']).agg({'user_id': 'count'}).reset_index()
#renaming columns
df_events.columns = ['event_name' ,'number']
df_events['event_name'] = df_events['event_name'].astype(str)
df_events.sort_values(by='number', ascending=False)
Out[23]:
event_name number
1 MainScreenAppear 119101
2 OffersScreenAppear 46808
0 CartScreenAppear 42668
3 PaymentScreenSuccessful 34118
4 Tutorial 1018
In [24]:
plt.figure(figsize=(12,5))

# plot barh chart 
ax = sns.barplot(df_events.event_name, df_events.number, palette='mako') 
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.set(xlabel="events", ylabel='users')

ax.set_xticklabels(df_events.event_name)
for item in ax.get_xticklabels(): item.set_rotation(0)
for i, v in enumerate(df_events["number"].iteritems()):        
    ax.text(i ,v[1], "{:,}".format(v[1]), color='k', va ='bottom', rotation=0)

plt.title('Number of different types of events')
plt.show()
In [25]:
#pie chart for types of events
colors = ['gold', 'mediumturquoise', 'darkorange', 'lightgreen', 'blue']
labels = df_events['event_name']
values = df_events['number']
fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.update_layout(title_text='Shares of different types of events')
fig.update_traces(hoverinfo='label+percent', textfont_size=20,
                  marker=dict(colors=colors, line=dict(color='#000000', width=2)))
fig.show()

We can see that almost half of the events are MainScreen; OfferScreen and Cart screen have almost equal shares of 19-17.5%. The least frequent event is watching tutotials. Either they do not have any value to the customers or it is hard to find.

How many users are in the logs

In [26]:
df['user_id'].nunique()
print('There are', df['user_id'].nunique(), 'unique users in the data set.')
There are 7551 unique users in the data set.

What's the average number of events per user?

In [27]:
#calculating events per user
df_events_users = df.groupby(['user_id']).agg({'event_name': 'count'}).sort_values(by='event_name', ascending=False).reset_index()
df_events_users
Out[27]:
user_id event_name
0 6304868067479728361 2307
1 197027893265565660 1998
2 4623191541214045580 1768
3 6932517045703054087 1439
4 1754140665440434215 1221
... ... ...
7546 7399061063341528729 1
7547 2968164493349205501 1
7548 8071397669512236988 1
7549 425817683219936619 1
7550 6888746892508752 1

7551 rows × 2 columns

In [28]:
#checking up how many unique events the unique users performed:
unique_events_per_users = df.groupby(['user_id'])['event_name'].nunique().reset_index().groupby(['event_name'])['user_id'].nunique()
unique_events_per_users
Out[28]:
event_name
1    2707
2    1021
3     317
4    3035
5     471
Name: user_id, dtype: int64

Only 471 user had all 5 events.

In [29]:
df_events_users[['event_name']].describe()
Out[29]:
event_name
count 7551.000000
mean 32.275593
std 65.154219
min 1.000000
25% 9.000000
50% 20.000000
75% 37.000000
max 2307.000000
In [30]:
#poltting distribution
plt.figure(figsize=(9,5))
sns.distplot(df_events_users['event_name']) 
 
plt.title('number of events per user')
plt.show()
In [31]:
# box-and-whisker plot
plt.figure(figsize=(9,5))
ax = sns.boxplot(x="event_name", data=df_events_users) 
In [32]:
#Calculatinug the 95th and 99th percentiles for the number of events (setting out outliers). 
np.percentile(df_events_users['event_name'], [95, 99])
Out[32]:
array([ 89. , 200.5])

The number of events per user vary from 1 to 2307. Middle 50% of users had from 9 to 37 events.

Not more than 5% of users had over 89 events and only 1% of chains had over 200 events. Thus users with over 200 events are outliers for our distribution.

Mean value is 32 events, median value is 20. As we have positively skewed distribution with a lot of outliers to the right, we will deem median value as an average value for the distribution. Thus average number of events per user is 20.

What period of time does the data cover? Find the moment at which the data starts to be complete and ignore the earlier section. What period does the data actually represent?

In [33]:
df[['event_date', 'event_time']].describe()
Out[33]:
event_date event_time
count 243713 243713
unique 14 69824
top 2019-08-01 00:00:00 12:29:55
freq 36141 19
first 2019-07-25 00:00:00 NaN
last 2019-08-07 00:00:00 NaN
In [34]:
df['timestamp'].min()
Out[34]:
Timestamp('2019-07-25 04:43:36')
In [35]:
df['timestamp'].max()
Out[35]:
Timestamp('2019-08-07 21:15:17')

The time covers period of 14 days starting from 2019-07-25 04:43:36 to 2019-08-07 21:15:17.

In [36]:
dates_views = df.groupby(['event_date']).agg({'user_id':'count'}).reset_index()
dates_views.columns = ['event_date' ,'events_number']
dates_views.reset_index(drop=True)
Out[36]:
event_date events_number
0 2019-07-25 9
1 2019-07-26 31
2 2019-07-27 55
3 2019-07-28 105
4 2019-07-29 184
5 2019-07-30 412
6 2019-07-31 2030
7 2019-08-01 36141
8 2019-08-02 35554
9 2019-08-03 33282
10 2019-08-04 32968
11 2019-08-05 36058
12 2019-08-06 35788
13 2019-08-07 31096
In [37]:
dates_views.plot(x='event_date', y='events_number', kind='bar', grid=True, figsize=(20,5), cmap='Set2', title='Number of Events per Date')

plt.show()
In [38]:
fig = px.area(dates_views, x='event_date', y="events_number", title='Number of Events per Date')
fig.show()
In [39]:
#slicing the data 
df_complete = df.query('event_date>="2019-08-01"')
df_complete['timestamp'].min()
Out[39]:
Timestamp('2019-08-01 00:07:28')
In [40]:
df_complete['timestamp'].max()
Out[40]:
Timestamp('2019-08-07 21:15:17')

Despite the fact that we have data starting from 25/07/2019 it is clear that the data is incomplete till 01/08/2019. That might have happened due to some technical reasons. Thus we have to slice the data base respectively. The actual data represents period of one week, from 2019-08-01 00:07:28 till 2019-08-07 21:15:17.

Did you lose many users and events when excluding the older data?

In [41]:
df_excluded = df.query('event_date<"2019-08-01"')
df_excluded.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2826 entries, 0 to 2825
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   event_name         2826 non-null   category      
 1   user_id            2826 non-null   int64         
 2   timestamp          2826 non-null   datetime64[ns]
 3   experiment_number  2826 non-null   int64         
 4   event_date         2826 non-null   datetime64[ns]
 5   event_time         2826 non-null   object        
dtypes: category(1), datetime64[ns](2), int64(2), object(1)
memory usage: 124.3+ KB
In [42]:
df_excluded['user_id'].nunique()
Out[42]:
1451

In course of slicing we have lost 2826 entries(which is 1.16 % of the data base) and 1451 unique users (which is 19% of the data base).

In [43]:
#checking the groups:
df_complete.experiment_number.value_counts()
Out[43]:
248    84563
246    79302
247    77022
Name: experiment_number, dtype: int64
In [44]:
#checking the events:
df_complete.event_name.unique()
Out[44]:
[Tutorial, MainScreenAppear, OffersScreenAppear, CartScreenAppear, PaymentScreenSuccessful]
Categories (5, object): [Tutorial, MainScreenAppear, OffersScreenAppear, CartScreenAppear, PaymentScreenSuccessful]

We still have users from all of 3 experiment groups as well as all the events in the filtered data set.

Conclusion.

There are 5 types of events in the data set: almost half of the logs was on the MainScreen, then go OfferScreen and Cart screen with slightly less than 20% each, PaymentScreen got 14% of logs and the least popular type is Tutorial with only 0.4% of logs. Number events per user vary greatly, from 1 to 2307; middle 50% of users had from 9 to 37 events; the average number of events per user is 20.
There are logs from 2019-07-25 to 2019-08-07 in the data set, but study of data distribution shows that the logs data is incomplete till 2019-07-31 inclusively. Thus we had to drop the data on logs till 01/08/2019. Due to the slice we have lost 1.16 % of enties (2826 entries) and 19% of unique users (1451 users IDs out of 7551 unique users ).
This shows that the data with unique users was not evenly distributed. Some technical issues with users' logs could account for that.
The filtered data still has users from all three experimental groups.

Event funnel


[back to top](#top)

[previous part](#7)

What events are in the logs and their frequency of occurrence

We will define frequency of every event type.

In [45]:
events = df_complete.groupby('event_name').agg({'user_id': ['count']}).reset_index()
events.columns = ['event_name','frequency'] #renaming columns
events = events.sort_values(by='frequency', ascending=False)
events.head(10)
Out[45]:
event_name frequency
1 MainScreenAppear 117328
2 OffersScreenAppear 46333
0 CartScreenAppear 42303
3 PaymentScreenSuccessful 33918
4 Tutorial 1005
In [46]:
fig, ax = plt.subplots(figsize=(12, 7))
ax.set_title(' frequency of events',fontsize=16)
plt.xlabel('Type of event',fontsize=12)
plt.ylabel('frequency',fontsize=12)
colors = ['lightblue', 'mediumturquoise', 'orange', 'lightgreen', 'plum', 'bisque','lavender',
          'lightcyan','palevioletred'] 
#plot
g = plt.bar(events.event_name, events.frequency,
        0.7, label='Events', color=colors[1], alpha=0.8, edgecolor='black',linewidth=2)

#annotation
bar_label = events.frequency.tolist() #values for text
bar_label = [str(label) for label in bar_label]
def autolabel(rects):
    for idx,rect in enumerate(g):
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2., 1.02*height,
                bar_label[idx],
                ha='center', va='bottom', rotation=0, size=15)
        
autolabel(g)  

plt.legend() 
plt.show()
In [47]:
#pie chart for types of events

colors = ['lightblue', 'mediumturquoise', 'orange', 'lightgreen', 'plum', 'bisque','lavender',
          'lightcyan','palevioletred'] 

labels = events['event_name']
values = events['frequency']
fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.update_layout(title_text='Shares of different types of events')
fig.update_traces(hoverinfo='label+text+value', textfont_size=20, 
                  marker=dict(colors=colors, line=dict(color='#000000', width=2)))
fig.show()

According to our data there are 5 events, the most frequent one is MainScreen, it is followd by OfferScreen and CartScreen(19% and 18%), PaymentScreen is 14% and the least popular is Tutorial event (0.4%).

Number of users who performed each of these actions. Proportion of users who performed the action at least once.

Number of users who had each of these events


Firstly we will group the data by users ID and define the number of actions for every event for each user.

Then we will find the number of users who had each of these events by dropping those who have NaNs in any column.

In [48]:
df_complete['event_name'] = df_complete['event_name'].astype(str)
In [49]:
users_per_events = df_complete.pivot_table(index='user_id',columns='event_name',values='timestamp',aggfunc='count').reset_index()
users_per_events
Out[49]:
event_name user_id CartScreenAppear MainScreenAppear OffersScreenAppear PaymentScreenSuccessful Tutorial
0 6888746892508752 NaN 1.0 NaN NaN NaN
1 6909561520679493 1.0 2.0 1.0 1.0 NaN
2 6922444491712477 8.0 19.0 12.0 8.0 NaN
3 7435777799948366 NaN 6.0 NaN NaN NaN
4 7702139951469979 5.0 40.0 87.0 5.0 NaN
... ... ... ... ... ... ...
7529 9217594193087726423 3.0 NaN 3.0 3.0 NaN
7530 9219463515465815368 1.0 11.0 4.0 1.0 NaN
7531 9220879493065341500 1.0 4.0 1.0 NaN NaN
7532 9221926045299980007 NaN 7.0 NaN NaN NaN
7533 9222603179720523844 NaN 59.0 NaN NaN NaN

7534 rows × 6 columns

In [50]:
#Picking up the users that had all the events
users_all_events = users_per_events[users_per_events.notnull().all(axis=1)].reset_index(drop=True)
users_all_events
Out[50]:
event_name user_id CartScreenAppear MainScreenAppear OffersScreenAppear PaymentScreenSuccessful Tutorial
0 9841258664663090 8.0 10.0 8.0 2.0 1.0
1 29094035245869447 2.0 10.0 10.0 1.0 1.0
2 77364241990273403 7.0 68.0 17.0 5.0 1.0
3 84107839139059565 7.0 17.0 20.0 6.0 2.0
4 89078601045475920 9.0 9.0 10.0 4.0 1.0
... ... ... ... ... ... ...
461 9126019227179029392 4.0 13.0 11.0 2.0 1.0
462 9132678164084161445 2.0 14.0 2.0 2.0 1.0
463 9141951425188555065 5.0 7.0 12.0 4.0 1.0
464 9158074215019937335 1.0 3.0 8.0 1.0 1.0
465 9178917234910208523 4.0 19.0 26.0 4.0 1.0

466 rows × 6 columns

In [51]:
users_all_events.shape
Out[51]:
(466, 6)
In [52]:
#picking up all the users IDs that prformed all the actions and saving them in a separate list
user_id_all_events = users_all_events['user_id'].tolist()
user_id_all_events
Out[52]:
[9841258664663090,
 29094035245869447,
 77364241990273403,
 84107839139059565,
 89078601045475920,
 98792428456080413,
 116454972706892511,
 133486548927612375,
 147514000526855717,
 170708003193483797,
 203838813641156129,
 223230608818939965,
 239795526067822384,
 284324717439504162,
 288947541653489764,
 292563824900182495,
 308940995223272529,
 355651113645516604,
 378761015309962378,
 396921694825627632,
 436215067153454698,
 459177279692283865,
 515870279844015150,
 518297313815559468,
 533356125006419558,
 557622585227857656,
 559149934052182668,
 582264770456673374,
 617758068537391848,
 623940637775064759,
 624637828747274142,
 638598994881293831,
 646027458777469832,
 653322154533311661,
 655561191595997473,
 671922718563083939,
 674541835027541643,
 730864889612327407,
 733544924898733660,
 733563556336010524,
 745032079672450813,
 780798538749079517,
 784865651125640705,
 809934975781585929,
 826328741041532622,
 837505346045732408,
 863114068617027014,
 865006356323074101,
 888617422919024415,
 907516373885804773,
 909836633434978322,
 928893924262115877,
 941603036904761274,
 943476888243120350,
 972102914695797995,
 1025423183050919111,
 1052317782824159991,
 1064069332576331836,
 1113177574547813281,
 1158644344098364192,
 1217537191439577589,
 1266450414894434600,
 1309234519709630135,
 1362092663681182910,
 1373199082330812873,
 1392555893172405118,
 1411471843730263943,
 1446068587914200066,
 1479224128115420988,
 1492262579192542642,
 1493051727801961177,
 1501908752948001879,
 1518481827927168526,
 1535041857858656812,
 1536833952192676536,
 1582018592022318519,
 1601876840816097347,
 1604220726746226764,
 1630601299114293624,
 1656055605943893565,
 1675308806875779310,
 1683981161505856484,
 1705097928192746012,
 1729553690289446391,
 1736444126983851292,
 1750006600386297037,
 1800462567619643092,
 1857029223517498437,
 1884616937397676465,
 1897865515518262253,
 1919610389675188301,
 1921021973150069353,
 1922717626667551198,
 1925633498102429867,
 1984254026148347245,
 2045616237532505490,
 2056649048651349082,
 2131179201240516914,
 2147812610911632982,
 2158185118578389474,
 2187914245613435519,
 2252923612776743772,
 2254821219381623694,
 2254831023563140137,
 2300292234025330845,
 2317120910450320490,
 2326645966387673457,
 2335509695522126029,
 2339185552301026694,
 2380401036382252064,
 2382232476374805470,
 2382591782303281935,
 2391591272127366891,
 2433579795068677370,
 2435066505702064487,
 2438520924615884865,
 2449861889676075616,
 2467833017081929342,
 2520797580915790298,
 2540662527663401258,
 2543916046456558866,
 2580223756192068322,
 2602421655779348355,
 2648901857674049202,
 2655003569661492360,
 2656704080941585379,
 2664251778992193918,
 2674429336791405101,
 2730869064219188185,
 2785869854396882528,
 2789808063375807388,
 2815358815454061015,
 2825531614269746200,
 2826841367164455095,
 2831935993739683325,
 2863001408302662673,
 2882506928148115110,
 2885844202303657350,
 2927366723795973070,
 2934598532887670123,
 2998654154534786354,
 3006207371161637636,
 3009181927597130538,
 3012866736489345415,
 3048551176412455623,
 3063248032210795834,
 3095493255635125557,
 3116041239619625045,
 3118947941510680801,
 3132175184219980626,
 3163109716415429772,
 3193214496697386946,
 3209107700543981369,
 3278779007623731800,
 3296122068432336013,
 3304099753638451815,
 3325965414119122960,
 3357533074851319011,
 3383558428248432049,
 3432974492367330155,
 3452801840013988266,
 3461821332318259564,
 3467820901514995349,
 3494457950643850294,
 3495371200460644295,
 3524225896421726669,
 3530408535473465435,
 3551574897384590025,
 3551628613719517043,
 3583643616592343673,
 3583793367476243471,
 3590237880935165984,
 3603221798976493506,
 3610536745613892312,
 3642576412734261225,
 3666940454338004242,
 3699851072758855337,
 3727026283458859340,
 3737462046622621720,
 3741232419948091674,
 3797068122842442849,
 3841080606985462509,
 3851785345612397300,
 3911101470526499807,
 3929083066121281051,
 3933210228218775681,
 3933415997375240038,
 3976695185007308383,
 3978277293793862524,
 4013129726324754028,
 4033058607059900421,
 4043409844133645660,
 4122971274797290089,
 4127885408060891243,
 4135346488014272392,
 4140772931377659202,
 4166558499317749276,
 4169814227533210428,
 4197536517730418733,
 4270075519805190262,
 4277982896725647346,
 4294735467443373934,
 4317941094862063746,
 4325877916710814063,
 4369662623769092250,
 4422920184752474035,
 4430756990024652847,
 4438811436010620371,
 4450322282753054807,
 4452986566566254241,
 4483332814045365748,
 4502974168843754195,
 4505932751780538579,
 4531819286005097531,
 4540724666797253101,
 4541002331784686531,
 4613635923158598610,
 4632680059380639630,
 4647441570935782596,
 4659199390928967478,
 4716741813371028384,
 4752520790520449030,
 4756008038721683263,
 4758223822167675176,
 4805046500663660664,
 4820610343436068485,
 4827377269497927948,
 4870358463848296984,
 4873876122944444943,
 4876403292056911122,
 4877161838950700944,
 4891111378981166101,
 4891397824124625593,
 4923306958232804249,
 4930649730653097635,
 4939283755029305754,
 4947898935541290421,
 4949721956109290387,
 4995105532329648960,
 5022646214748735876,
 5030737562854930656,
 5034536472312079357,
 5080403137071458745,
 5116837134502602653,
 5147280779493817978,
 5153489763805063987,
 5160333374588452378,
 5204904482314476306,
 5254900839726794643,
 5263436064683810729,
 5307812315857857498,
 5337309839474502499,
 5395446687580561656,
 5422877498318042547,
 5435933625972989901,
 5441587343016299285,
 5485754166608937705,
 5493099436862505022,
 5508858000645147218,
 5512830938322632746,
 5515378553085864642,
 5518942930638974593,
 5536820068885117452,
 5562465358645335882,
 5571251285435914458,
 5587388800188073787,
 5593644690011886950,
 5604501023679978892,
 5613225282971366197,
 5614748211953213901,
 5628367715793986203,
 5637074651270632739,
 5645768126178532765,
 5684052780286856198,
 5750624843460840227,
 5779903777733962844,
 5799531893223586458,
 5801285380328622427,
 5806039985642855096,
 5879533156262070838,
 5889126109087941811,
 5905051422086405927,
 5938984576771641234,
 5962451360444563367,
 5986037612469200496,
 6023865108912042440,
 6049785439366761614,
 6064595880165020619,
 6070970247353498510,
 6098605206430780449,
 6116464130294403607,
 6122934530478519801,
 6150090257971150017,
 6166065378622704404,
 6198307587240214960,
 6204944347213293942,
 6220847999332178356,
 6232492269159044807,
 6234930871538147758,
 6262841230086289949,
 6264364458061685488,
 6267771825452287061,
 6288445842539248947,
 6317926867000418796,
 6332565246182501322,
 6404569898667835532,
 6419712945560014019,
 6464348338527850554,
 6480357434506567791,
 6539959124219116392,
 6572690767876555543,
 6597768662667489475,
 6597983453793337280,
 6631757153124741841,
 6631893489958899803,
 6640752594751216025,
 6668887069054142551,
 6704598838011540725,
 6708141604972311142,
 6724375049518009324,
 6751906246875967649,
 6765073978511963806,
 6779562535677168482,
 6826515621767838631,
 6875369516670876232,
 6914750135397878804,
 6922632766372070464,
 6929928383094252416,
 6936477402888535239,
 6970800422549976561,
 6985989271562289673,
 6992868177979592023,
 7002515446915619058,
 7007003297763794436,
 7014863560555513828,
 7023620996158891358,
 7029396129325719562,
 7032048126210952560,
 7041730643912578183,
 7042152873032759236,
 7046379850003309712,
 7065239911766460173,
 7070555557075794622,
 7086945417901847497,
 7105359344065295137,
 7119918875908384799,
 7139541019841169923,
 7196016872209902268,
 7221645881620243073,
 7233701091713551965,
 7245610086945774223,
 7264599055576279549,
 7264733129254129302,
 7267375968060049678,
 7273808698005665705,
 7284403121311465049,
 7288884190716830543,
 7336313834595872271,
 7353567953905112235,
 7358857170038129131,
 7402609239155108606,
 7447837304039048449,
 7451823951764636968,
 7463871755604111263,
 7512600910434231438,
 7543879591140518446,
 7572795345577928939,
 7576574547629699356,
 7586870289028771156,
 7586949343857555269,
 7590170331667140164,
 7605717256678352765,
 7610224088593907423,
 7641365079564560002,
 7641805214591810162,
 7664408132782564716,
 7721164141360704808,
 7726349659490295219,
 7727825530379273053,
 7735210845999539372,
 7736080671173437017,
 7738058666231999878,
 7753775826753824773,
 7754794048474479150,
 7766250438405005043,
 7804999398574389599,
 7815258071961218075,
 7816427477498327101,
 7820584748549110105,
 7826449443626647741,
 7934103980026732017,
 7982384314356586506,
 7984672678297688740,
 7997218022319749424,
 8023623688519544332,
 8049055963688342379,
 8067489480903535024,
 8067897711025429953,
 8070558855125722674,
 8076549449581024043,
 8087553844272650070,
 8118046800480174342,
 8131554249272658553,
 8160667040437994655,
 8189122927585332969,
 8195030409729380348,
 8213817696376199484,
 8242035105429673674,
 8263969539306689447,
 8274231271397262591,
 8282840613247092336,
 8395621917387012808,
 8413558542049059754,
 8442928648791656830,
 8443840048179803128,
 8455446443714796356,
 8459660542115909970,
 8460712404206805701,
 8494595990824157619,
 8494763562519130801,
 8497782881180506913,
 8500974247653924940,
 8501796086661118753,
 8514231736313948445,
 8534398700246007081,
 8540811162439173094,
 8563831142507873525,
 8565250933994196510,
 8591356659592885582,
 8602226924705431908,
 8618336374025883585,
 8629169480120469572,
 8644931766545549368,
 8645642686473708628,
 8658456654739799479,
 8681724991928604124,
 8682440863173047512,
 8697620253425868959,
 8726336886471952976,
 8758422369781029411,
 8777962667596552465,
 8789279460922105131,
 8791890665368430794,
 8803711870189910977,
 8852708786245585243,
 8869888494132684245,
 8874849506441254133,
 8880121597063492612,
 8935589408717176617,
 8945952918647295565,
 8971576789159450109,
 8979537786938712883,
 8982013177812162195,
 9007552876922927579,
 9011905890878548886,
 9046271195406550844,
 9064787769109500946,
 9069283069873325900,
 9102150555208283035,
 9111078617775819038,
 9124010241211887806,
 9126019227179029392,
 9132678164084161445,
 9141951425188555065,
 9158074215019937335,
 9178917234910208523]
In [53]:
#number of unique users that performed all the events besides Tutorial.
users_all_events_without_tutorial =users_per_events.loc[:, users_per_events.columns != 'Tutorial']
users_all_events_without_tutorial = users_all_events_without_tutorial[users_all_events_without_tutorial.notnull().all(axis=1)].reset_index(drop=True)
users_all_events_without_tutorial.shape
Out[53]:
(3429, 5)

This way we have received a table with user IDs that performed all the events, and a list of IDs of those users.
There are 466 users out of 7534 that performed all the events.

In [54]:
#Calculating number of sessions for each event
sums = users_all_events.select_dtypes(np.float).sum().rename('total')
# append sums to the data frame
events_users_total = users_all_events.append(sums)
events_users_total.tail(1)
Out[54]:
event_name user_id CartScreenAppear MainScreenAppear OffersScreenAppear PaymentScreenSuccessful Tutorial
total NaN 6419.0 7422.0 7467.0 4300.0 542.0

MainScreenAppear and OffersScreenAppear have almost the same number of sessions, slightly less users got to CartScreen. Tutorial screen had the least number of sessions.

Sorting the events by the number of users.

We will group the data base by event name and calculate the number of unique users for each event.

In [55]:
#filtering out the data base containing only users that completed all events. 
events_all = df_complete[df_complete['user_id'].isin(user_id_all_events)]
events_all.head()
Out[55]:
event_name user_id timestamp experiment_number event_date event_time
2826 Tutorial 3737462046622621720 2019-08-01 00:07:28 246 2019-08-01 00:07:28
2827 MainScreenAppear 3737462046622621720 2019-08-01 00:08:00 246 2019-08-01 00:08:00
2828 MainScreenAppear 3737462046622621720 2019-08-01 00:08:55 246 2019-08-01 00:08:55
2829 OffersScreenAppear 3737462046622621720 2019-08-01 00:08:58 246 2019-08-01 00:08:58
2832 OffersScreenAppear 3737462046622621720 2019-08-01 00:10:26 246 2019-08-01 00:10:26
In [56]:
events_unique_users = df_complete.groupby('event_name').agg({'user_id': ['nunique']}).reset_index()
events_unique_users.columns = ['event_name','users_number'] #renaming columns
events_unique_users = events_unique_users.sort_values(by='users_number', ascending=False)
events_unique_users.head(10)
Out[56]:
event_name users_number
1 MainScreenAppear 7419
2 OffersScreenAppear 4593
0 CartScreenAppear 3734
3 PaymentScreenSuccessful 3539
4 Tutorial 840

Visualising the calculations.

In [57]:
plt.figure(figsize=(12,5))

# plot barh chart 
ax = sns.barplot(events_unique_users.event_name, events_unique_users.users_number, palette='mako') 
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.set(xlabel="events", ylabel='users')

ax.set_xticklabels(events_unique_users.event_name)
for item in ax.get_xticklabels(): item.set_rotation(0)
for i, v in enumerate(events_unique_users["users_number"].iteritems()):        
    ax.text(i ,v[1], "{:,}".format(v[1]), color='k', va ='bottom', rotation=0)

plt.title('Unique users per event')
plt.show()

The bar chart illustrates that the biggest amount of unique users performed 'MainScreenAppear', such events as 'OffersScreenAppear', 'CartScreenAppear' and 'PaymentScreenSuccessful' ah sapproximately the same amount of unique users (from 3,500 to 4,500 users). And the least popular event was 'Tutorial' with only 840 unique users visits.

Calculate the proportion of users who performed the action at least once.

Event performed at least once.

In [58]:
events_at_least_once = df_complete.groupby('event_name').agg({'user_id': ['nunique', 'count']}).reset_index()
events_at_least_once.columns = ['event_name','users_unique', 'total_users'] #renaming columns
events_at_least_once.head(10)
Out[58]:
event_name users_unique total_users
0 CartScreenAppear 3734 42303
1 MainScreenAppear 7419 117328
2 OffersScreenAppear 4593 46333
3 PaymentScreenSuccessful 3539 33918
4 Tutorial 840 1005
In [59]:
#calculating proportion of users who who performed the action at least once:
#share of users that performed event at least once in comparison to total number of users that performed the event:
events_at_least_once['users_once_proportion'] =(
    events_at_least_once['users_unique'] 
    / events_at_least_once['total_users']).round(2) 
#share of users that performed event at least once per each event
events_at_least_once['users_unique_proportion'] = (
    events_at_least_once['users_unique'] 
    /df_complete['user_id'].nunique()).round(2)  #corrected line
events_at_least_once = events_at_least_once.sort_values(by='users_once_proportion', ascending=False)
events_at_least_once
Out[59]:
event_name users_unique total_users users_once_proportion users_unique_proportion
4 Tutorial 840 1005 0.84 0.11
2 OffersScreenAppear 4593 46333 0.10 0.61
3 PaymentScreenSuccessful 3539 33918 0.10 0.47
0 CartScreenAppear 3734 42303 0.09 0.50
1 MainScreenAppear 7419 117328 0.06 0.98

Visualising the calculations.

In [60]:
plt.figure(figsize=(12,5))

# plot barh chart 
ax = sns.barplot(events_at_least_once.event_name, events_at_least_once.users_once_proportion, palette='mako') 
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.set(xlabel="events", ylabel='Percentage of users')

ax.set_xticklabels(events_at_least_once.event_name)
for item in ax.get_xticklabels(): item.set_rotation(0)
for i, v in enumerate(events_at_least_once["users_once_proportion"].iteritems()):        
    ax.text(i ,v[1], "{:.1f}%".format(v[1]*100), color='k', va ='bottom', rotation=0)

plt.title('Percentage of users that performed the action at least once (to total number of users that performed the event)')
plt.show()
In [61]:
plt.figure(figsize=(12,5))

# plot barh chart 
ax = sns.barplot(events_at_least_once.event_name, events_at_least_once.users_unique_proportion, palette='mako') 
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.set(xlabel="events", ylabel='Percentage of users')

ax.set_xticklabels(events_at_least_once.event_name)
for item in ax.get_xticklabels(): item.set_rotation(0)
for i, v in enumerate(events_at_least_once["users_unique_proportion"].iteritems()):        
    ax.text(i ,v[1], "{:.1f}%".format(v[1]*100), color='k', va ='bottom', rotation=0)

plt.title('Percentage of users that performed the action at least once (to total number of once performed events)')
plt.show()

Among the events that performed at least once MainScreen is the leader with 37%.Then go the Offer, Payment and Cart events with almost equal shares (18-23%) and Tutorial event is the least popular with only 4%. 84% of all users that got to the Tutorial screen viewed it once. That means that only 16% of users came back to it.
To the contrary, only 2% of users that got to the MainScreen never came back, all the others (98%) viewed this screen more than once. OfferScreen, PaymentScreen and CartScreen were viewed once by 50-60% of all users, thus 40-50% of users came back to this screens more than once.

Event performed only once.

Now we will define the proportion of users who performed every action at least once.
Firstly, we will define how many users performed the action once, then we will calculate the total number of users that performed every action and finally we will state the proportion.

In [62]:
#comning back to the the table of events/sessions per user
users_per_events
Out[62]:
event_name user_id CartScreenAppear MainScreenAppear OffersScreenAppear PaymentScreenSuccessful Tutorial
0 6888746892508752 NaN 1.0 NaN NaN NaN
1 6909561520679493 1.0 2.0 1.0 1.0 NaN
2 6922444491712477 8.0 19.0 12.0 8.0 NaN
3 7435777799948366 NaN 6.0 NaN NaN NaN
4 7702139951469979 5.0 40.0 87.0 5.0 NaN
... ... ... ... ... ... ...
7529 9217594193087726423 3.0 NaN 3.0 3.0 NaN
7530 9219463515465815368 1.0 11.0 4.0 1.0 NaN
7531 9220879493065341500 1.0 4.0 1.0 NaN NaN
7532 9221926045299980007 NaN 7.0 NaN NaN NaN
7533 9222603179720523844 NaN 59.0 NaN NaN NaN

7534 rows × 6 columns

In [63]:
#number of users who performed the action CartScreenAppear once 
Users_one_CartScreenAppear = users_per_events.query('CartScreenAppear==1.0')['user_id'].count()
Users_one_CartScreenAppear
Out[63]:
472
In [64]:
#total number of users who performed the action CartScreenAppear
Users_all_CartScreenAppear = users_per_events.query('CartScreenAppear>=1.0')['user_id'].count()
Users_all_CartScreenAppear
Out[64]:
3734
In [65]:
#number of users who performed the action MainScreenAppear once 
Users_one_MainScreenAppear = users_per_events.query('MainScreenAppear==1.0')['user_id'].count()
Users_one_MainScreenAppear
Out[65]:
246
In [66]:
Users_all_MainScreenAppear = users_per_events.query('MainScreenAppear>=1.0')['user_id'].count()
Users_all_MainScreenAppear
Out[66]:
7419
In [67]:
#number of users who performed the action OffersScreenAppear once 
Users_one_OffersScreenAppear = users_per_events.query('OffersScreenAppear==1.0')['user_id'].count()
Users_one_OffersScreenAppear
Out[67]:
681
In [68]:
Users_all_OffersScreenAppear = users_per_events.query('OffersScreenAppear>=1.0')['user_id'].count()
Users_all_OffersScreenAppear
Out[68]:
4593
In [69]:
#number of users who performed the action PaymentScreenSuccessful once 
Users_one_PaymentScreenSuccessful = users_per_events.query('PaymentScreenSuccessful==1.0')['user_id'].count()
Users_one_PaymentScreenSuccessful
Out[69]:
574
In [70]:
Users_all_PaymentScreenSuccessful = users_per_events.query('PaymentScreenSuccessful>=1.0')['user_id'].count()
Users_all_PaymentScreenSuccessful
Out[70]:
3539
In [71]:
#number of users who performed the action Tutorial once 
Users_one_Tutorial = users_per_events.query('Tutorial==1.0')['user_id'].count()
Users_one_Tutorial
Out[71]:
756
In [72]:
#total number of users who performed the action Tutorial
Users_all_Tutorial = users_per_events.query('Tutorial>=1.0')['user_id'].count()
Users_all_Tutorial
Out[72]:
840

Now we will form a table with the received data.

In [73]:
lst = [['MainScreenAppear', Users_one_MainScreenAppear, Users_all_MainScreenAppear], 
       ['OffersScreenAppear', Users_one_OffersScreenAppear, Users_all_OffersScreenAppear],
       ['CartScreenAppear', Users_one_CartScreenAppear, Users_all_CartScreenAppear], 
       ['PaymentScreenSuccessful', Users_one_PaymentScreenSuccessful, Users_all_PaymentScreenSuccessful], 
       ['Tutorial', Users_one_Tutorial, Users_all_Tutorial]] 
    
users_events_proportion = pd.DataFrame(lst, columns =['events', 'users_one_event', 'all_users_event']) 
users_events_proportion 
Out[73]:
events users_one_event all_users_event
0 MainScreenAppear 246 7419
1 OffersScreenAppear 681 4593
2 CartScreenAppear 472 3734
3 PaymentScreenSuccessful 574 3539
4 Tutorial 756 840
In [74]:
#calculating proportion of users who who performed the action once:
users_events_proportion['users_once_proportion'] =users_events_proportion['users_one_event'] / users_events_proportion['all_users_event']
users_events_proportion['users_once_proportion'] = users_events_proportion['users_once_proportion'].round(2)
users_events_proportion
Out[74]:
events users_one_event all_users_event users_once_proportion
0 MainScreenAppear 246 7419 0.03
1 OffersScreenAppear 681 4593 0.15
2 CartScreenAppear 472 3734 0.13
3 PaymentScreenSuccessful 574 3539 0.16
4 Tutorial 756 840 0.90

Visualising the calculations.

In [75]:
plt.figure(figsize=(12,5))

# plot barh chart 
ax = sns.barplot(users_events_proportion.events, users_events_proportion.users_once_proportion, palette='mako') 
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.set(xlabel="events", ylabel='Percentage of users')

ax.set_xticklabels(users_events_proportion.events)
for item in ax.get_xticklabels(): item.set_rotation(0)
for i, v in enumerate(users_events_proportion["users_once_proportion"].iteritems()):        
    ax.text(i ,v[1], "{:.1f}%".format(v[1]*100), color='k', va ='bottom', rotation=0)

plt.title('Percentage of users that performed the action only once')
plt.show()


The chart shows that the least popular category is still Tutorial, 90% of users never come back to this page. The most popular page is MainScreen, only 3% of users saw it once and did not come back to it again.

In what order actions took place. Are all of them part of a single sequence?

We suppose that for most of the users the funnel looks as follows: 'MainScreenAppear', 'OffersScreenAppear', 'CartScreenAppear', 'PaymentScreenSuccessful'.
'Tutorial' event is an annecessary step in this process and it can take place at avery stage or can not take place at all. We will not take it into account when calculating the funnel.

Event funnel and share of users that proceed from each stage to the next. Ratio of users at stage B to the number of users at stage A and the ratio of users at stage C to the number at stage B.

(Calculating the funnel for the sequence of events MainScreenAppear → OffersScreenAppear → CartScreenAppear → PaymentScreenSuccessful and the ratio of users at stage B to the number of users at stage A and the ratio of users at stage C to the number at stage B.)

In [76]:
#table for calculating funnel
users = df_complete.query('event_name!="Tutorial"').pivot_table(
    index='user_id', 
    columns='event_name', 
    values=['timestamp'],
    aggfunc='min').reset_index()
users.columns = ['user_id', 'CartScreenAppear', 'MainScreenAppear', 'OffersScreenAppear', 'PaymentScreenSuccessful']
users.head()
Out[76]:
user_id CartScreenAppear MainScreenAppear OffersScreenAppear PaymentScreenSuccessful
0 6888746892508752 NaT 2019-08-06 14:06:34 NaT NaT
1 6909561520679493 2019-08-06 18:52:58 2019-08-06 18:52:54 2019-08-06 18:53:04 2019-08-06 18:52:58
2 6922444491712477 2019-08-04 14:19:40 2019-08-04 14:19:33 2019-08-04 14:19:46 2019-08-04 14:19:40
3 7435777799948366 NaT 2019-08-05 08:06:34 NaT NaT
4 7702139951469979 2019-08-02 14:28:45 2019-08-01 04:29:54 2019-08-01 04:29:56 2019-08-02 14:28:45
In [77]:
#calculating number of users on each step of the funnel according to the time when the action took place
step_1 = ~users['MainScreenAppear'].isna()
step_2 = step_1 & (users['OffersScreenAppear'] > users['MainScreenAppear'])
step_3 = step_2 & (users['CartScreenAppear'] > users['OffersScreenAppear'])
step_4 = step_3 & (users['PaymentScreenSuccessful'] > users['CartScreenAppear'])

n_MainScreen = users[step_1].shape[0]
n_OffersScreen = users[step_2].shape[0]
n_CartScreen = users[step_3].shape[0]
n_PaymentScreen = users[step_4].shape[0]

print('Visitors:', n_MainScreen)
print('Viewed the offer:', n_OffersScreen)
print('Added product to a cart:', n_CartScreen)
print('Paid:', n_PaymentScreen)
Visitors: 7419
Viewed the offer: 4201
Added product to a cart: 1767
Paid: 454

In order to calculat funnel metrics and to build it we will use the Funnel() method. We will buit it on basis of the previous calculatuons (table events_unique_users).

In [78]:
#forming a table for the funnel chart

list1 = [['MainScreenAppear', n_MainScreen], 
       ['OffersScreenAppear', n_OffersScreen],
       ['CartScreenAppear', n_CartScreen], 
       ['PaymentScreenSuccessful', n_PaymentScreen]] 
    
events_unique_users_funnel = pd.DataFrame(list1, columns =['events', 'users_number']) 
events_unique_users_funnel 
Out[78]:
events users_number
0 MainScreenAppear 7419
1 OffersScreenAppear 4201
2 CartScreenAppear 1767
3 PaymentScreenSuccessful 454
In [79]:
#calculating ratio of the users to the previous stage:
events_unique_users_funnel['ratio']= (events_unique_users_funnel['users_number']
                                      .div(events_unique_users_funnel['users_number'].shift(1))
                                      .round(2)
                                     )
events_unique_users_funnel.fillna(0)
Out[79]:
events users_number ratio
0 MainScreenAppear 7419 0.00
1 OffersScreenAppear 4201 0.57
2 CartScreenAppear 1767 0.42
3 PaymentScreenSuccessful 454 0.26
In [80]:
events_unique_users_funnel['percentage_ch'] = events_unique_users_funnel['users_number'].pct_change()*100
events_unique_users_funnel.fillna(0)
Out[80]:
events users_number ratio percentage_ch
0 MainScreenAppear 7419 0.00 0.000000
1 OffersScreenAppear 4201 0.57 -43.375118
2 CartScreenAppear 1767 0.42 -57.938586
3 PaymentScreenSuccessful 454 0.26 -74.306735

To check the funnel

In [81]:
plt.figure(figsize=(12,5))

# plot barh chart 
ax = sns.barplot(events_unique_users_funnel.events, events_unique_users_funnel.percentage_ch, palette='mako') 
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.set(xlabel="events", ylabel='Percentage of users')

ax.set_xticklabels(events_unique_users_funnel.events)
for item in ax.get_xticklabels(): item.set_rotation(0)
for i, v in enumerate(events_unique_users_funnel["percentage_ch"].iteritems()):        
    ax.text(i ,v[1], "{:.1f}%".format(v[1]), color='k', va ='bottom', rotation=0)

plt.title('Bar chart for events funnel (percentage of user losts on every stage)')
plt.show()
posx and posy should be finite values
posx and posy should be finite values

We can see that the share of users that do not make it to the next stage of the funnel is getting bigger with every step of the funnel, with the biggest losts on the last stage (74% of users that get to cart screen do not complete the payment).

Funnel

In [82]:
#building the funnel

fig = go.Figure(go.Funnel(y = events_unique_users_funnel.events, x = events_unique_users_funnel.users_number, 
                          opacity = 0.85, marker = {"color": ['#333160', '#5b5895', '#3ac7c3' , '#c6f7e7']}))
                                                 
fig.show()

The funnel chart illustrates the fact that the smallest percentage of users get lost on the very first step, 57% of users that got to the MainScreen will get to the OffersScreen.
On the next step 58% do not get to the CartScreen and get lost. And losts on the last step are the biggest, just 26% of users that got to CartScreen make successful payment.
This way 6% of users that got to the MainScreen will make a purchase.

Conclusion.

There are 5 events forming the funnel, the most frequent one is MainScreen, it is followd by OfferScreen and CartScreen(19% and 18%), PaymentScreen is 14% and the least popular is Tutorial event (0.4%).
There are only 466 unique users that went through all of the events ( out of 7534 users, 6.2%). If we omite Tutorial event as the least popular one, the number of users that performed all the events will be much higher, 3429 users.
If we count number of sessions per event we will see that MainScreen event and OffersScreen have very close number of sessions (7400-7500), slightly less sessdions are there for Cart event (6400) and even less seeeions are there for the Payment event (4300).
As to events that have been performed at least once MainScreen is the leader, 37% of users viewd this creen at least once. Then go the Offer, Payment and Cart events with almost equal shares (18-23%) and Tutorial event is the least popular with only 4% of users.
84% of all users that got to the Tutorial screen viewed it once. That means that only 16% of users came back to it.
To the contrary, only 6% of users that got to the MainScreen never came back, all the others (94%) viewed this screen more than once. OfferScreen, PaymentScreen and CartScreen were viwed once by 9-10% of all users, thus 90-91% of users came back to this screens more than once.

If we calculate the number of users that performed the event only once, we will see the least popular category is still Tutorial, 90% of users never come back to this page. The most popular page is MainScreen, only 3% of users saw it once and did not come back to it again.
Next we have calculated the funnel for the sequence of events MainScreenAppear → OffersScreenAppear → CartScreenAppear → PaymentScreenSuccessful as Tutorial obviously is not part of the main funnel.
We have got the following numbers: MainScreen visitors: 7419, viewed the offer page: 4201, added product to a cart: 1767, paid: 454. This way we got 6% conversion rate for this funnel. Other users might get converted going through another steps, because general number of users that got to Payment page is higher (3539 unique users).
The biggest share of users got lost on the stage of payment : only 26% of users that put the products to the cart eventually pay for them. This stage should be studed more thoroughly.
Possible issues to improve:

  • add more payment options (MasterCard/Visa/PayPal/payment from bank account, etc.),
  • add secure certificate to the payment page,
  • add special offer urging people to pay immediately (eg. discount, free returns, free shippng, etc for purchase within 1 hour ).

Part 4. Study of the results of experiment.

How many users are there in each group

In [83]:
#calculating the number of unique users for each group
experiment_groups = df_complete.groupby('experiment_number').agg({'user_id': 'nunique'}).reset_index()
experiment_groups
Out[83]:
experiment_number user_id
0 246 2484
1 247 2513
2 248 2537
In [84]:
plt.figure(figsize=(9,5))

# plot barh chart 
ax = sns.barplot(experiment_groups.experiment_number, experiment_groups.user_id, palette='mako') 
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.set(xlabel="experiment no.", ylabel='Number of users')

ax.set_xticklabels(experiment_groups.experiment_number)
for item in ax.get_xticklabels(): item.set_rotation(0)
for i, v in enumerate(experiment_groups["user_id"].iteritems()):        
    ax.text(i ,v[1], "{:,}".format(v[1]), color='k', va ='bottom', rotation=0)

plt.title('Number of users in each group')
plt.show()

We have 3 groups with almost equal number of users. Still there is a difference that might skew the further results. We will pick up 2484 users randomly with sample() method.

A/A test. Checking a statistically significant difference between samples 246 and 247

In [85]:
#checking criteria for A/A tests (The number of users in different groups doesn't vary by more than 1%)
experiment_groupsAA = experiment_groups[experiment_groups.experiment_number!=248]
experiment_groupsAA.columns = ['experiment_number', 'users_number']
experiment_groupsAA['ratio']= (experiment_groupsAA['users_number']
                                      .div(experiment_groupsAA['users_number'].shift(1))
                                      .round(3)
                                     )
experiment_groupsAA
Out[85]:
experiment_number users_number ratio
0 246 2484 NaN
1 247 2513 1.012

Number of users differ slightly more than by 1%, thus we will pick up 2484 users randomly with sample() method.

In [86]:
import random
users_247_raw_ID = df_complete[df_complete.experiment_number==247]['user_id'].unique().tolist()
users_247_ID = random.sample(users_247_raw_ID, k=2484)

display(users_247_ID) #we got list of 2484 unique user IDs for group 247
[5434797934758514583,
 7041725694001352181,
 8164821368561674670,
 4063469539476738161,
 8971576789159450109,
 5345027845561189180,
 3204810626486562900,
 5211717548134904196,
 2005314201504917278,
 3420371657225579529,
 3773590489309316043,
 4281479743095364057,
 2457909310456664267,
 6515929547164029764,
 3649034844806398516,
 2565014492448549049,
 4463574575890770992,
 80189413012306034,
 3808501295754195603,
 3437992091268681789,
 6996377124129943223,
 3885755640026707322,
 3232686571385637466,
 8559349295842617988,
 4052037525335308563,
 8957738111190945364,
 3300490215914857621,
 1561929312251625465,
 8573758417293729771,
 6900761739159412343,
 7635987661756527075,
 7586870289028771156,
 4455166999438232842,
 8071501282247284219,
 6998684233879494675,
 8750541469351821193,
 1379111342050013977,
 4171803676862750023,
 9116152523400748244,
 1529531054838639703,
 6070970247353498510,
 4411383770693565301,
 5911149890644227236,
 1084823616235555727,
 3929161457308552287,
 2192727391143803008,
 5156509171070673793,
 7427153236891171640,
 2709220656435554465,
 5657373150859017272,
 8056993866328083291,
 6821579127445036070,
 2981446198034507991,
 7176043484079675376,
 4438088437909683877,
 2460448352470841440,
 2785300585388283683,
 1741785633060377552,
 840247365306142821,
 248717507504048864,
 6690524421818999342,
 5511590840095686855,
 2032372810076622488,
 5006225713885460557,
 6260142250503168042,
 6343888653134816913,
 4342525543743792989,
 4598149768920567729,
 2374859313935223081,
 5287660122200561379,
 6561956771435011437,
 3191946297519275479,
 1891735264651988282,
 7060940278888962128,
 6598157157289930061,
 2956577131512679581,
 3746337140292909864,
 8111734619558428480,
 3077347585945449403,
 5440812812156911778,
 8516165470648523226,
 5406554203316670962,
 7660956866918094367,
 2095513126738999163,
 1738981368825823516,
 95206628315371849,
 2470999879924103731,
 483756982373773158,
 8026098042912885716,
 6093254895669475889,
 8276290564820986605,
 604405025443423456,
 4885257867404324290,
 7086174885550635991,
 6782197887982484285,
 7754650590239285010,
 6885047672362855775,
 1204908440567646625,
 6304358625369144825,
 3211698728226180240,
 1431362592250025125,
 1586969823172250754,
 8712466683546436835,
 6826908666924960365,
 8037263541858200996,
 8419358778925334550,
 6572690767876555543,
 1670211418297317122,
 6175313603189721946,
 3221676487280452576,
 8012620272409594699,
 4058609133338350099,
 6892072086341498458,
 3073362467512058984,
 5530248251237125348,
 97337799800097594,
 5178688081041235439,
 6483696045451304132,
 2631876180975097434,
 663948373672770349,
 8574414036177860633,
 350294727339902264,
 6146544816446732640,
 2882200197863522104,
 412269363258780118,
 3026872631070465498,
 4076638110861270577,
 4890328091489891497,
 8016534965909266602,
 3248933996331843734,
 2207627660352986049,
 1715800039370609036,
 4393383658902309167,
 227598988643318117,
 1599641448679221409,
 7156980657883694355,
 4513253165150520238,
 797066677504151142,
 3285958772927629473,
 6049698452889664846,
 6347164841918296916,
 742580055368777840,
 6994783392016081031,
 780798538749079517,
 3160905430797231576,
 4005352094921500182,
 9096238114140624904,
 5933072019017392890,
 3836530606841486281,
 1031265651214490703,
 1345903714229858630,
 3536055684618607603,
 780170411326456210,
 3551574897384590025,
 1225191038684666184,
 3484031245142231756,
 6446869071403350071,
 4348726185529335224,
 7815258071961218075,
 7524162245056732296,
 1335063544016004118,
 1803879332490491518,
 6159839327402499879,
 7789250019385955507,
 4480708264041321895,
 3796035091082529871,
 3582207040694686864,
 2928789257267509804,
 6609496381739008570,
 6163512637322707646,
 2207792128621220799,
 4962644996224402898,
 8515445940451596447,
 7478611464964500431,
 577703227672734143,
 9146630619602594005,
 3523280358013629980,
 772204008534156278,
 1861025584333564975,
 250419112136655315,
 1897338436453039992,
 2886396675297658624,
 5359844283694947197,
 1817601080103596916,
 3871721376096734504,
 5854783240198438284,
 8674975208611386264,
 4290402023926651764,
 7790656657113044342,
 4871079487372624585,
 5508858000645147218,
 2180779893563165231,
 1362092663681182910,
 8558593082514726822,
 3032338719854871114,
 7211145819222855510,
 6832504382338094957,
 8071397669512236988,
 741217858549528863,
 7525415361824974551,
 5510074219208192462,
 7512600910434231438,
 2844576250831680617,
 2750875175953778212,
 617306855803490709,
 5125359155373364263,
 1880319123864140842,
 4111480625662873403,
 9160437016685643194,
 3629657580710589645,
 4432111793939255075,
 8092568022639166769,
 3684378369323273051,
 596187849598934952,
 2617769254195403935,
 7213580412554817871,
 2874246907812105267,
 4596237489472252259,
 3093106157885860283,
 9196211109317457061,
 5085957470529806686,
 8329821754358160004,
 6504780035871286715,
 5682865148355002285,
 1879130487676858089,
 4607205867762114846,
 3911101470526499807,
 8729033948279972045,
 415393140403992081,
 4476597660224521966,
 7105359344065295137,
 8937198885181670270,
 1180716372926977637,
 9150135344042745280,
 8996651817113323223,
 7062405239100452586,
 8463802514068906041,
 6696922283717026581,
 28755862496905658,
 5236297813268426476,
 4035390462425372458,
 4823397055602788835,
 7536805326914916609,
 7153351231169442634,
 4619061466701287834,
 6972336377288863960,
 2088030388420733246,
 6075320301699028338,
 8945952918647295565,
 6404569898667835532,
 7717097133758101868,
 2089674818237535094,
 2256353318851497275,
 7583394010848707296,
 3248559101713000283,
 1559671681759212771,
 6787941801882943547,
 4020319949421921344,
 5533243256109303942,
 6799383944721464796,
 4449003779476868447,
 6746926855076274118,
 4699738035785774703,
 692448913275740475,
 4578652018867345590,
 7740948947612091870,
 2317120910450320490,
 89892936241398759,
 1262087367072548999,
 8532287688124147551,
 433700751848178337,
 7721936847161203073,
 3123721304325646854,
 1818888580792312774,
 6192195739359841186,
 6815658305565891028,
 7980064927529100411,
 948465712512390382,
 6252230086817013506,
 5943471568657248517,
 4891648599712893198,
 7701922487875823903,
 7604471555188247197,
 7311730072021703895,
 1143559367744842371,
 6140853948187673009,
 827216445999611892,
 8087553844272650070,
 5996659978193962314,
 5851462201183951457,
 9016099160389873252,
 8038978653690209956,
 537476820791202839,
 3193582066462778787,
 837542968599109939,
 8435417606546387032,
 8772235509545694181,
 5258511259371092883,
 4999260814175796103,
 8922233068786578578,
 980160307431341942,
 8885405957629088900,
 2754977819347976900,
 1635717576013174577,
 2310567447000311087,
 1566802255817691325,
 3138310861905027081,
 8347608816036482515,
 6446471938317348623,
 6761169916474158355,
 289228914209940039,
 9131593510407366655,
 2699157127198196365,
 7275750595741415851,
 62735763365307800,
 475525573494502277,
 2384329576990003082,
 3881891539348553720,
 4266455871264979429,
 3717416450580849209,
 2527682174074702466,
 2619428748601141020,
 6757980303432985206,
 6530789698008842148,
 2892388096807008990,
 4277982896725647346,
 5736280603796322581,
 3454683894921357834,
 1477475127703333342,
 2298594150724479628,
 8528069328405179526,
 8820704299342029288,
 4746496276455001678,
 9070527898450313539,
 6184735480528092692,
 1534051565705796373,
 1259262890422987763,
 1272316250977084199,
 4721204817879592643,
 1359329337915319993,
 4834512160020311565,
 7495249977808105846,
 1754140665440434215,
 3941321847932123889,
 2784928666669179848,
 1670499237495260603,
 4906589897545968296,
 4722282609251751140,
 281554451473665610,
 1526073585896640732,
 8739369198187851757,
 4633901593749805109,
 6480357434506567791,
 6160101665506526484,
 1414294169776581033,
 8840207768107775025,
 2626127786567605048,
 7056229000292242885,
 5918021282103598553,
 4608405459427890239,
 7300399240480510813,
 769261358925388390,
 4123270163943493691,
 8238253336270343940,
 4799289157564817423,
 4819017223835777121,
 70788136061563551,
 550970718764346073,
 4960808833493228395,
 5335957243538234701,
 474977146243168982,
 5435933625972989901,
 6537245976200136132,
 6589262034654130848,
 8053642028543447582,
 269309715259166977,
 7984672678297688740,
 5652604965383417961,
 1976242610478736343,
 944989533256675646,
 747706080571680169,
 2046817531179798625,
 5518942930638974593,
 1270866102649465227,
 3807928414254040303,
 3617745956307343362,
 6122171114561079367,
 7472834843463961465,
 1700856018280760225,
 5819359051161727651,
 741809788823411824,
 3978799009258242156,
 3015692701687158638,
 5794857060312167190,
 3590237880935165984,
 8294276109582672181,
 7486666725253402125,
 416669255233170069,
 7483068627836554893,
 5688900391652229132,
 5680622852857465310,
 3452784367893691732,
 4637708277741596490,
 1119518190440550684,
 305594859284003311,
 5556707202539887138,
 1051093549596532640,
 6144835774629021314,
 1483777619988800179,
 4432437748631167125,
 46236857309457992,
 8925830967888917007,
 1362955832619669880,
 4981103726341816137,
 2045565577415967787,
 1219817261256577411,
 8215030345264803518,
 7046787466259412749,
 4560747082510588984,
 7884230493916284844,
 1066075817123227424,
 1729553690289446391,
 744631185401435026,
 4671054288201913373,
 2031466986953527006,
 4878048079697205617,
 4502964825932212315,
 8987628949126410882,
 1126021718529336913,
 8610656671771567917,
 217208764195819861,
 4235693377656213108,
 916839831153708126,
 1426241881816094654,
 7534985591088335987,
 2391039980018424978,
 1755607441054582137,
 80958523570010555,
 3453963973976592528,
 3444764689234781918,
 7240090456049165417,
 1802961404368128531,
 7552870437360590821,
 3514113807094977722,
 7737348652733352322,
 8529577928172344956,
 5022646214748735876,
 5397421658181970872,
 626934752699266183,
 1044467853625099839,
 3322351381011335790,
 6945944578885432904,
 3382281125155660869,
 863855012668260540,
 6857784436429878445,
 6853642322140815386,
 5777376051033418618,
 5995977603596872012,
 175139370713921222,
 8420938487977262328,
 4001577265037386125,
 5075672656267947418,
 6017740127847509913,
 3299546918212252298,
 6191431714463699517,
 6166407559168000390,
 346062195302431477,
 8036336617022688206,
 6817319572438766341,
 4627698006712985403,
 8377201126969110825,
 7259191060769160805,
 2056649048651349082,
 375578807808788687,
 2958714898406832841,
 6888229484609913653,
 7133515764276214703,
 7759376189177788298,
 51078901260117593,
 4043451913990355083,
 934007775207699512,
 784865651125640705,
 689154991320046746,
 3583643616592343673,
 7623118146907918570,
 2191459293737538755,
 3493811924644018092,
 4340581176569788417,
 6669193273876458803,
 1110896155120934852,
 2664251778992193918,
 9042318653803239039,
 7304175668450257182,
 2451709971826885152,
 731093651646879622,
 3809364837620116089,
 3078650355523867432,
 6194835849195592223,
 4222410693437267960,
 6752579264606989757,
 6713602522619131191,
 5816289132615219593,
 3844996967009380096,
 4807704234455985232,
 7491260355484014001,
 7917934912290877422,
 5856305377951208909,
 7226720158127039572,
 7485950416208479438,
 5510996512847650643,
 3663291270136112912,
 1763234204193593145,
 1904650264626572363,
 184441571956328947,
 8162793946935005204,
 7377388476694839836,
 6702787959820350815,
 1203458062393959164,
 8742586081622553713,
 6203916031538615993,
 8855118873665049688,
 8415324821149653065,
 6336725403382175656,
 2500124605531025107,
 2819500441650310333,
 4336603908765108826,
 6055884364748871593,
 7666101191331672688,
 5144400385975959807,
 2194900799826633240,
 4286084469604846869,
 666058456191421549,
 2390375804834306115,
 7723866601686652745,
 6618793602854979814,
 7521107799334742020,
 7252846004949488535,
 5704598557533990208,
 8212606476487920439,
 4086924768271071570,
 1934820574417048407,
 32131536909829030,
 5180016769043519762,
 4877161838950700944,
 2176177620330762377,
 7638964351338328505,
 2126522902876821289,
 4043406323077089983,
 4358497452144799972,
 3990271807424232955,
 4882051544540389942,
 1097963955552992526,
 1319844410132864332,
 5527716311808271810,
 3455659692720637390,
 8790010330392108982,
 8686446138620178134,
 1194587399476440217,
 6431751553147691823,
 5408209889381861058,
 446020871492954696,
 1414726611892308127,
 2783631019518578088,
 1835021996251252097,
 6031379944093363239,
 3722204315779155652,
 8951411654526979325,
 3442485912315202630,
 1633561929294530840,
 3163827168022400125,
 7572795345577928939,
 8155357846588359356,
 417948724994764160,
 7180866956895127482,
 7606256489676361357,
 5788632145096365701,
 5691247363724041873,
 6340307811102438090,
 8110163251586006758,
 3670880358399219515,
 2882924403918584279,
 8551304834996164071,
 5964309771035274630,
 7111401858893279918,
 7372738292332898844,
 1355516522578982861,
 430485606768201570,
 2157460938665527156,
 7738145294063220121,
 7055060642607177836,
 195951501653378850,
 1452172630155482926,
 1066982802038964480,
 2380401036382252064,
 919386348696908415,
 1599801003111158560,
 7968994188742037558,
 6518516812680770359,
 7751439266407244878,
 596338584496047846,
 6116464130294403607,
 7042152873032759236,
 7533406193277938189,
 8878573080913900043,
 6270589772265885738,
 7644761080823176046,
 8224250491819188800,
 1926043475802601010,
 5837760886918767127,
 8030362956487106589,
 7437717400582892123,
 8740589539890522925,
 7664846897629303621,
 5510275687753597328,
 8517528459739984686,
 4703081243737778270,
 4004591705931404182,
 1047358240453682914,
 7857002844406318302,
 8256915806052729557,
 2677625909816111756,
 8108897223490300436,
 6254218025818064974,
 1978381084040678038,
 7225488958072414056,
 4333249048018318391,
 5801359697252720810,
 3034063235370171931,
 4857087115411142922,
 4417916511688687613,
 3425905933324520428,
 990229035205428080,
 5025956312643488924,
 8604753852490386123,
 7627124484909605677,
 5307812315857857498,
 8281485486574148605,
 1397725489131598078,
 2591399005740119082,
 3063248032210795834,
 2254821219381623694,
 8682440863173047512,
 8185899898731304456,
 5326768469896924337,
 7878253570769883982,
 2525867977967066505,
 8251111755333577908,
 2896702740085057040,
 821491663746261171,
 67153306548697605,
 8671477213618466470,
 7536911632465209937,
 8618336374025883585,
 7682434851425118819,
 3609155040662138444,
 787019918811618763,
 772246766314118047,
 971715444044470678,
 6463833207814436303,
 7873766874640172744,
 6203919060210406808,
 1861754747503374810,
 6921873124105213121,
 6279514806004087328,
 5010980729198725619,
 4415839030454055631,
 8681724991928604124,
 6117045252034262016,
 8863054179610410577,
 9182444363400946277,
 8898608114522007125,
 2458327509065800490,
 9037775697420265211,
 9187094690649224379,
 5545961230737458678,
 3155801589468317400,
 8886131408880611316,
 205567504954789840,
 2364106345024493331,
 9018466604094005093,
 2765695236497056545,
 3495939723068664022,
 470371526830309747,
 3317086698046307220,
 2823790631517317353,
 2965723328108616215,
 2558184766965628899,
 8139584978936500733,
 7638098162558108811,
 4280428108846588501,
 8985399978029402083,
 4212590716303110107,
 5677936338071938199,
 3069627736123337921,
 7310645527146535058,
 5509400195680108507,
 3812956721715932844,
 7724422949486786548,
 1886989438564499653,
 316545280028060457,
 4941099502335148638,
 8726413942744351490,
 887025819745007759,
 8224196328793260986,
 1390156457112470704,
 3163486935350844107,
 7547288962150307533,
 4599628364049201812,
 5924419344399155002,
 476584551711843059,
 3936804896547636799,
 6519468181324065430,
 4180967379477432277,
 1011982461038688201,
 2334662598911346748,
 3086139459284928882,
 5892436657040909299,
 5462686011002931011,
 8379429431611543492,
 3831448114075113519,
 1387165560212667162,
 6671123045946650915,
 3024197421055968640,
 5932841297723175385,
 8274231271397262591,
 7559713240699306882,
 9168350852884069895,
 2426020021473513274,
 6200904144834376698,
 8839027056358412603,
 7964402433748787296,
 8289578771469198941,
 1533994412892774248,
 5172124289118220039,
 8314293504792784061,
 2037546186463620828,
 5834318767027554059,
 2415188714574108464,
 1576055881633157090,
 2569440404673056428,
 7826656187815202969,
 4022914497222644222,
 2075008807303354325,
 7841679885317648012,
 7160091177059325015,
 5786852324768365446,
 3351264438703766360,
 3617783164572524320,
 6288445842539248947,
 8809171516767331956,
 6040191502175430751,
 8485802187894064416,
 1056436874470887468,
 8420911729771828510,
 3355425170283983440,
 2300264632290823791,
 496872847888674236,
 3620710510805782942,
 1000608915523616548,
 2080457149997549644,
 5774114902579549523,
 4643186033831435634,
 7275315765924178906,
 6956382183518365612,
 7094032985114029887,
 6241001729771846777,
 2372212476992240858,
 4939283755029305754,
 5750624843460840227,
 7868651256196435720,
 742351075583407193,
 4672591758778297334,
 5868681301281774423,
 6414331743009155825,
 8369105451845357811,
 8470902720347665438,
 8497782881180506913,
 3987249516670813491,
 4917023139263453733,
 1155270440642450625,
 1784088129523134891,
 1637183525585696535,
 8981353384232386376,
 3642568430494444728,
 4400195465524878466,
 4980045449118619005,
 439840184018318461,
 2426669476115918968,
 4502974168843754195,
 3682203984385477559,
 3250609420089827943,
 573618694510478945,
 260777406136889478,
 7939747556084232140,
 2475505403190905570,
 1152127394159350271,
 7970558449771374202,
 7708635796013762973,
 7007003297763794436,
 2658624974683477729,
 2485641541735752193,
 4706734030984523872,
 3746015443593251752,
 4114325253605643942,
 5558078826686084873,
 6471978249167861975,
 7196985028038498695,
 8934440297891246141,
 867539764377766519,
 8177272759144174786,
 6087347791668573225,
 3416181679258622505,
 2886628960044364745,
 4050125088571572193,
 8842868973507070873,
 6279438710893767649,
 7303879258190220659,
 8829156529073249712,
 6421557481015958273,
 3298560110921736687,
 2154133037313861585,
 4139601392661009783,
 4855320091095146156,
 4406813263279372508,
 5598134942196237435,
 2463410628813581862,
 6852550964038408882,
 2431455045762717569,
 6841693852773276907,
 5086454081285953956,
 5551578444617739349,
 7717030435243114250,
 4668832298353666743,
 3613025750924969400,
 4827775328614831155,
 1287843092903530435,
 5517827981829293782,
 298114340872823027,
 3855778861450874791,
 424858719121196997,
 7063180959056527272,
 6823759645951222686,
 1413091616777106146,
 2858792115038680508,
 3783170467492955597,
 8428777209401822375,
 6713384618608633696,
 4846621154567119913,
 4727437875646068398,
 1306039002894133978,
 7122410673383697031,
 6511166396888043920,
 1335983822544135331,
 7971505700009454369,
 1830149746426439847,
 4390837918625134446,
 8395621917387012808,
 4002214050220035919,
 4781569103211458267,
 5806050998431904024,
 2934598532887670123,
 2039979342104336040,
 2300292234025330845,
 8845318272622873556,
 4903282557313240250,
 4693745759546859515,
 7421562291363261516,
 7267375968060049678,
 8776491384529524344,
 6653592794823360382,
 4715449382983361650,
 3430716515170011775,
 865598932110304921,
 2382232476374805470,
 2516735914917326476,
 7280001668765324674,
 287920515800375800,
 4060672916668475018,
 6217295124800833842,
 229242238498134737,
 5751015140771966473,
 2265504054110070262,
 120464421886668120,
 4328861915111898325,
 5784203384430915788,
 4869735332644755950,
 3776850235681345880,
 5642087445710995915,
 8689033606257485077,
 4986758584945043091,
 8523590611747568369,
 6124233129910171828,
 3043287401310870279,
 4483078175460037818,
 625449045723710637,
 4407188961960390866,
 4020791024033346072,
 4899590676214355127,
 1501430434311491538,
 7092770789375305513,
 2882506928148115110,
 2011152258955517776,
 4608232147513646349,
 7400903796965575351,
 1373199082330812873,
 6054246013759836289,
 6129377181590250548,
 5951389957997725523,
 3351614451709259754,
 5986169048870188421,
 273012261803866429,
 2013880302311166256,
 62853461369047667,
 4491384517431481161,
 1640129893004506946,
 1496571865153692716,
 6119965954948365960,
 8419917136735884488,
 6936517579657223951,
 8652127031019540255,
 4169814227533210428,
 7548033668893710883,
 4330204663503526606,
 1556688949284346937,
 791585118237377148,
 8101588829907918400,
 4284716907364183621,
 3699634849210177019,
 7403794962485129716,
 3557017219662831140,
 5302119963407980855,
 8325194800804806038,
 4581794435043716955,
 6874363509598691386,
 4086772078235573322,
 1713241506527703337,
 7692637175775763691,
 7493492404943507506,
 1173352039090291667,
 8645506249912100546,
 8746576957702894468,
 2391591272127366891,
 3895296976043581348,
 997959739029313547,
 1703702289125317854,
 2531917347516408515,
 6979088364342662031,
 370831337408248129,
 1431571411773565676,
 2419517597868248322,
 4164610646466855086,
 115925259125407295,
 4718002964983105693,
 3867508215542095930,
 3275811698770626238,
 3671983803504319718,
 6011030719633223084,
 267043891540659265,
 402711886080464802,
 7340109038762980814,
 7027449428163177910,
 5113015789596468689,
 1355368547264388995,
 715803049899679304,
 1456479928857490458,
 2674429336791405101,
 3579996687545345346,
 1896534334106476174,
 557622585227857656,
 1478347681767261393,
 7188522580408167660,
 4553145270890751340,
 1623784750227488837,
 782163553814306342,
 4050331896361332537,
 4923844390335546332,
 7613906970022345231,
 6400775428226321371,
 1795271977194910088,
 9111985047183779356,
 1616856457908967248,
 6629734925182446179,
 3316969965060175629,
 4911304447140623726,
 1955131698085478194,
 7113552983611206319,
 104485797972356373,
 7696284448973569805,
 7505376122915623431,
 8347117036139943300,
 3783993475053044166,
 7824899892346910615,
 8657779206582872594,
 3187166762535343300,
 4647441570935782596,
 6122283482359719876,
 1477155831401052035,
 2984099745167047759,
 5300905056036505980,
 7746876408024632391,
 ...]
In [87]:
#ckeck
len(users_247_raw_ID)
Out[87]:
2513
In [88]:
#ckeck up
len(users_247_ID)
Out[88]:
2484
In [89]:
#setting a df with visitors from group 247 equal to group 246

#users_247 = df_complete.query('user_id == @users_247_ID')
users_247 = df_complete[df_complete['user_id'].isin(users_247_ID)]
users_247.sample()
Out[89]:
event_name user_id timestamp experiment_number event_date event_time
155171 MainScreenAppear 7712419871006855851 2019-08-05 11:48:53 247 2019-08-05 11:48:53
In [90]:
users_246 = df_complete.query('experiment_number == 246')
display(users_246.sample())
event_name user_id timestamp experiment_number event_date event_time
182905 OffersScreenAppear 2083934669020211807 2019-08-06 06:59:55 246 2019-08-06 06:59:55
In [91]:
#combining the 2 data sets: joining the two tables with 246 experiment visitors and 247 experiment visitors together
frames = [users_246, users_247]
users_246_247 = pd.concat(frames)
users_246_247.head()
Out[91]:
event_name user_id timestamp experiment_number event_date event_time
2826 Tutorial 3737462046622621720 2019-08-01 00:07:28 246 2019-08-01 00:07:28
2827 MainScreenAppear 3737462046622621720 2019-08-01 00:08:00 246 2019-08-01 00:08:00
2828 MainScreenAppear 3737462046622621720 2019-08-01 00:08:55 246 2019-08-01 00:08:55
2829 OffersScreenAppear 3737462046622621720 2019-08-01 00:08:58 246 2019-08-01 00:08:58
2832 OffersScreenAppear 3737462046622621720 2019-08-01 00:10:26 246 2019-08-01 00:10:26
In [92]:
#groping events by experiment groups:
events_246_247 = users_246_247.pivot_table(index='event_name', values='user_id', columns='experiment_number', aggfunc=lambda x: x.nunique()).reset_index()
events_246_247
Out[92]:
experiment_number event_name 246 247
0 CartScreenAppear 1266 1225
1 MainScreenAppear 2450 2447
2 OffersScreenAppear 1542 1506
3 PaymentScreenSuccessful 1200 1145
4 Tutorial 278 279

Now we are going to check if the groups 246 and 247 have any statistically significant difference.

First we are going to run A/A test for groups 246 and 247 to check if the data for A/B test is collected in a correct way.

A/A test

We are going to run a proportion z-test for the analysis as the requierments for it are met:

  • random sampling;
  • two possible outcomes for a sample point (success/falue);
  • at least 10 successes and 10 falures in a sample;
  • the population size is at least 20 times as big as the sample size.
In [93]:
#testing proportions
def check_hypothesis(group1,group2, event, alpha):
   
    successes1=events_246_247[events_246_247.event_name==event][group1].iloc[0]
    successes2=events_246_247[events_246_247.event_name==event][group2].iloc[0]
    
    #for trials we will go back to previous df with pre-aggregated data
    trials1=users_246_247[users_246_247.experiment_number==group1]['user_id'].nunique()
    trials2=users_246_247[users_246_247.experiment_number==group2]['user_id'].nunique()
    
    print( "Success for group", group1, "is",successes1, "for event", event,"\n",
          "Success for group", group2, "is",successes2, "for event", event,"\n",
         "Trials for group", group1, "is",trials1,"\n",
         "Trials for group", group2, "is",trials2,"\n")

The most popular event is MainScreenAppear, we will run the function check_hypothesis for this event first.
As this as an A/A test we are setting alpha to 1%. One of the criteria for a successful A/A test is as follows: "None of the key metrics varies by a statistically significant amount—typically by no more than 1%".

In [94]:
check_hypothesis(246,247, 'MainScreenAppear', alpha=0.01)
Success for group 246 is 2450 for event MainScreenAppear 
 Success for group 247 is 2447 for event MainScreenAppear 
 Trials for group 246 is 2484 
 Trials for group 247 is 2484 

Stating hypothesis.
The null hypothesis (H0) for the test: the proportions are the same.
The alternate hypothesis (H1): the proportions are not the same.

The whole function:

In [95]:
#testing hypothesis:
def check_hypothesis(group1,group2, event, alpha):
    #let's start with successes, using 
    successes1=events_246_247[events_246_247.event_name==event][group1].iloc[0]
    successes2=events_246_247[events_246_247.event_name==event][group2].iloc[0]
    
    #for trials we can go back to previous df with pre-aggregated data
    trials1=users_246_247[users_246_247.experiment_number==group1]['user_id'].nunique()
    trials2=users_246_247[users_246_247.experiment_number==group2]['user_id'].nunique()
    
    #proportion for success in the first group
    p1 = successes1/trials1

   #proportion for success in the second group
    p2 = successes2/trials2

    # proportion in a combined dataset
    p_combined = (successes1 + successes2) / (trials1 + trials2)

  
    difference = p1 - p2
    
    
    z_value = difference / math.sqrt(p_combined * (1 - p_combined) * (1/trials1 + 1/trials2))

  
    distr = st.norm(0, 1) 


    p_value = (1 - distr.cdf(abs(z_value))) * 2

    print('p-value: ', p_value)

    if (p_value < alpha):
        print("Reject H0 for",event, 'and groups',group1, 'and' ,group2)
    else:
        print("Fail to Reject H0 for", event,'and groups',group1,'and' ,group2)  

Now we are going to run the function for the two groups comparing all the events.

In [96]:
for i in events_246_247.event_name.unique():
    check_hypothesis(246,247, i, alpha=0.01)
p-value:  0.24467189509416087
Fail to Reject H0 for CartScreenAppear and groups 246 and 247
p-value:  0.7198896906395109
Fail to Reject H0 for MainScreenAppear and groups 246 and 247
p-value:  0.29422290793569195
Fail to Reject H0 for OffersScreenAppear and groups 246 and 247
p-value:  0.11803233342550512
Fail to Reject H0 for PaymentScreenSuccessful and groups 246 and 247
p-value:  0.9641335503925432
Fail to Reject H0 for Tutorial and groups 246 and 247

The data does not provide sufficient evidence, given the significance level we selected (1%), to reject the null hypothesis. Therefore, we can conclude that average number of users performed every event does not differ (μ1 = μ2).
There is no statistically significant difference in results for A/A test (experiment groups 246 and 247). Thus we can prove that:

  • The results are not affected by anomalies or outliers in the statistical population
  • The tool for splitting traffic works correctly
  • Data is sent to analytical systems correctly

    Concequently we can move to analysis of A/B test results.

Do the same thing for the group with altered fonts. Compare the results with those of each of the control groups for each event in isolation. Compare the results with the combined results for the control groups. What conclusions can you draw from the experiment?

A/A/B test

Our samples size is big enough so we can run both t-test and z-test. Thus firstly we are going to use z-test and then check the calculations with t-test.

Proportion Z-test

Now we are going to form a general table for all 3 experiments with equal number of users.

In [97]:
#picking up the same number of user IDs as for the previous groups (2484)
users_248_raw_ID = df_complete[df_complete.experiment_number==248]['user_id'].unique().tolist()
users_248_ID = random.sample(users_248_raw_ID, k=2484)
In [98]:
#setting a df with requered visitors IDs (group 248 = group 247 =group 246)

users_248 = df_complete[df_complete['user_id'].isin(users_248_ID)]
users_248.sample()
Out[98]:
event_name user_id timestamp experiment_number event_date event_time
112905 PaymentScreenSuccessful 947773872779189288 2019-08-04 07:41:27 248 2019-08-04 07:41:27
In [99]:
#combining the 2 data sets: joining the two tables with all 3 experimrnt together (246, 247 and 248):
frames = [users_246_247, users_248]
users_246_247_248 = pd.concat(frames)
users_246_247_248.head()
Out[99]:
event_name user_id timestamp experiment_number event_date event_time
2826 Tutorial 3737462046622621720 2019-08-01 00:07:28 246 2019-08-01 00:07:28
2827 MainScreenAppear 3737462046622621720 2019-08-01 00:08:00 246 2019-08-01 00:08:00
2828 MainScreenAppear 3737462046622621720 2019-08-01 00:08:55 246 2019-08-01 00:08:55
2829 OffersScreenAppear 3737462046622621720 2019-08-01 00:08:58 246 2019-08-01 00:08:58
2832 OffersScreenAppear 3737462046622621720 2019-08-01 00:10:26 246 2019-08-01 00:10:26
In [100]:
#groping events by experiment groups:
events_246_247_248 = users_246_247_248.pivot_table(index='event_name', values='user_id', columns='experiment_number', aggfunc=lambda x: x.nunique()).reset_index()
events_246_247_248
Out[100]:
experiment_number event_name 246 247 248
0 CartScreenAppear 1266 1225 1200
1 MainScreenAppear 2450 2447 2441
2 OffersScreenAppear 1542 1506 1493
3 PaymentScreenSuccessful 1200 1145 1152
4 Tutorial 278 279 272

We will use the same function to test hypothesis for A/B testing. As we are going to run 3 comparisons (246/247, 246/248, 248/247) we will alter the sinificance level respectively, using the Šidák method. Normally Alpha is set to 0.5 as a commonly accepted level in industry.

Calculating alpha with The Šidák method (guarantees family-wise error rate < ɑ):
the required significance level will be found as follows:

In [101]:
alpha_sidak = 1 - (1 - 0.05)**(1/3) #we have 3 groups which gives us 3 combinations for comparison (246/247, 246/248, 247/248)
round(alpha_sidak,4)
Out[101]:
0.017

Stating hypothesis (H0 states unchanged result):
H0: There is no difference between the groups A and B in number of users who performed the actions/ events
H1: There is a significant difference between the groups A and B in number of users who performed the actions/ events

In [102]:
def check_hypothesis(group1,group2, event, alpha):
    #let's start with successes, using 
    successes1=events_246_247_248[events_246_247_248.event_name==event][group1].iloc[0]
    successes2=events_246_247_248[events_246_247_248.event_name==event][group2].iloc[0]
    
    #for trials we can go back to previous df with pre-aggregated data
    trials1=users_246_247_248[users_246_247_248.experiment_number==group1]['user_id'].nunique()
    trials2=users_246_247_248[users_246_247_248.experiment_number==group2]['user_id'].nunique()
    
    #proportion for success in the first group
    p1 = successes1/trials1

   #proportion for success in the second group
    p2 = successes2/trials2

    # proportion in a combined dataset
    p_combined = (successes1 + successes2) / (trials1 + trials2)

  
    difference = p1 - p2
    
    
    z_value = difference / math.sqrt(p_combined * (1 - p_combined) * (1/trials1 + 1/trials2))

  
    distr = st.norm(0, 1) 


    p_value = (1 - distr.cdf(abs(z_value))) * 2

    print('p-value: ', p_value)

    if (p_value < alpha):
        print("Reject H0 for",event, 'and groups',group1, 'and' ,group2)
    else:
        print("Fail to Reject H0 for", event,'and groups',group1,'and' ,group2)  
In [103]:
#Comparison of events for 247/248
for i in events_246_247_248.event_name.unique():
    check_hypothesis(247,248, i, alpha_sidak)
p-value:  0.47796423586593684
Fail to Reject H0 for CartScreenAppear and groups 247 and 248
p-value:  0.4988580370285636
Fail to Reject H0 for MainScreenAppear and groups 247 and 248
p-value:  0.7061208920679638
Fail to Reject H0 for OffersScreenAppear and groups 247 and 248
p-value:  0.8421127365642418
Fail to Reject H0 for PaymentScreenSuccessful and groups 247 and 248
p-value:  0.7518023684909965
Fail to Reject H0 for Tutorial and groups 247 and 248
In [104]:
#Comparison of events for 246/248
for i in events_246_247_248.event_name.unique():
    check_hypothesis(246,248, i, alpha_sidak)
p-value:  0.061094137726570796
Fail to Reject H0 for CartScreenAppear and groups 246 and 248
p-value:  0.301282451725573
Fail to Reject H0 for MainScreenAppear and groups 246 and 248
p-value:  0.15389481147656614
Fail to Reject H0 for OffersScreenAppear and groups 246 and 248
p-value:  0.17258720509668923
Fail to Reject H0 for PaymentScreenSuccessful and groups 246 and 248
p-value:  0.7861612434359899
Fail to Reject H0 for Tutorial and groups 246 and 248
In [105]:
#Comparison of events for 246/247 (with changed significance level to alpha_sidak)
for i in events_246_247_248.event_name.unique():
    check_hypothesis(246,247, i, alpha_sidak)
p-value:  0.24467189509416087
Fail to Reject H0 for CartScreenAppear and groups 246 and 247
p-value:  0.7198896906395109
Fail to Reject H0 for MainScreenAppear and groups 246 and 247
p-value:  0.29422290793569195
Fail to Reject H0 for OffersScreenAppear and groups 246 and 247
p-value:  0.11803233342550512
Fail to Reject H0 for PaymentScreenSuccessful and groups 246 and 247
p-value:  0.9641335503925432
Fail to Reject H0 for Tutorial and groups 246 and 247

The data does not provide sufficient evidence, given the significance level we selected (1.7%), to reject the null hypothesis. Therefore, we can conclude that average number of users performed each of the event in all 3 groups (246, 247, 248) does not differ (μ1 = μ2 = μ3) at any stage of the funnel. Changing fonts does not have any influence on the users.

T-test

Now we are going to test the difference for every group in general to prove the previous calculations with t-test.

In [106]:
#forming a table
samples = users_246_247_248.pivot_table(index='event_date', values='user_id', columns='experiment_number', aggfunc=lambda x: x.nunique()).reset_index()
samples.columns = ['date', '246', '247', '248']


samples
Out[106]:
date 246 247 248
0 2019-08-01 1191 1269 1207
1 2019-08-02 1197 1209 1195
2 2019-08-03 1140 1164 1141
3 2019-08-04 1170 1155 1161
4 2019-08-05 1269 1246 1243
5 2019-08-06 1270 1303 1300
6 2019-08-07 1198 1191 1207

Creating varables with the samples for each group.

In [107]:
sample_246 = samples.loc[:, samples.columns == '246'].values.tolist()
sample_246 = [i[0] for i in sample_246]
sample_246
Out[107]:
[1191, 1197, 1140, 1170, 1269, 1270, 1198]
In [108]:
results = st.shapiro(sample_246)
results
Out[108]:
ShapiroResult(statistic=0.8907630443572998, pvalue=0.2786675989627838)
In [109]:
sample_247 = samples.loc[:, samples.columns == '247'].values.tolist()
sample_247 = [i[0] for i in sample_247]
sample_247
Out[109]:
[1269, 1209, 1164, 1155, 1246, 1303, 1191]
In [110]:
sample_248 = samples.loc[:, samples.columns == '248'].values.tolist()
sample_248 = [i[0] for i in sample_248]
sample_248
Out[110]:
[1207, 1195, 1141, 1161, 1243, 1300, 1207]

Next we are going to define Variance for the 3 samples to define whether or not we can consider them equal for t-test.

In [111]:
variance_246 = np.var(sample_246)
print('Variance for 246 sample is ', variance_246.round(1))
Variance for 246 sample is  2011.4
In [112]:
variance_248 = np.var(sample_248)
print('Variance for 248 sample is ', variance_248.round(1))
Variance for 248 sample is  2365.3
In [113]:
variance_247 = np.var(sample_247)
print('Variance for 247 sample is ', variance_247.round(1))
Variance for 247 sample is  2612.5

The variances of the samples are not equal.
As we have samples of continuous data, the samples are drawn from a normally distributed data with close values of variances we will conduct an unpaired t-test to test the stated hypothesis (in Python it will be unpaired two-tailed T-test with 'equal_var = True').

Stating hypothesis (H0 states unchanged result):
H0: There is no difference between the two groups in number of users who performed the actions/ events
H1: There is a statistically significant difference between the two groups in number of users who performed the actions/ events

In [114]:
# "equal_var = False" as previous calculations proved that the 3 samples have different variance value.
results_246_247 = st.ttest_ind(
    sample_246, 
    sample_247, equal_var = False)

results_247_248 = st.ttest_ind(
    sample_247, 
    sample_248, equal_var = False)

results_246_248 = st.ttest_ind(
    sample_246, 
    sample_248, equal_var = False)

#we are going to use alpha_sidak here as there were 3 comparisons made

print('p-value to compare the groups 246 and 247: ', results_246_247.pvalue)
print('p-value to compare the groups 247 and 248: ', results_247_248.pvalue)
print('p-value to compare the groups 246 and 248: ', results_246_248.pvalue)

if (results_246_247.pvalue < alpha_sidak):
    print("Null hypothesis rejected for groups 246 and 247")
else:
    print("Null hypothesis not rejected for groups 246 and 247")

if (results_247_248.pvalue < alpha_sidak):
    print("Null hypothesis rejected for groups 247 and 248")
else:
    print("Null hypothesis not rejected for groups 247 and 248")

if (results_246_248.pvalue < alpha_sidak):
    print("Null hypothesis rejected for groups 246 and 248")
else:
    print("Null hypothesis not rejected for groups 246 and 248") 
p-value to compare the groups 246 and 247:  0.609377518272799
p-value to compare the groups 247 and 248:  0.6878673090111376
p-value to compare the groups 246 and 248:  0.9216198375359732
Null hypothesis not rejected for groups 246 and 247
Null hypothesis not rejected for groups 247 and 248
Null hypothesis not rejected for groups 246 and 248

Conclusion.

The data does not provide sufficient evidence, given the significance level we selected (1.7%), to reject the null hypothesis. Therefore, we can conclude that average number of users performed each of the event in all 3 groups (246, 247, 248) does not differ (μ1 = μ2 = μ3).

All the tests (proportions z-test and t-test) run for 3 groups of experiment, with comparison of events and without spitting by events, prove that there is no statistically significant difference between the groups. Changing fonts have no impact on users.

General Conclusion

In course of data preprocessing we studied data with 244126 entries; 5 event types and 3 experimet groups for 7551 unique users.
We have:

  • converted it to a category type column with event names;
  • 'EventTimestamp': converted unix time format to datetime and created two separate columns with date of event and time of event;
  • dropped 413 duplicates (0.17%);
  • renamed the columns using clear names without upper case letters.

There are 5 types of events in the data set: almost half of the logs was on the MainScreen and the least popular type is Tutorial with only 0.4% of logs.
Number events per user vary greatly, from 1 to 2307; middle 50% of users had from 9 to 37 events; the average number of events per user is 20.
There are logs from 2019-07-25 to 2019-08-07 in the data set, but study of data distribution shows that the logs data is incomplete till 2019-07-31 inclusively. Thus we had to drop the data on logs till 01/08/2019 having lost 16 % of entries (2826 entries) and 19% of unique users.
Some technical issues with users' logs could account for that.
The filtered data still has users from all three experimental groups.

There are 4 events forming the funnel, the most frequent one is MainScreen(49% of logs) , it is followd by OfferScreen and CartScreen(19% and 18%), PaymentScreen(14%). The least popular event (Tutorial event with 0.4% of logs) is not a necessary step of the funnel and we do not include it in funnel calculations.
There are only 466 unique users that went through all of the events ( 6.2% of all users). If we omite Tutorial event as the least popular one, the number of users that performed all the events will be much higher, 3429 users.
As to events that have been performed at least once MainScreen is the leader, 37% of users viewd this screen at least once. Then go the Offer, Payment and Cart events with almost equal shares (18-23%) and Tutorial event is the least popular with only 4% of users.
84% of all users that got to the Tutorial screen viewed it once. That means that only 16% of users came back to it. To the contrary, only 6% of users that got to the MainScreen never came back, all the others (94%) viewed this screen more than once. OfferScreen, PaymentScreen and CartScreen were viewed once by 9-10% of all users, thus 90-91% of users came back to this screens more than once.

The funnel is calculated for the following sequence of events MainScreenAppear → OffersScreenAppear → CartScreenAppear → PaymentScreenSuccessful as Tutorial obviously is not part of the main funnel.
We have got the following numbers: MainScreen visitors: 7419 and paid: 454. This way we got 6% conversion rate for this funnel. Other users might get converted going through another steps, because general number of users that got to Payment page is higher (3539 unique users).
The biggest share of users got lost on the stage of payment: only 26% of users that put the products to the cart eventually pay for them. This stage should be studed more thoroughly.
We can suggest the following possible issues to improve:

  • add more payment options (MasterCard/Visa/PayPal/payment from bank account, etc.);
  • add secure certificate to the payment page;
  • add special offer urging people to pay immediately (eg. discount, fast delivery options, courier tracking for purchase within 1 hour after adding to cart 1st product, etc.).

We have run an A/A/B test on 3 groups. Firstly we have proved that the test was designed correctly and there is no statistically significant difference between A/A groups (246 and 247).
Then we ran the same test comparing group 248 with 246 and 247. We randomly picked up the same sample size for each group and corrected significance level with Sidak method as we run multiple comparisons.
All tests proved that with given significance level we cannot prove that there is a difference between the tested groups. The samples did not provide sufficient evidence to conclude that the effect from altering fonts exists.

In [ ]: