The aim of the project is to investigate user behavior in the app of a company (a startup that sells food products).
First we will preprocess the data and study the sales funnel.
Next we will check the results of an A/A/B test: whether changing the fonts for the entire app has any impact on users. We will pick an equal number of users for all 3 groups and run a statistical test to check whether there is a statistically significant difference in the number of users for any of the events.
Part 1. Data preprocessing
Part 2. Data analysis
How many events are in the logs
How many users are in the logs
What's the average number of events per user
What period of time does the data cover.
Did you lose many events and users when excluding the older data
Part 3. Event funnel
What events are in the logs and their frequency of occurrence
Number of users who performed each of these actions. Proportion of users who performed the action at least once.
In what order actions took place. Are all of them part of a single sequence?
Event funnel and share of users that proceed from each stage to the next. Ratio of users at stage B to the number of users at stage A and the ratio of users at stage C to the number at stage B.
Part 4. Study of the results of experiment
How many users are there in each group
A/A test. Checking a statistically significant difference between samples 246 and 247
A/A/B test
General Conclusion
import pandas as pd
from IPython.display import display
from datetime import datetime
import plotly.express as px
from plotly import graph_objects as go
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats as st
import seaborn as sns
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")
import math
#adjustments of the path for the Practicum platform:
df = pd.read_csv('./logs_exp_us.csv', sep=r'\s+')  # equivalent to delim_whitespace=True, which is deprecated in newer pandas
display(df.head())
display(df.sample(6))
df.info(memory_usage='deep')
#checking whether the object-type column can be converted to the category type
df['EventName'].value_counts()
df['EventName'] = df['EventName'].astype('category')
#timestamp column needs to be converted, getting rid of unix time format
df['EventTimestamp'] = pd.to_datetime(df['EventTimestamp'],unit='s')
display(df['EventTimestamp'])
#re-checking memory usage
df.info(memory_usage='deep')
The memory usage has dropped significantly: from 16.4 MB to 5.8 MB.
df['event_date'] = df['EventTimestamp'].dt.date
df['event_time'] = df['EventTimestamp'].dt.time
df.sample(6)
df['event_date'] = pd.to_datetime(df['event_date'], format='%Y-%m-%d')
#df['event_time'] = pd.to_datetime(df['event_time'], format='%H:%M:%S')
#renaming columns
df.columns = ['event_name' ,'user_id', 'timestamp', 'experiment_number', 'event_date', 'event_time']
df.sample(6)
#Checking for duplicated data
df.duplicated().sum()
#checking what kind of duplicates we have
duplicated_rows = df[df.duplicated()]
duplicated_rows.sample(15)
#checking the dates when the data was duplicated:
duplicated_rows['timestamp'].dt.date.unique()
So far we cannot see any pattern in the duplicates. There is data on different events, dates (from 30.07.2019 till 07.08.2019), and all experiment groups.
#checking the scope of the duplicated data:
for i in duplicated_rows:
    print(i, ":", duplicated_rows[i].nunique())
The server duplicated data across all 5 events, for 237 unique users, at 352 timestamps, in all 3 experiment groups.
The duplicated data is only 0.17% of the whole dataset (413/244126), thus we can drop it without skewing the analysis.
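The 0.17% share quoted above is straightforward arithmetic; a quick sketch with the counts reported in this section:

```python
# share of duplicated rows, using the counts reported above
n_duplicates = 413
n_total = 244126

share = n_duplicates / n_total
print(f"{share:.2%}")  # about 0.17%
```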
#getting rid of duplicated data
df = df.drop_duplicates().reset_index(drop = True)
#Checking data for zeros:
for i in df.columns:
    print(i, len(df[df[i]==0]))
# checking the number of experiments in every group:
df['experiment_number'].value_counts()
#statistical summary for categorical variables
df.describe(include=['category'])
#Identifying the number of unique users:
df['user_id'].value_counts().count()
df['user_id'].nunique()
df.shape
The data for analysis contains 244126 entries in 4 columns: data on 5 event types for 7551 unique users.
There are 5 unique EventName values, originally stored as the object type. For the sake of memory usage we have converted the column to the category type. "MainScreenAppear" is the most frequent category (119205 out of 244126).
Another type conversion has been performed on EventTimestamp: we have converted the unix time format to datetime and created two separate columns with the date and the time of the event.
Changing the data types has allowed us to improve the memory usage (it has dropped significantly: from 16.4 MB to 5.8 MB).
There were 413 duplicates (0.17%) in the data set; they have been successfully dropped.
No missing values have been found in the data set.
We have renamed the columns using clear names without upper-case letters.
The resulting data set has 6 columns and 243713 rows.
As we figured out in the previous part, there are 5 types of events in the data/funnel. Below is a graphical interpretation of the data.
#creating a table
df_events = df.groupby(['event_name']).agg({'user_id': 'count'}).reset_index()
#renaming columns
df_events.columns = ['event_name' ,'number']
df_events['event_name'] = df_events['event_name'].astype(str)
df_events.sort_values(by='number', ascending=False)
plt.figure(figsize=(12,5))
# plot bar chart
ax = sns.barplot(x=df_events.event_name, y=df_events.number, palette='mako')
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.set(xlabel="events", ylabel='users')
ax.set_xticklabels(df_events.event_name)
for item in ax.get_xticklabels(): item.set_rotation(0)
for i, v in enumerate(df_events["number"].items()):
    ax.text(i, v[1], "{:,}".format(v[1]), color='k', va='bottom', rotation=0)
plt.title('Number of different types of events')
plt.show()
#pie chart for types of events
colors = ['gold', 'mediumturquoise', 'darkorange', 'lightgreen', 'blue']
labels = df_events['event_name']
values = df_events['number']
fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.update_layout(title_text='Shares of different types of events')
fig.update_traces(hoverinfo='label+percent', textfont_size=20,
marker=dict(colors=colors, line=dict(color='#000000', width=2)))
fig.show()
We can see that almost half of the events are MainScreen; OfferScreen and CartScreen have almost equal shares of 19-17.5%. The least frequent event is watching tutorials: either they do not have any value for the customers, or they are hard to find.
df['user_id'].nunique()
print('There are', df['user_id'].nunique(), 'unique users in the data set.')
#calculating events per user
df_events_users = df.groupby(['user_id']).agg({'event_name': 'count'}).sort_values(by='event_name', ascending=False).reset_index()
df_events_users
#checking up how many unique events the unique users performed:
unique_events_per_users = df.groupby(['user_id'])['event_name'].nunique().reset_index().groupby(['event_name'])['user_id'].nunique()
unique_events_per_users
Only 471 users had all 5 events.
df_events_users[['event_name']].describe()
#plotting the distribution
plt.figure(figsize=(9,5))
sns.histplot(df_events_users['event_name'], kde=True)  # sns.distplot is deprecated
plt.title('number of events per user')
plt.show()
# box-and-whisker plot
plt.figure(figsize=(9,5))
ax = sns.boxplot(x="event_name", data=df_events_users)
#Calculating the 95th and 99th percentiles for the number of events (setting outlier thresholds).
np.percentile(df_events_users['event_name'], [95, 99])
The number of events per user varies from 1 to 2307. The middle 50% of users had from 9 to 37 events.
No more than 5% of users had over 89 events, and only 1% of users had over 200 events. Thus users with over 200 events are outliers for our distribution.
The mean value is 32 events, the median value is 20. As we have a positively skewed distribution with a lot of outliers to the right, we will take the median as the average value for the distribution. Thus the average number of events per user is 20.
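The choice of the median over the mean can be illustrated on synthetic data (a sketch: the lognormal sample below is illustrative, not the project's data, with parameters picked so the skew resembles our distribution):

```python
import numpy as np

rng = np.random.default_rng(42)
# a positively skewed sample: many small values and a long right tail
sample = rng.lognormal(mean=3, sigma=1, size=10_000)

# the mean is pulled up by the tail; the median stays near the bulk of the users
print(f"mean:   {sample.mean():.1f}")
print(f"median: {np.median(sample):.1f}")
```

For a lognormal sample the mean systematically exceeds the median, which is the same pattern we see in the events-per-user distribution (32 vs 20).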
df[['event_date', 'event_time']].describe()
df['timestamp'].min()
df['timestamp'].max()
The data covers a period of 14 days, from 2019-07-25 04:43:36 to 2019-08-07 21:15:17.
dates_views = df.groupby(['event_date']).agg({'user_id':'count'}).reset_index()
dates_views.columns = ['event_date' ,'events_number']
dates_views.reset_index(drop=True)
dates_views.plot(x='event_date', y='events_number', kind='bar', grid=True, figsize=(20,5), cmap='Set2', title='Number of Events per Date')
plt.show()
fig = px.area(dates_views, x='event_date', y="events_number", title='Number of Events per Date')
fig.show()
#slicing the data
df_complete = df.query('event_date>="2019-08-01"')
df_complete['timestamp'].min()
df_complete['timestamp'].max()
Despite the fact that we have data starting from 25/07/2019, it is clear that the data is incomplete till 01/08/2019. That might have happened due to some technical reasons, so we have to slice the dataset accordingly. The actual data covers a period of one week, from 2019-08-01 00:07:28 till 2019-08-07 21:15:17.
df_excluded = df.query('event_date<"2019-08-01"')
df_excluded.info()
df_excluded['user_id'].nunique()
In the course of slicing we have lost 2826 entries (1.16% of the dataset) and 1451 unique users (19% of all users).
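Both percentages follow directly from the counts (a sketch with the figures reported above):

```python
# entries and users lost when slicing off the data before 2019-08-01
total_entries, lost_entries = 243713, 2826
total_users, lost_users = 7551, 1451

print(f"entries lost: {lost_entries / total_entries:.2%}")
print(f"users lost:   {lost_users / total_users:.2%}")
```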
#checking the groups:
df_complete.experiment_number.value_counts()
#checking the events:
df_complete.event_name.unique()
We still have users from all 3 experiment groups, as well as all the events, in the filtered data set.
There are 5 types of events in the data set: almost half of the logs are on the MainScreen, followed by OfferScreen and CartScreen with slightly less than 20% each; PaymentScreen got 14% of the logs, and the least popular type is Tutorial with only 0.4% of the logs.
The number of events per user varies greatly, from 1 to 2307; the middle 50% of users had from 9 to 37 events; the average number of events per user is 20.
There are logs from 2019-07-25 to 2019-08-07 in the data set, but study of the data distribution shows that the logs are incomplete till 2019-07-31 inclusive. Thus we had to drop the logs before 01/08/2019. Due to the slice we have lost 1.16% of entries (2826 entries) and 19% of unique users (1451 user IDs out of 7551 unique users).
This shows that the data on unique users was not evenly distributed. Some technical issues with users' logs could account for that.
The filtered data still has users from all three experimental groups.
[back to top](#top)
[previous part](#7)
We will define frequency of every event type.
events = df_complete.groupby('event_name').agg({'user_id': ['count']}).reset_index()
events.columns = ['event_name','frequency'] #renaming columns
events = events.sort_values(by='frequency', ascending=False)
events.head(10)
fig, ax = plt.subplots(figsize=(12, 7))
ax.set_title('Frequency of events', fontsize=16)
plt.xlabel('Type of event',fontsize=12)
plt.ylabel('frequency',fontsize=12)
colors = ['lightblue', 'mediumturquoise', 'orange', 'lightgreen', 'plum', 'bisque','lavender',
'lightcyan','palevioletred']
#plot
g = plt.bar(events.event_name, events.frequency,
0.7, label='Events', color=colors[1], alpha=0.8, edgecolor='black',linewidth=2)
#annotation
bar_label = events.frequency.tolist() #values for text
bar_label = [str(label) for label in bar_label]
def autolabel(rects):
    #iterate over the bars passed in (the original looped over the global g instead of rects)
    for idx, rect in enumerate(rects):
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2., 1.02*height,
                bar_label[idx],
                ha='center', va='bottom', rotation=0, size=15)
autolabel(g)
plt.legend()
plt.show()
#pie chart for types of events
colors = ['lightblue', 'mediumturquoise', 'orange', 'lightgreen', 'plum', 'bisque','lavender',
'lightcyan','palevioletred']
labels = events['event_name']
values = events['frequency']
fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.update_layout(title_text='Shares of different types of events')
fig.update_traces(hoverinfo='label+text+value', textfont_size=20,
marker=dict(colors=colors, line=dict(color='#000000', width=2)))
fig.show()
According to our data there are 5 events; the most frequent one is MainScreen, followed by OfferScreen and CartScreen (19% and 18%), PaymentScreen is at 14%, and the least popular is the Tutorial event (0.4%).
First we will group the data by user ID and define the number of actions for every event for each user.
Then we will find the number of users who had each of these events by dropping those who have NaNs in any column.
df_complete['event_name'] = df_complete['event_name'].astype(str)
users_per_events = df_complete.pivot_table(index='user_id',columns='event_name',values='timestamp',aggfunc='count').reset_index()
users_per_events
#Picking up the users that had all the events
users_all_events = users_per_events[users_per_events.notnull().all(axis=1)].reset_index(drop=True)
users_all_events
users_all_events.shape
#picking all the user IDs that performed all the actions and saving them in a separate list
user_id_all_events = users_all_events['user_id'].tolist()
user_id_all_events
#number of unique users that performed all the events besides Tutorial.
users_all_events_without_tutorial =users_per_events.loc[:, users_per_events.columns != 'Tutorial']
users_all_events_without_tutorial = users_all_events_without_tutorial[users_all_events_without_tutorial.notnull().all(axis=1)].reset_index(drop=True)
users_all_events_without_tutorial.shape
This way we have received a table with user IDs that performed all the events, and a list of IDs of those users.
There are 466 users out of 7534 that performed all the events.
#Calculating the number of sessions for each event
sums = users_all_events.select_dtypes(np.number).sum().rename('total')  # np.float was removed in newer numpy
# append the totals row to the data frame (DataFrame.append is deprecated)
events_users_total = pd.concat([users_all_events, sums.to_frame().T])
events_users_total.tail(1)
MainScreenAppear and OffersScreenAppear have almost the same number of sessions; slightly fewer users got to the CartScreen. The Tutorial screen had the fewest sessions.
We will group the dataset by event name and calculate the number of unique users for each event.
#filtering out the data base containing only users that completed all events.
events_all = df_complete[df_complete['user_id'].isin(user_id_all_events)]
events_all.head()
events_unique_users = df_complete.groupby('event_name').agg({'user_id': ['nunique']}).reset_index()
events_unique_users.columns = ['event_name','users_number'] #renaming columns
events_unique_users = events_unique_users.sort_values(by='users_number', ascending=False)
events_unique_users.head(10)
Visualising the calculations.
plt.figure(figsize=(12,5))
# plot bar chart
ax = sns.barplot(x=events_unique_users.event_name, y=events_unique_users.users_number, palette='mako')
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.set(xlabel="events", ylabel='users')
ax.set_xticklabels(events_unique_users.event_name)
for item in ax.get_xticklabels(): item.set_rotation(0)
for i, v in enumerate(events_unique_users["users_number"].items()):
    ax.text(i, v[1], "{:,}".format(v[1]), color='k', va='bottom', rotation=0)
plt.title('Unique users per event')
plt.show()
The bar chart illustrates that the biggest number of unique users performed 'MainScreenAppear'; 'OffersScreenAppear', 'CartScreenAppear' and 'PaymentScreenSuccessful' have approximately the same number of unique users (from 3,500 to 4,500). The least popular event was 'Tutorial' with only 840 unique users.
events_at_least_once = df_complete.groupby('event_name').agg({'user_id': ['nunique', 'count']}).reset_index()
events_at_least_once.columns = ['event_name','users_unique', 'total_users'] #renaming columns
events_at_least_once.head(10)
#calculating the proportion of users who performed the action at least once:
#share of users that performed event at least once in comparison to total number of users that performed the event:
events_at_least_once['users_once_proportion'] =(
events_at_least_once['users_unique']
/ events_at_least_once['total_users']).round(2)
#share of users that performed event at least once per each event
events_at_least_once['users_unique_proportion'] = (
    events_at_least_once['users_unique']
    / df_complete['user_id'].nunique()).round(2)
events_at_least_once = events_at_least_once.sort_values(by='users_once_proportion', ascending=False)
events_at_least_once
Visualising the calculations.
plt.figure(figsize=(12,5))
# plot bar chart
ax = sns.barplot(x=events_at_least_once.event_name, y=events_at_least_once.users_once_proportion, palette='mako')
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.set(xlabel="events", ylabel='Percentage of users')
ax.set_xticklabels(events_at_least_once.event_name)
for item in ax.get_xticklabels(): item.set_rotation(0)
for i, v in enumerate(events_at_least_once["users_once_proportion"].items()):
    ax.text(i, v[1], "{:.1f}%".format(v[1]*100), color='k', va='bottom', rotation=0)
plt.title('Percentage of users that performed the action at least once (to total number of users that performed the event)')
plt.show()
plt.figure(figsize=(12,5))
# plot bar chart
ax = sns.barplot(x=events_at_least_once.event_name, y=events_at_least_once.users_unique_proportion, palette='mako')
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.set(xlabel="events", ylabel='Percentage of users')
ax.set_xticklabels(events_at_least_once.event_name)
for item in ax.get_xticklabels(): item.set_rotation(0)
for i, v in enumerate(events_at_least_once["users_unique_proportion"].items()):
    ax.text(i, v[1], "{:.1f}%".format(v[1]*100), color='k', va='bottom', rotation=0)
plt.title('Percentage of users that performed the action at least once (relative to the total number of unique users)')
plt.show()
Among the events performed at least once, MainScreen is the leader with 37%. Then go the Offer, Payment and Cart events with almost equal shares (18-23%), and the Tutorial event is the least popular with only 4%.
84% of all users that got to the Tutorial screen viewed it only once; only 16% of users came back to it.
To the contrary, only 2% of users that got to the MainScreen never came back; all the others (98%) viewed this screen more than once. OfferScreen, PaymentScreen and CartScreen were viewed once by 50-60% of all users, thus 40-50% of users came back to these screens more than once.
Now we will define the proportion of users who performed each action only once.
First we will count how many users performed the action exactly once, then the total number of users that performed the action, and finally we will compute the proportion.
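The per-event queries below can also be collapsed into one vectorized pass. A sketch on a toy stand-in for users_per_events (event counts per user, NaN where the user never triggered the event; the numbers are made up for illustration):

```python
import numpy as np
import pandas as pd

# toy stand-in for users_per_events
toy = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'CartScreenAppear': [1.0, 2.0, np.nan, 1.0],
    'MainScreenAppear': [3.0, 1.0, 1.0, np.nan],
})
counts = toy.drop(columns='user_id')

once = (counts == 1).sum()      # users with exactly one occurrence, per event
ever = counts.notna().sum()     # users with at least one occurrence, per event
proportion_once = (once / ever).round(2)
print(proportion_once)
```

Applied to the real users_per_events table this yields the same proportions as the step-by-step queries below.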
#coming back to the table of events/sessions per user
users_per_events
#number of users who performed the action CartScreenAppear once
Users_one_CartScreenAppear = users_per_events.query('CartScreenAppear==1.0')['user_id'].count()
Users_one_CartScreenAppear
#total number of users who performed the action CartScreenAppear
Users_all_CartScreenAppear = users_per_events.query('CartScreenAppear>=1.0')['user_id'].count()
Users_all_CartScreenAppear
#number of users who performed the action MainScreenAppear once
Users_one_MainScreenAppear = users_per_events.query('MainScreenAppear==1.0')['user_id'].count()
Users_one_MainScreenAppear
Users_all_MainScreenAppear = users_per_events.query('MainScreenAppear>=1.0')['user_id'].count()
Users_all_MainScreenAppear
#number of users who performed the action OffersScreenAppear once
Users_one_OffersScreenAppear = users_per_events.query('OffersScreenAppear==1.0')['user_id'].count()
Users_one_OffersScreenAppear
Users_all_OffersScreenAppear = users_per_events.query('OffersScreenAppear>=1.0')['user_id'].count()
Users_all_OffersScreenAppear
#number of users who performed the action PaymentScreenSuccessful once
Users_one_PaymentScreenSuccessful = users_per_events.query('PaymentScreenSuccessful==1.0')['user_id'].count()
Users_one_PaymentScreenSuccessful
Users_all_PaymentScreenSuccessful = users_per_events.query('PaymentScreenSuccessful>=1.0')['user_id'].count()
Users_all_PaymentScreenSuccessful
#number of users who performed the action Tutorial once
Users_one_Tutorial = users_per_events.query('Tutorial==1.0')['user_id'].count()
Users_one_Tutorial
#total number of users who performed the action Tutorial
Users_all_Tutorial = users_per_events.query('Tutorial>=1.0')['user_id'].count()
Users_all_Tutorial
Now we will form a table with the received data.
lst = [['MainScreenAppear', Users_one_MainScreenAppear, Users_all_MainScreenAppear],
['OffersScreenAppear', Users_one_OffersScreenAppear, Users_all_OffersScreenAppear],
['CartScreenAppear', Users_one_CartScreenAppear, Users_all_CartScreenAppear],
['PaymentScreenSuccessful', Users_one_PaymentScreenSuccessful, Users_all_PaymentScreenSuccessful],
['Tutorial', Users_one_Tutorial, Users_all_Tutorial]]
users_events_proportion = pd.DataFrame(lst, columns =['events', 'users_one_event', 'all_users_event'])
users_events_proportion
#calculating the proportion of users who performed the action once:
users_events_proportion['users_once_proportion'] =users_events_proportion['users_one_event'] / users_events_proportion['all_users_event']
users_events_proportion['users_once_proportion'] = users_events_proportion['users_once_proportion'].round(2)
users_events_proportion
Visualising the calculations.
plt.figure(figsize=(12,5))
# plot bar chart
ax = sns.barplot(x=users_events_proportion.events, y=users_events_proportion.users_once_proportion, palette='mako')
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.set(xlabel="events", ylabel='Percentage of users')
ax.set_xticklabels(users_events_proportion.events)
for item in ax.get_xticklabels(): item.set_rotation(0)
for i, v in enumerate(users_events_proportion["users_once_proportion"].items()):
    ax.text(i, v[1], "{:.1f}%".format(v[1]*100), color='k', va='bottom', rotation=0)
plt.title('Percentage of users that performed the action only once')
plt.show()
The chart shows that the least popular category is still Tutorial: 90% of users never come back to this page. The most popular page is MainScreen: only 3% of users saw it once and did not come back to it again.
We suppose that for most users the funnel looks as follows: 'MainScreenAppear' → 'OffersScreenAppear' → 'CartScreenAppear' → 'PaymentScreenSuccessful'.
The 'Tutorial' event is an unnecessary step in this process: it can take place at any stage or not take place at all. We will not take it into account when calculating the funnel.
(Calculating the funnel for the sequence of events MainScreenAppear → OffersScreenAppear → CartScreenAppear → PaymentScreenSuccessful and the ratio of users at stage B to the number of users at stage A and the ratio of users at stage C to the number at stage B.)
#table for calculating funnel
users = df_complete.query('event_name!="Tutorial"').pivot_table(
index='user_id',
columns='event_name',
values=['timestamp'],
aggfunc='min').reset_index()
users.columns = ['user_id', 'CartScreenAppear', 'MainScreenAppear', 'OffersScreenAppear', 'PaymentScreenSuccessful']
users.head()
#calculating number of users on each step of the funnel according to the time when the action took place
step_1 = ~users['MainScreenAppear'].isna()
step_2 = step_1 & (users['OffersScreenAppear'] > users['MainScreenAppear'])
step_3 = step_2 & (users['CartScreenAppear'] > users['OffersScreenAppear'])
step_4 = step_3 & (users['PaymentScreenSuccessful'] > users['CartScreenAppear'])
n_MainScreen = users[step_1].shape[0]
n_OffersScreen = users[step_2].shape[0]
n_CartScreen = users[step_3].shape[0]
n_PaymentScreen = users[step_4].shape[0]
print('Visitors:', n_MainScreen)
print('Viewed the offer:', n_OffersScreen)
print('Added product to a cart:', n_CartScreen)
print('Paid:', n_PaymentScreen)
In order to calculate the funnel metrics and build the chart we will use the Funnel() method. We will build it on the basis of the previous calculations (the events_unique_users table).
#forming a table for the funnel chart
list1 = [['MainScreenAppear', n_MainScreen],
['OffersScreenAppear', n_OffersScreen],
['CartScreenAppear', n_CartScreen],
['PaymentScreenSuccessful', n_PaymentScreen]]
events_unique_users_funnel = pd.DataFrame(list1, columns =['events', 'users_number'])
events_unique_users_funnel
#calculating ratio of the users to the previous stage:
events_unique_users_funnel['ratio']= (events_unique_users_funnel['users_number']
.div(events_unique_users_funnel['users_number'].shift(1))
.round(2)
)
#fillna is not in-place, so assign the result back
events_unique_users_funnel = events_unique_users_funnel.fillna(0)
events_unique_users_funnel['percentage_ch'] = events_unique_users_funnel['users_number'].pct_change()*100
events_unique_users_funnel = events_unique_users_funnel.fillna(0)
events_unique_users_funnel
Checking the funnel:
plt.figure(figsize=(12,5))
# plot bar chart
ax = sns.barplot(x=events_unique_users_funnel.events, y=events_unique_users_funnel.percentage_ch, palette='mako')
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.set(xlabel="events", ylabel='Percentage of users')
ax.set_xticklabels(events_unique_users_funnel.events)
for item in ax.get_xticklabels(): item.set_rotation(0)
for i, v in enumerate(events_unique_users_funnel["percentage_ch"].items()):
    ax.text(i, v[1], "{:.1f}%".format(v[1]), color='k', va='bottom', rotation=0)
plt.title('Bar chart for the events funnel (percentage of users lost at every stage)')
plt.show()
We can see that the share of users who do not make it to the next stage grows with every step of the funnel, with the biggest losses at the last stage (74% of users that get to the cart screen do not complete the payment).
#building the funnel
fig = go.Figure(go.Funnel(y = events_unique_users_funnel.events, x = events_unique_users_funnel.users_number,
opacity = 0.85, marker = {"color": ['#333160', '#5b5895', '#3ac7c3' , '#c6f7e7']}))
fig.show()
The funnel chart illustrates that the smallest percentage of users is lost at the very first step: 57% of users that got to the MainScreen get to the OffersScreen.
At the next step, 58% do not get to the CartScreen and are lost. The losses at the last step are the biggest: just 26% of users that got to the CartScreen make a successful payment.
This way, 6% of users that got to the MainScreen make a purchase.
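The 6% figure can be verified by chaining the step-to-step ratios, which telescope to the last count over the first:

```python
# step counts from the funnel calculated above
main, offers, cart, paid = 7419, 4201, 1767, 454

# overall conversion is the product of the step ratios, i.e. paid / main
overall = (offers / main) * (cart / offers) * (paid / cart)
print(f"{overall:.1%}")
```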
There are 5 events forming the funnel; the most frequent one is MainScreen, followed by OfferScreen and CartScreen (19% and 18%), PaymentScreen is at 14%, and the least popular is the Tutorial event (0.4%).
There are only 466 unique users that went through all of the events (out of 7534 users, 6.2%). If we omit the Tutorial event as the least popular one, the number of users that performed all the events is much higher: 3429 users.
If we count the number of sessions per event, we see that the MainScreen and OffersScreen events have very close numbers of sessions (7400-7500), slightly fewer sessions for the Cart event (6400), and even fewer for the Payment event (4300).
As to the events performed at least once, MainScreen is the leader: 37% of users viewed this screen at least once. Then go the Offer, Payment and Cart events with almost equal shares (18-23%), and the Tutorial event is the least popular with only 4% of users.
84% of all users that got to the Tutorial screen viewed it only once; only 16% of users came back to it.
To the contrary, only 6% of users that got to the MainScreen never came back; all the others (94%) viewed this screen more than once. OfferScreen, PaymentScreen and CartScreen were viewed once by 9-10% of all users, thus 90-91% of users came back to these screens more than once.
If we count the users that performed an event only once, the least popular category is still Tutorial: 90% of users never come back to this page. The most popular page is MainScreen: only 3% of users saw it once and did not come back again.
Next we calculated the funnel for the sequence of events MainScreenAppear → OffersScreenAppear → CartScreenAppear → PaymentScreenSuccessful, as Tutorial is obviously not part of the main funnel.
We got the following numbers: MainScreen visitors: 7419, viewed the offer page: 4201, added a product to the cart: 1767, paid: 454. This gives a 6% conversion rate for this funnel. Other users might get converted through a different order of steps, because the overall number of users that got to the Payment page is higher (3539 unique users).
The biggest share of users is lost at the payment stage: only 26% of users that put products in the cart eventually pay for them. This stage should be studied more thoroughly.
Possible issues to improve:
#calculating the number of unique users for each group
experiment_groups = df_complete.groupby('experiment_number').agg({'user_id': 'nunique'}).reset_index()
experiment_groups
plt.figure(figsize=(9,5))
# plot bar chart
ax = sns.barplot(x=experiment_groups.experiment_number, y=experiment_groups.user_id, palette='mako')
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.set(xlabel="experiment no.", ylabel='Number of users')
ax.set_xticklabels(experiment_groups.experiment_number)
for item in ax.get_xticklabels(): item.set_rotation(0)
for i, v in enumerate(experiment_groups["user_id"].items()):
    ax.text(i, v[1], "{:,}".format(v[1]), color='k', va='bottom', rotation=0)
plt.title('Number of users in each group')
plt.show()
We have 3 groups with almost equal numbers of users. Still, there is a difference that might skew the further results, so we will pick 2484 users randomly with the sample() method.
#checking criteria for A/A tests (The number of users in different groups doesn't vary by more than 1%)
experiment_groupsAA = experiment_groups[experiment_groups.experiment_number!=248]
experiment_groupsAA.columns = ['experiment_number', 'users_number']
experiment_groupsAA['ratio']= (experiment_groupsAA['users_number']
.div(experiment_groupsAA['users_number'].shift(1))
.round(3)
)
experiment_groupsAA
The numbers of users differ by slightly more than 1%, thus we will pick 2484 users randomly with the sample() method.
import random
random.seed(42)  # fixing the seed so the sampling below is reproducible
users_247_raw_ID = df_complete[df_complete.experiment_number==247]['user_id'].unique().tolist()
users_247_ID = random.sample(users_247_raw_ID, k=2484)
display(users_247_ID) #we got list of 2484 unique user IDs for group 247
#check
len(users_247_raw_ID)
#check
len(users_247_ID)
#setting a df with visitors from group 247 equal to group 246
#users_247 = df_complete.query('user_id == @users_247_ID')
users_247 = df_complete[df_complete['user_id'].isin(users_247_ID)]
users_247.sample()
users_246 = df_complete.query('experiment_number == 246')
display(users_246.sample())
#combining the 2 data sets: joining the two tables with 246 experiment visitors and 247 experiment visitors together
frames = [users_246, users_247]
users_246_247 = pd.concat(frames)
users_246_247.head()
#grouping events by experiment group:
events_246_247 = users_246_247.pivot_table(index='event_name', values='user_id', columns='experiment_number', aggfunc=lambda x: x.nunique()).reset_index()
events_246_247
Now we are going to check whether groups 246 and 247 have any statistically significant difference.
First we run an A/A test for groups 246 and 247 to check that the data for the A/B test was collected correctly.
We are going to use a proportion z-test for the analysis, as its requirements are met.
#testing proportions
def check_hypothesis(group1, group2, event, alpha):
    successes1 = events_246_247[events_246_247.event_name==event][group1].iloc[0]
    successes2 = events_246_247[events_246_247.event_name==event][group2].iloc[0]
    #for trials we will go back to the previous df with pre-aggregated data
    trials1 = users_246_247[users_246_247.experiment_number==group1]['user_id'].nunique()
    trials2 = users_246_247[users_246_247.experiment_number==group2]['user_id'].nunique()
    print("Success for group", group1, "is", successes1, "for event", event, "\n",
          "Success for group", group2, "is", successes2, "for event", event, "\n",
          "Trials for group", group1, "is", trials1, "\n",
          "Trials for group", group2, "is", trials2, "\n")
The most popular event is MainScreenAppear, so we will run the check_hypothesis function for this event first.
As this is an A/A test, we set alpha to 1%. One of the criteria for a successful A/A test is as follows: "None of the key metrics varies by a statistically significant amount—typically by no more than 1%".
check_hypothesis(246,247, 'MainScreenAppear', alpha=0.01)
Stating the hypotheses.
The null hypothesis (H0): the proportions are the same.
The alternative hypothesis (H1): the proportions are not the same.
The whole function:
#testing the hypotheses:
def check_hypothesis(group1, group2, event, alpha):
    #successes: unique users who performed the event in each group
    successes1 = events_246_247[events_246_247.event_name==event][group1].iloc[0]
    successes2 = events_246_247[events_246_247.event_name==event][group2].iloc[0]
    #trials: total unique users in each group, from the previous df with pre-aggregated data
    trials1 = users_246_247[users_246_247.experiment_number==group1]['user_id'].nunique()
    trials2 = users_246_247[users_246_247.experiment_number==group2]['user_id'].nunique()
    #proportion of successes in the first group
    p1 = successes1/trials1
    #proportion of successes in the second group
    p2 = successes2/trials2
    #proportion in the combined dataset
    p_combined = (successes1 + successes2) / (trials1 + trials2)
    difference = p1 - p2
    z_value = difference / math.sqrt(p_combined * (1 - p_combined) * (1/trials1 + 1/trials2))
    distr = st.norm(0, 1)
    p_value = (1 - distr.cdf(abs(z_value))) * 2
    print('p-value: ', p_value)
    if p_value < alpha:
        print("Reject H0 for", event, 'and groups', group1, 'and', group2)
    else:
        print("Fail to reject H0 for", event, 'and groups', group1, 'and', group2)
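As a quick sanity check on the arithmetic inside check_hypothesis, the same pooled-proportion z-test can be written as a standalone function and probed with hypothetical counts:

```python
import math
from scipy import stats as st

def proportions_z_pvalue(successes1, trials1, successes2, trials2):
    # pooled proportion under H0: p1 == p2
    p_combined = (successes1 + successes2) / (trials1 + trials2)
    difference = successes1 / trials1 - successes2 / trials2
    z_value = difference / math.sqrt(
        p_combined * (1 - p_combined) * (1 / trials1 + 1 / trials2)
    )
    # two-sided p-value from the standard normal distribution
    return 2 * (1 - st.norm.cdf(abs(z_value)))

# identical proportions -> z = 0 -> p-value = 1 (no evidence of a difference)
print(proportions_z_pvalue(1200, 2484, 1200, 2484))  # 1.0
# clearly different proportions (hypothetical counts) -> tiny p-value
print(proportions_z_pvalue(1500, 2484, 1200, 2484) < 0.01)  # True
```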
Now we are going to run the function for the two groups comparing all the events.
for i in events_246_247.event_name.unique():
    check_hypothesis(246, 247, i, alpha=0.01)
The data does not provide sufficient evidence, at the significance level we selected (1%), to reject the null hypothesis. Therefore we conclude that the proportion of users performing each event does not differ between the groups.
There is no statistically significant difference in the A/A test results (experiment groups 246 and 247), so we can trust that the groups were split correctly.
Our sample size is large enough for both the t-test and the z-test, so we will use the z-test first and then cross-check the calculations with a t-test.
Now we are going to form a general table for all 3 experiments with equal number of users.
#picking up the same number of user IDs as for the previous groups (2484)
users_248_raw_ID = df_complete[df_complete.experiment_number==248]['user_id'].unique().tolist()
users_248_ID = random.sample(users_248_raw_ID, k=2484)
#setting a df with the required visitor IDs (group 248 sample size = groups 247 and 246)
users_248 = df_complete[df_complete['user_id'].isin(users_248_ID)]
users_248.sample()
#combining the data sets: joining the tables for all 3 experiment groups (246, 247 and 248):
frames = [users_246_247, users_248]
users_246_247_248 = pd.concat(frames)
users_246_247_248.head()
#grouping events by experiment group:
events_246_247_248 = users_246_247_248.pivot_table(index='event_name', values='user_id', columns='experiment_number', aggfunc=lambda x: x.nunique()).reset_index()
events_246_247_248
We will use the same function to test the hypotheses for the A/B test. As we are going to run 3 comparisons (246/247, 246/248, 247/248) we will adjust the significance level accordingly, using the Šidák method. Normally alpha is set to 0.05, a commonly accepted level in industry.
Calculating alpha with the Šidák method (it guarantees a family-wise error rate below α); the required significance level is found as follows:
alpha_sidak = 1 - (1 - 0.05)**(1/3) #we have 3 groups which gives us 3 combinations for comparison (246/247, 246/248, 247/248)
round(alpha_sidak,4)
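For comparison, the Bonferroni correction (alpha / m) gives a slightly stricter threshold than Šidák; the Šidák threshold keeps the family-wise error rate at exactly the desired level. A small check:

```python
m = 3                 # number of pairwise comparisons (246/247, 246/248, 247/248)
alpha_family = 0.05   # desired family-wise error rate

alpha_sidak = 1 - (1 - alpha_family) ** (1 / m)   # ≈ 0.0170
alpha_bonferroni = alpha_family / m               # ≈ 0.0167, slightly stricter

# Šidák keeps the family-wise error rate at exactly alpha_family
print(round(1 - (1 - alpha_sidak) ** m, 10))  # 0.05
```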
Stating the hypotheses (H0 states an unchanged result):
H0: there is no difference between groups A and B in the number of users who performed the actions/events.
H1: there is a statistically significant difference between groups A and B in the number of users who performed the actions/events.
def check_hypothesis(group1, group2, event, alpha):
    #successes: unique users who performed the event in each group
    successes1 = events_246_247_248[events_246_247_248.event_name==event][group1].iloc[0]
    successes2 = events_246_247_248[events_246_247_248.event_name==event][group2].iloc[0]
    #trials: total unique users in each group, from the previous df with pre-aggregated data
    trials1 = users_246_247_248[users_246_247_248.experiment_number==group1]['user_id'].nunique()
    trials2 = users_246_247_248[users_246_247_248.experiment_number==group2]['user_id'].nunique()
    #proportion of successes in the first group
    p1 = successes1/trials1
    #proportion of successes in the second group
    p2 = successes2/trials2
    #proportion in the combined dataset
    p_combined = (successes1 + successes2) / (trials1 + trials2)
    difference = p1 - p2
    z_value = difference / math.sqrt(p_combined * (1 - p_combined) * (1/trials1 + 1/trials2))
    distr = st.norm(0, 1)
    p_value = (1 - distr.cdf(abs(z_value))) * 2
    print('p-value: ', p_value)
    if p_value < alpha:
        print("Reject H0 for", event, 'and groups', group1, 'and', group2)
    else:
        print("Fail to reject H0 for", event, 'and groups', group1, 'and', group2)
#Comparison of events for 247/248
for i in events_246_247_248.event_name.unique():
    check_hypothesis(247, 248, i, alpha_sidak)
#Comparison of events for 246/248
for i in events_246_247_248.event_name.unique():
    check_hypothesis(246, 248, i, alpha_sidak)
#Comparison of events for 246/247 (with the significance level changed to alpha_sidak)
for i in events_246_247_248.event_name.unique():
    check_hypothesis(246, 247, i, alpha_sidak)
The data does not provide sufficient evidence, at the significance level we selected (≈1.7%), to reject the null hypothesis. Therefore we conclude that the proportion of users performing each event does not differ between the 3 groups (246, 247, 248) at any stage of the funnel. Changing the fonts has no detectable influence on the users.
Now we are going to test the difference between the groups overall, to confirm the previous calculations with a t-test.
#forming a table
samples = users_246_247_248.pivot_table(index='event_date', values='user_id', columns='experiment_number', aggfunc=lambda x: x.nunique()).reset_index()
samples.columns = ['date', '246', '247', '248']
samples
Creating variables with the samples for each group.
sample_246 = samples.loc[:, samples.columns == '246'].values.tolist()
sample_246 = [i[0] for i in sample_246]
sample_246
#Shapiro-Wilk test for normality of the daily user counts in group 246
results = st.shapiro(sample_246)
results
sample_247 = samples.loc[:, samples.columns == '247'].values.tolist()
sample_247 = [i[0] for i in sample_247]
sample_247
sample_248 = samples.loc[:, samples.columns == '248'].values.tolist()
sample_248 = [i[0] for i in sample_248]
sample_248
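The Shapiro-Wilk check above was run only for sample_246; it can be applied to all three groups in one loop. A sketch with made-up daily unique-user counts (the real values come from the samples table):

```python
from scipy import stats as st

# hypothetical daily unique-user counts for the three groups
samples = {
    '246': [330, 341, 350, 329, 338, 345, 333],
    '247': [328, 339, 352, 331, 340, 342, 336],
    '248': [325, 344, 348, 327, 335, 347, 331],
}
for name, sample in samples.items():
    stat, p = st.shapiro(sample)
    # a p-value above the chosen alpha means no evidence against normality
    print(name, round(p, 3))
```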
Next we compute the variance of the 3 samples to decide whether we can treat them as equal for the t-test.
variance_246 = np.var(sample_246)
print('Variance for 246 sample is ', variance_246.round(1))
variance_248 = np.var(sample_248)
print('Variance for 248 sample is ', variance_248.round(1))
variance_247 = np.var(sample_247)
print('Variance for 247 sample is ', variance_247.round(1))
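One caveat: np.var defaults to ddof=0 (the population variance, dividing by n); for a sample drawn from a larger population the unbiased estimate divides by n − 1 (ddof=1). A small check with toy numbers:

```python
import numpy as np

x = [2.0, 4.0, 6.0, 8.0]    # mean = 5.0, squared deviations sum to 20
print(np.var(x))            # 5.0       (population variance, divides by 4)
print(np.var(x, ddof=1))    # 6.666...  (sample variance, divides by 3)
```

For comparing whether the variances are roughly equal, either convention works as long as it is applied consistently to all three samples.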
The variances of the samples are not equal.
As we have samples of continuous data drawn from approximately normally distributed populations, but with unequal variances, we will conduct an unpaired two-tailed t-test in Welch's version (in Python, ttest_ind with equal_var = False) to test the stated hypotheses.
Stating the hypotheses (H0 states an unchanged result):
H0: there is no difference between the two groups in the number of users who performed the actions/events.
H1: there is a statistically significant difference between the two groups in the number of users who performed the actions/events.
# equal_var=False, as the previous calculations showed that the 3 samples have different variances
results_246_247 = st.ttest_ind(sample_246, sample_247, equal_var=False)
results_247_248 = st.ttest_ind(sample_247, sample_248, equal_var=False)
results_246_248 = st.ttest_ind(sample_246, sample_248, equal_var=False)
#we are going to use alpha_sidak here as there were 3 comparisons made
print('p-value to compare the groups 246 and 247: ', results_246_247.pvalue)
print('p-value to compare the groups 247 and 248: ', results_247_248.pvalue)
print('p-value to compare the groups 246 and 248: ', results_246_248.pvalue)
if results_246_247.pvalue < alpha_sidak:
    print("Null hypothesis rejected for groups 246 and 247")
else:
    print("Null hypothesis not rejected for groups 246 and 247")
if results_247_248.pvalue < alpha_sidak:
    print("Null hypothesis rejected for groups 247 and 248")
else:
    print("Null hypothesis not rejected for groups 247 and 248")
if results_246_248.pvalue < alpha_sidak:
    print("Null hypothesis rejected for groups 246 and 248")
else:
    print("Null hypothesis not rejected for groups 246 and 248")
The data does not provide sufficient evidence, at the significance level we selected (≈1.7%), to reject the null hypothesis. Therefore we conclude that the average daily number of unique users does not differ between the 3 groups (246, 247, 248) (μ1 = μ2 = μ3).
All the tests (proportion z-test and t-test) run on the 3 experiment groups, both with and without splitting by events, show no statistically significant difference between the groups. Changing the fonts has no detectable impact on users.
In the course of data preprocessing we studied a dataset with 244126 entries, 5 event types and 3 experiment groups for 7551 unique users.
We have:
There are 5 types of events in the data set: almost half of the logs was on the MainScreen and the least popular type is Tutorial with only 0.4% of logs.
Number events per user vary greatly, from 1 to 2307; middle 50% of users had from 9 to 37 events; the average number of events per user is 20.
There are logs from 2019-07-25 to 2019-08-07 in the data set, but the study of the data distribution shows that the log data is incomplete up to and including 2019-07-31. Thus we dropped the logs before 01/08/2019, losing 16% of entries (2826 entries) and 19% of unique users.
Some technical issues with users' logs could account for that.
The filtered data still has users from all three experimental groups.
There are 4 events forming the funnel; the most frequent one is MainScreen (49% of logs), followed by OfferScreen and CartScreen (19% and 18%) and PaymentScreen (14%). The least popular event (Tutorial, with 0.4% of logs) is not a necessary step of the funnel and we do not include it in the funnel calculations.
There are only 466 unique users that went through all of the events (6.2% of all users). If we omit the Tutorial event as the least popular one, the number of users that performed all the events is much higher: 3429 users.
As for events performed at least once, MainScreen is the leader: 37% of users viewed this screen at least once. Then come the Offer, Payment and Cart events with almost equal shares (18-23%), and Tutorial is the least popular with only 4% of users.
84% of all users that got to the Tutorial screen viewed it once. That means that only 16% of users came back to it.
By contrast, only 6% of users that got to the MainScreen never came back; all the others (94%) viewed this screen more than once. OfferScreen, PaymentScreen and CartScreen were viewed only once by 9-10% of all users, so 90-91% of users came back to these screens more than once.
The funnel is calculated for the following sequence of events MainScreenAppear → OffersScreenAppear → CartScreenAppear → PaymentScreenSuccessful as Tutorial obviously is not part of the main funnel.
We have got the following numbers: MainScreen visitors: 7419 and paid: 454. This way we got 6% conversion rate for this funnel. Other users might get converted going through another steps, because general number of users that got to Payment page is higher (3539 unique users).
The biggest share of users is lost at the payment stage: only 26% of users that put products into the cart eventually pay for them. This stage should be studied more thoroughly for possible issues to improve.
We ran an A/A/B test on 3 groups. First we verified that the test was designed correctly: there is no statistically significant difference between the A/A groups (246 and 247).
Then we ran the same test comparing group 248 with 246 and 247. We randomly drew equal sample sizes for each group and corrected the significance level with the Šidák method, as we ran multiple comparisons.
All tests showed that, at the given significance level, we cannot conclude that there is a difference between the tested groups. The samples did not provide sufficient evidence that altering the fonts has any effect.