In the course of this project we are going to analyse a list of hypotheses and the results of an A/B test.
In Part 1 we will rank hypotheses that may help boost the revenue of a big online store. To do this we will look into the data stored in the "hypotheses" table and prioritize every hypothesis using two approaches (ICE and RICE).
In Part 2 we are going to analyse the A/B test results collected during a one-month experiment. We will look into the data stored in two tables: orders_us and visits_us. For the purpose of the analysis we will split the collected data into two equal groups and calculate the statistical significance of the difference in average order size and in conversion rates. This way we will check whether the difference in the key metrics of these groups is due to random chance or due to the treatment.
In the end we will conclude whether the experiment was a success and whether we should continue collecting data or stop.
Part 1. Prioritizing Hypotheses
Part 2. A/B Test Analysis
Data Study
1. Cumulative revenue by group
2. Cumulative average order size by group
3. Relative difference in cumulative average order size for group B compared with group A
4. Conversion rate
5. Scatter chart of the number of orders per user
6. 95th and 99th percentiles for the number of orders per user
7. Scatter chart of order prices
8. 95th and 99th percentiles of order prices
9. Statistical significance of the difference in conversion between the groups using the raw data
10. Statistical significance of the difference in average order size between the groups using the raw data
11. Statistical significance of the difference in conversion between the groups using the filtered data
12. Statistical significance of the difference in average order size between the groups using the filtered data
13. Decision based on the test results
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats as st
import seaborn as sns
from IPython.display import display
hypotheses = pd.read_csv('./hypotheses_us.csv', sep=';')
display(hypotheses)
#lowercasing all column names of the hypotheses df
hypotheses.columns = map(str.lower, hypotheses.columns)
hypotheses.info()
hypotheses['hypothesis'].value_counts()
We are going to calculate the ICE score as follows: ICE = (impact * confidence) / effort
#adding the ICE column to the table:
hypotheses['ICE'] = (hypotheses['impact']*hypotheses['confidence'])/hypotheses['effort']
#printing the result:
print(hypotheses[['hypothesis','ICE']].sort_values(by='ICE', ascending=False))
According to the ICE framework the most highly ranked hypotheses are nos. 8 and 0: "Launch a promotion that gives users discounts on their birthdays" and "Add two new channels for attracting traffic. This will bring 30% more users".
This method is helpful when it is important to understand how many of our customers a given feature will benefit. We are going to calculate the RICE score the following way:
RICE = (reach * impact * confidence) / effort
#adding the RICE column to the table:
hypotheses['RICE'] = (hypotheses['reach']*hypotheses['impact']*hypotheses['confidence'])/hypotheses['effort']
#printing the result:
print(hypotheses[['hypothesis','RICE']].sort_values(by='RICE', ascending=False))
Taking the reach factor into account has changed the ranking dramatically. Now the top-priority hypotheses are nos. 7 and 2: "Add a subscription form to all the main pages. This will help you compile a mailing list" and "Add product recommendation blocks to the store's site. This will increase conversion and average purchase size".
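To make the effect of the reach factor concrete, here is a tiny toy illustration with made-up scores (not taken from the dataset): a wide-reach hypothesis with a modest ICE score overtakes a narrow-reach one with a higher ICE score once reach is multiplied in.
# toy illustration with hypothetical scores: how reach reshuffles the ranking
toy = pd.DataFrame(
    {'reach': [10, 2], 'impact': [5, 9], 'confidence': [8, 9], 'effort': [4, 4]},
    index=['wide reach', 'narrow reach'])
toy['ICE'] = toy['impact'] * toy['confidence'] / toy['effort']   # 10.0 vs 20.25
toy['RICE'] = toy['reach'] * toy['ICE']                          # 100.0 vs 40.5
print(toy)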
#heat map for the ranked hypotheses:
hypotheses_ranged = hypotheses.drop(['hypothesis','reach', 'impact', 'confidence', 'effort'], axis=1)
sns.set(style='white')
plt.figure(figsize=(20, 3))
plt.title('Ranked hypotheses', fontsize=16)
sns.heatmap(hypotheses_ranged.T, annot=True, fmt='g', linewidths=1, linecolor='grey');
Hypothesis ranking looks different under the two frameworks. If we want to take as many factors as possible into consideration, we should start by testing the RICE top-ranked hypotheses: nos. 7, 2, 0 and 6. This way we will prioritize the hypotheses that reach the maximum number of customers.
#adjustments of the paths for the Practicum platform:
#orders = pd.read_csv('/datasets/orders_us.csv', sep=',')
#visits = pd.read_csv('/datasets/visits_us.csv', sep=',')
orders = pd.read_csv('./orders_us.csv', sep=',')
visits = pd.read_csv('./visits_us.csv', sep=',')
display(orders)
orders.info()
#changing the date type
orders['date'] = pd.to_datetime(orders['date'], format="%Y.%m.%d")
#checking if there are visitors that got to both groups, A and B
visitors_groups = orders.groupby('visitorId', as_index=False).agg({'group' : pd.Series.nunique})
visitors_duplicated = visitors_groups.query('group>1')
visitors_duplicated['visitorId'].count()
We have 58 visitors that ended up in both groups. We will exclude them from our dataset (according to A/B testing rules, a user can't take part in two groups at once) and then form equal groups A and B using random sampling.
#excluding visitors that got to 2 groups:
orders_filtered = orders.query('visitorId not in @visitors_duplicated.visitorId')
display(orders_filtered)
visitorsA = orders_filtered.query('group=="A"')
display(visitorsA)
visitorsA['visitorId'].nunique()
visitorsB_raw = orders_filtered.query('group=="B"')
display(visitorsB_raw)
visitorsB_raw['visitorId'].nunique()
As group B is bigger than group A (528 vs. 445 unique visitors), we will randomly sample from it the same number of visitors as in group A (445).
import random
visitorsB_raw_ID = visitorsB_raw['visitorId'].unique().tolist()
visitorsB_ID = random.sample(visitorsB_raw_ID, k=445)
display(visitorsB_ID) #we got a list of 445 unique visitor IDs for group B
#setting a df with group B visitors equal in number to group A
visitorsB = visitorsB_raw.query('visitorId in @visitorsB_ID')
display(visitorsB)
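As a side note, random.sample is not seeded above, so the split is not reproducible between runs. A seedable alternative sketch using pandas (the random_state value here is an arbitrary assumption) would be:
# reproducible alternative: sample 445 unique group-B visitors with a fixed seed
visitorsB_ID_alt = (visitorsB_raw['visitorId']
                    .drop_duplicates()
                    .sample(n=445, random_state=42)  # arbitrary seed for reproducibility
                    .tolist())
visitorsB_alt = visitorsB_raw.query('visitorId in @visitorsB_ID_alt')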
#joining the two tables with A visitors and B visitors together
frames = [visitorsA, visitorsB]
orders_filtered = pd.concat(frames)
orders_filtered
#moving on to "visits" df:
display(visits)
visits.info()
#changing the date type
visits['date'] = pd.to_datetime(visits['date'], format="%Y.%m.%d")
We are going to plot cumulative revenue by group and draw conclusions.
# building an array with unique paired date-group values
datesGroups = orders_filtered[['date','group']].drop_duplicates()
display(datesGroups)
# getting aggregated cumulative daily data on orders revenue
ordersAggregated = datesGroups.apply(lambda x: orders_filtered[np.logical_and(orders_filtered['date'] <= x['date'], orders_filtered['group'] == x['group'])].agg({'date' : 'max', 'group' : 'max', 'transactionId' : pd.Series.nunique, 'visitorId' : pd.Series.nunique, 'revenue' : 'sum'}), axis=1).sort_values(by=['date','group'])
ordersAggregated.head()
# getting aggregated cumulative daily data on visits
#the cumulative number of visits in each test group up to the specified date, inclusive
visitsAggregated = datesGroups.apply(lambda x: visits[np.logical_and(visits['date'] <= x['date'], visits['group'] == x['group'])].agg({'date' : 'max', 'group' : 'max', 'visits' : 'sum'}), axis=1).sort_values(by=['date','group'])
visitsAggregated.head()
# merging the two tables into one and giving its columns descriptive names
cumulativeData = ordersAggregated.merge(visitsAggregated, left_on=['date', 'group'], right_on=['date', 'group'])
cumulativeData.columns = ['date', 'group', 'orders', 'buyers', 'revenue', 'visits']
print(cumulativeData.head(5))
We have received cumulative data on revenue per day for every group. Now we are going to visualize it for further analysis.
# DataFrame with cumulative orders and cumulative revenue by day, group A
cumulativeRevenueA = cumulativeData[cumulativeData['group']=='A'][['date','revenue', 'orders']]
# DataFrame with cumulative orders and cumulative revenue by day, group B
cumulativeRevenueB = cumulativeData[cumulativeData['group']=='B'][['date','revenue', 'orders']]
# Plotting the group A revenue graph
plt.figure(figsize=(20,10))
plt.plot(cumulativeRevenueA['date'], cumulativeRevenueA['revenue'], label='A')
# Plotting the group B revenue graph
plt.plot(cumulativeRevenueB['date'], cumulativeRevenueB['revenue'], label='B')
plt.xticks(rotation=90)
plt.title('Cumulative Revenue Graph for A and B groups', fontsize=18)
plt.xlabel('Time period', fontsize=12)
plt.ylabel('Revenue Amount', fontsize=12)
plt.grid()
plt.legend()
plt.show()
Until 18.08.2019 cumulative revenue for both groups fluctuated slightly around similar figures and was almost equal. After 18.08.2019 the revenue of group A rose significantly and it became the leader. This might have been caused by a single big or unusually expensive order placed in this group. We will plot another graph to investigate.
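A quick way to check the big-order suspicion before plotting (an illustrative aside, not part of the original flow) is to look directly at the largest orders:
# illustrative check: the five largest orders, to spot a one-off revenue spike
print(orders_filtered.sort_values(by='revenue', ascending=False)
      .head(5)[['date', 'group', 'visitorId', 'revenue']])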
#We will plot average purchase size by groups. We'll divide cumulative revenue by the cumulative number of orders:
plt.figure(figsize=(20,10))
plt.plot(cumulativeRevenueA['date'], cumulativeRevenueA['revenue']/cumulativeRevenueA['orders'], label='A')
plt.plot(cumulativeRevenueB['date'], cumulativeRevenueB['revenue']/cumulativeRevenueB['orders'], label='B')
plt.xticks(rotation=90)
plt.title('Average Purchase Size for A and B groups', fontsize=18)
plt.xlabel('Time period', fontsize=12)
plt.ylabel('Purchase Size', fontsize=12)
plt.grid()
plt.legend()
plt.show()
Average purchase size in group B exceeded that of group A from the very beginning of the experiment. Values for both groups fluctuated strongly during the first half of the experiment; the fluctuations eased after 15/08/2019. After this date the cumulative average order size for both groups stabilized around the same value (105-110), but by the end of the experiment group A slightly surpassed group B. Sharp peaks in the charts can be explained by a few expensive or large orders. We will investigate them later.
We will plot a relative difference graph for the average purchase sizes. To do this we will form a new data frame with cumulative revenue for both groups by merging the two tables. The relative difference will be calculated as the ratio of the cumulative average order size of group B to that of group A, minus 1.
We'll add a horizontal line where the ratio equals zero (the values of both groups are equal).
# gathering the data into one DataFrame
mergedCumulativeRevenue = cumulativeRevenueA.merge(cumulativeRevenueB, left_on='date', right_on='date', how='left', suffixes=['A', 'B'])
# plotting a relative difference graph for the average purchase sizes
plt.figure(figsize=(20,10))
plt.plot(mergedCumulativeRevenue['date'], (mergedCumulativeRevenue['revenueB']/mergedCumulativeRevenue['ordersB'])/(mergedCumulativeRevenue['revenueA']/mergedCumulativeRevenue['ordersA'])-1)
plt.xticks(rotation=90)
plt.grid()
# adding a horizontal reference line at y=0
plt.axhline(y=0, color='black', linestyle=':')
plt.title('Relative Difference graph for the average purchase sizes for A and B groups', fontsize=18)
plt.xlabel('Time period', fontsize=12)
plt.ylabel('Relative Purchase Size', fontsize=12)
plt.show()
The average purchase size of group B started to exceed group A after 03/08/2019, reaching its peak on 07/08/2019; it then dropped for several days, from 12/08/2019 till 15/08/2019. After that the relative difference stabilized around 0 and slightly dropped after 27/08/2019. During the second half of the experiment the two groups did not show much difference in average purchase size.
We will calculate conversion rate as the ratio of orders to the number of visits for each day. Next we will plot it.
#checking the date range:
mergedCumulativeRevenue['date'].describe()
We will calculate cumulative conversion and then the relative difference between the cumulative conversion rates.
#calculating cumulative conversion as the ratio of orders to the number of visits for each day
cumulativeData['conversion'] = cumulativeData['orders']/cumulativeData['visits']
# selecting data on group A
cumulativeDataA = cumulativeData[cumulativeData['group']=='A']
# selecting data on group B
cumulativeDataB = cumulativeData[cumulativeData['group']=='B']
# plotting the graphs
plt.figure(figsize=(20,10))
plt.plot(cumulativeDataA['date'], cumulativeDataA['conversion'], label='A')
plt.plot(cumulativeDataB['date'], cumulativeDataB['conversion'], label='B')
plt.legend()
plt.xticks(rotation=90)
#plt.axis(['2019-08-01', '2019-08-31', 0, 0.015])
plt.grid()
plt.title('Cumulative Conversion for A and B groups', fontsize=18)
plt.xlabel('Time period', fontsize=12)
plt.ylabel('Conversion rate', fontsize=12)
plt.show()
The chart demonstrates that group A had a much better conversion rate at the start of the experiment (0.032 vs 0.02), but then it dropped significantly and the difference was not so dramatic afterwards. Fluctuations in the conversion of one group corresponded with fluctuations of the other: they had peaks and troughs on the same dates. The fluctuations eased by the end of the experiment for both groups, and the difference in conversion rates was not big but stable.
#plotting a relative difference graph for the cumulative conversion rates B/A:
mergedCumulativeConversions = cumulativeDataA[['date','conversion']].merge(cumulativeDataB[['date','conversion']], left_on='date', right_on='date', how='left', suffixes=['A', 'B'])
plt.figure(figsize=(20,10))
plt.plot(mergedCumulativeConversions['date'], mergedCumulativeConversions['conversionB']/mergedCumulativeConversions['conversionA']-1, label="Relative gain in conversion in group B as opposed to group A")
plt.legend()
plt.axhline(y=0, color='black', linestyle='--')
#plt.axhline(y=-0.1, color='grey', linestyle='--')
#plt.axis(["2019-08-01", '2019-08-31', -0.6, 0.6])
plt.xticks(rotation=90)
plt.grid()
plt.title('Relative difference graph for B/A groups', fontsize=18)
plt.xlabel('Time period', fontsize=12)
plt.ylabel('rate', fontsize=12)
plt.show()
#plotting a relative difference graph for the cumulative conversion rates A/B:
plt.figure(figsize=(20,10))
plt.plot(mergedCumulativeConversions['date'], mergedCumulativeConversions['conversionA']/mergedCumulativeConversions['conversionB']-1, label="Relative gain in conversion in group A as opposed to group B")
plt.legend()
plt.axhline(y=0, color='black', linestyle='--')
#plt.axhline(y=-0.1, color='grey', linestyle='--')
#plt.axis(["2019-08-01", '2019-08-31', -0.6, 0.6])
plt.xticks(rotation=90)
plt.grid()
plt.title('Relative difference graph for A/B groups', fontsize=18)
plt.xlabel('Time period', fontsize=12)
plt.ylabel('rate', fontsize=12)
plt.show()
These charts confirm that the conversion of group A surpasses that of group B. Although the difference eased during the month of the experiment, it is still about 5% higher for group A.
#creating a table with data on users and orders
ordersByUsers = orders_filtered.drop(['group', 'revenue', 'date'], axis=1).groupby('visitorId', as_index=False).agg({'transactionId' : pd.Series.nunique})
ordersByUsers.columns = ['visitorId','orders']
print(ordersByUsers.sort_values(by='orders',ascending=False).head(10))
#plotting chart
x_values = pd.Series(range(0, len(ordersByUsers)))
plt.scatter(x_values, ordersByUsers['orders'])
plt.title('Scatter chart of orders per visitor', fontsize=18)
plt.ylabel('Number of orders', fontsize=12)
plt.grid()
plt.show()
Most of the visitors made 1 order, just a few of them made 2, and only several made 3 orders. To define the share of outliers we will calculate percentiles below.
Now we will look at the number of orders per buyer for every group and its dynamics.
#calculating cumulative orders per buyer for each day
cumulativeData['orders_per_users'] = cumulativeData['orders']/cumulativeData['buyers']
# selecting data on group A
cumulativeDataA = cumulativeData[cumulativeData['group']=='A']
# selecting data on group B
cumulativeDataB = cumulativeData[cumulativeData['group']=='B']
# plotting the scatter chart
plt.figure(figsize=(20,10))
x_values = pd.Series(range(0, len(cumulativeDataA['orders_per_users'])))
plt.scatter(x_values, cumulativeDataA['orders_per_users'], label='A')
plt.scatter(x_values, cumulativeDataB['orders_per_users'], label='B')
plt.title('Scatter chart of orders per user', fontsize=18)
plt.ylabel('Orders/users', fontsize=12)
plt.legend()
plt.grid()
plt.show()
At the beginning group A got 20% more orders per buyer; by the 23rd day of the experiment the values stabilized around 1.05 orders per buyer for both groups.
#Calculate the 95th and 99th percentiles for the number of orders per user.
np.percentile(ordersByUsers['orders'], [95, 99])
#setting a variable for further calculations
too_many = int(np.percentile(ordersByUsers['orders'], 99))
too_many
No more than 5% of users placed more than 1 order, and no more than 1% of users made more than 2 orders. Based on these calculations we can define the point at which a data point becomes an anomaly: visitors with 3 or more orders are outliers.
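To double-check the chosen cut-off, here is a small sketch (not part of the original notebook) computing the actual share of users above it:
# share of users whose number of orders exceeds the 99th-percentile cut-off
share_above = (ordersByUsers['orders'] > too_many).mean()
print("Share of users above the cut-off: {0:.2%}".format(share_above))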
#histogram of order price distribution
plt.hist(orders_filtered['revenue'], log=True)
plt.title('order price distribution', fontsize=18)
plt.xlabel('Revenue per order', fontsize=12)
plt.grid()
plt.show()
We can see that most of the orders bring in up to 1500 in revenue. Still, there are some outliers with revenue over 3000. To investigate them we will build a scatter plot.
#another chart to evaluate order prices:
x_values = pd.Series(range(0, len(orders_filtered['revenue'])))
plt.scatter(x_values, orders_filtered['revenue'])
plt.title('Scatter chart of revenue per order', fontsize=18)
plt.ylabel('Revenue', fontsize=12)
plt.grid()
plt.show()
The second chart confirms that there are only a few orders with revenue over 1000. They can be treated as outliers. To set the upper fence more precisely we will calculate the percentiles as well.
#Calculating the 95th and 99th percentiles of the order prices.
np.percentile(orders_filtered['revenue'], [95, 99])
#setting a variable for further calculations
too_expencive = int(np.percentile(orders_filtered['revenue'], 99))
too_expencive
No more than 5% of orders cost over 401, and no more than 1% of orders brought revenue over 830.30. Thus the point beyond which an order becomes anomalously expensive is 830.30, and orders with revenue over 830.30 can be treated as outliers.
We are going to calculate the statistical significance of the difference in conversion between the groups using the Mann-Whitney U test, because we have independent samples of a quantitative variable (the test is a nonparametric counterpart of Student's t-test) and the data doesn't follow a normal distribution, as the histogram of order prices above shows.
We'll also find the relative difference in conversion between the groups:
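As optional, more formal backing for the non-normality claim, here is a small illustrative sketch (not part of the original analysis) applying the Shapiro-Wilk test from scipy to the order revenues:
# optional sanity check: Shapiro-Wilk test for normality of order revenues
# (a very small p-value suggests the data is unlikely to come from a normal distribution)
w_stat, p_norm = st.shapiro(orders_filtered['revenue'])
print("Shapiro-Wilk p-value: {0:.4f}".format(p_norm))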
#splitting tables with data on two groups:
ordersByUsersA = orders_filtered[orders_filtered['group']=='A'].groupby('visitorId', as_index=False).agg({'transactionId' : pd.Series.nunique})
ordersByUsersA.columns = ['visitorId', 'orders']
ordersByUsersB = orders_filtered[orders_filtered['group']=='B'].groupby('visitorId', as_index=False).agg({'transactionId' : pd.Series.nunique})
ordersByUsersB.columns = ['visitorId', 'orders']
#Forming the samples: for each group, one entry per visit (the order count for buyers plus zeros for visitors who placed no orders)
sampleA = pd.concat([ordersByUsersA['orders'],pd.Series(0, index=np.arange(visits[visits['group']=='A']['visits'].sum() - len(ordersByUsersA['orders'])), name='orders')],axis=0)
sampleB = pd.concat([ordersByUsersB['orders'],pd.Series(0, index=np.arange(visits[visits['group']=='B']['visits'].sum() - len(ordersByUsersB['orders'])), name='orders')],axis=0)
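As a quick illustrative sanity check (not in the original notebook), the length of each sample should equal the total number of visits in its group, since it combines buyers' order counts with zeros for visitors who placed no orders:
# sanity check: sample length should match the total number of visits per group
print(len(sampleA), visits[visits['group']=='A']['visits'].sum())
print(len(sampleB), visits[visits['group']=='B']['visits'].sum())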
Stating the hypotheses:
H0: there is no statistically significant difference in conversion between the groups
H1: the conversion level differs between the groups
The critical statistical significance level is set at 0.05, as it is commonly accepted in the industry for non-multiple testing.
alpha = .05
results = st.mannwhitneyu(sampleA, sampleB)[1]
print("{0:.3f}".format(results))
if results < alpha:
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis")
We calculated the p-value by applying the mannwhitneyu() criterion.
The p-value is greater than alpha, so at the 5% significance level we fail to reject the null hypothesis: the data does not provide sufficient evidence of a difference. Therefore, we cannot conclude that conversion differs between the groups.
# Calculating the relative difference in conversion for group B vs. group A:
print("{0:.3f}".format(sampleB.mean()/sampleA.mean()-1)) #the relative loss of group B
We can't reject the null hypothesis that there is no statistically significant difference in conversion between the groups. The relative loss of group B is 4.3%.
To calculate the statistical significance of the difference in the segments' average order size, we'll pass the data on revenue to the mannwhitneyu() criterion. We'll also find the relative difference in average order size between the groups.
Stating the hypotheses:
H0: the average order size is the same for both groups
H1: the average order size differs between the groups
The critical statistical significance level is set at 0.05, as it is commonly accepted in the industry for non-multiple testing.
alpha = .05
results = st.mannwhitneyu(orders_filtered[orders_filtered['group']=='A']['revenue'], orders_filtered[orders_filtered['group']=='B']['revenue'])[1]
print("{0:.3f}".format(results))
if results < alpha:
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis")
#Calculating the relative difference in average order size between the groups (B/A):
print("{0:.3f}".format(orders_filtered[orders_filtered['group']=='B']['revenue'].mean()/orders_filtered[orders_filtered['group']=='A']['revenue'].mean()-1))
The p-value is considerably higher than 0.05, so there is no reason to reject the null hypothesis: we cannot conclude that the average order size differs between the groups.
Nonetheless, the average order size for group B is 3% smaller than that for group A.
To filter the data we will determine the outliers: the IDs of users who placed too many orders or orders that were too expensive. Then we will filter them out of the raw data and calculate the statistical significance the same way as in step 9.
# finding the abnormal users' IDs:
# users that made too many orders
usersWithManyOrders = pd.concat([ordersByUsersA[ordersByUsersA['orders'] > 2]['visitorId'], ordersByUsersB[ordersByUsersB['orders'] > 2]['visitorId']], axis = 0)
#users that paid too much
usersWithExpensiveOrders = orders_filtered.query('revenue > @too_expencive')['visitorId']
#joining the two groups together
abnormalUsers = pd.concat([usersWithManyOrders, usersWithExpensiveOrders], axis = 0).drop_duplicates().sort_values()
print(abnormalUsers.head(5))
print(abnormalUsers.shape) #defining total number of anomalous users.
We received a list of 18 user IDs that either made too many orders (more than 99% of other visitors) or placed orders that were too expensive (paid more than 99% of other visitors).
#filtering out the abnormal users from the raw data:
sampleAFiltered = pd.concat([ordersByUsersA[np.logical_not(ordersByUsersA['visitorId'].isin(abnormalUsers))]['orders'],pd.Series(0, index=np.arange(visits[visits['group']=='A']['visits'].sum() - len(ordersByUsersA['orders'])),name='orders')],axis=0)
sampleBFiltered = pd.concat([ordersByUsersB[np.logical_not(ordersByUsersB['visitorId'].isin(abnormalUsers))]['orders'],pd.Series(0, index=np.arange(visits[visits['group']=='B']['visits'].sum() - len(ordersByUsersB['orders'])),name='orders')],axis=0)
ordersByUsersA.head()
sampleAFiltered.head()
sampleBFiltered
Now we will apply the Mann-Whitney test to the resulting filtered samples.
Keeping the same hypotheses and statistical significance level (0.05):
H0: there is no statistically significant difference in conversion between the groups
H1: the conversion level differs between the groups
alpha = .05
results = st.mannwhitneyu(sampleAFiltered, sampleBFiltered)[1]
print("{0:.3f}".format(results))
if results < alpha:
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis")
We got a result very close to the previous one. For the data with outliers filtered out, we still fail to reject the null hypothesis at the 5% significance level.
Therefore we cannot conclude that the conversion level differs between the groups.
# Calculating the relative difference in conversion for group B vs. group A:
print("{0:.3f}".format(sampleBFiltered.mean()/sampleAFiltered.mean()-1)) #the relative loss of group B
Thus, for the filtered data we can't reject the null hypothesis that there is no statistically significant difference in conversion between the groups. The relative loss of group B is even smaller now, 2%.
Keeping the stated hypotheses and significance level:
H0: the average order size is the same for both groups
H1: the average order size differs between the groups
alpha = .05
results = st.mannwhitneyu(orders_filtered[np.logical_and(
orders_filtered['group']=='A',
np.logical_not(orders_filtered['visitorId'].isin(abnormalUsers)))]['revenue'], orders_filtered[np.logical_and(
orders_filtered['group']=='B',
np.logical_not(orders_filtered['visitorId'].isin(abnormalUsers)))]['revenue'])[1]
print("{0:.3f}".format(results))
if results < alpha:
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis")
Filtering out the outliers has barely changed the outcome here.
#Calculating the relative difference in average order size between the groups (B/A):
print("{0:.3f}".format(
orders_filtered[np.logical_and(orders_filtered['group']=='B',np.logical_not(orders_filtered['visitorId'].isin(abnormalUsers)))]['revenue'].mean()/
orders_filtered[np.logical_and(
orders_filtered['group']=='A',
np.logical_not(orders_filtered['visitorId'].isin(abnormalUsers)))]['revenue'].mean() - 1))
We failed to reject the null hypothesis for the filtered data as well, and the difference between the segments has become even smaller: 2.9% instead of 3.2%.
In the course of the work we have discovered the following facts:
- Neither the raw nor the filtered data revealed any statistically significant differences in conversion between the groups.
- Neither the raw nor the filtered data revealed any statistically significant differences in average order size between the groups.
- The graph showing the difference in average order size between the groups tells us that group B's results got much better at the beginning of the experiment; then, after a period of fluctuations, they stabilized at the same level as group A's. By the end of the experiment they were only about 5% lower than group A's (see the graph here [pic 1](#section_1)).
- The graph showing the difference in conversion between the groups tells us that group B's results are worse and don't seem to be improving significantly (see the graph here [pic 2](#section_2)).
Based on these facts, we can conclude that the test was unsuccessful. The differences in the key metrics of the groups are not statistically significant. We see no use in continuing the experiment, as the calculations and graphs (either this [pic 1](#section_1) or this [pic 2](#section_2)) clearly show that the result has stabilized and the probability that segment B will turn out to be better than segment A is almost nonexistent.