In the course of this project we are going to analyze and optimize the marketing expenses of Yandex.Afisha. We are going to look into data related to site visits, orders and costs from June 2017 through May 2018.
import pandas as pd
import plotly
import plotly.express as px
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")
from scipy import stats as st
import numpy
import seaborn as sns
df_visits = pd.read_csv('./visits_log_us.csv', sep=',')
df_orders = pd.read_csv('./orders_log_us.csv')
df_costs = pd.read_csv('./costs_us.csv')
df_visits.head()
df_visits.info(memory_usage='deep')
df_visits['Device'].value_counts()
#Changing data types
visits = pd.read_csv('./visits_log_us.csv',
dtype={'Device': 'category'},
parse_dates=['Start Ts', 'End Ts'])
visits.info(memory_usage='deep') #checking memory usage now; we have freed up almost 40 MB.
#renaming columns
visits.columns = ['device', 'end_ts', 'source_id', 'start_ts', 'uid']
visits.head()
df_orders.head()
df_orders.info(memory_usage='deep')
#Changing data types
orders = pd.read_csv('./orders_log_us.csv',
parse_dates=['Buy Ts'])
orders.info(memory_usage='deep') #checking memory usage now; we have saved almost a third of the memory used initially.
#renaming columns
orders.columns = ['buy_ts', 'revenue', 'uid']
orders.head()
orders.uid.value_counts()
visits.uid.value_counts()
df_costs.sample(6)
df_costs.info(memory_usage='deep')
#Changing data types
costs = pd.read_csv('./costs_us.csv',
parse_dates=['dt'])
costs.info(memory_usage='deep') #checking memory usage now; we have saved over half of the memory used initially.
#lowercase to all column names of costs df
costs.columns = map(str.lower, costs.columns)
costs.head()
#statistical summary of the costs data
costs.describe()
costs['costs'].median()
fig = px.box(costs, y="costs")
fig.show()
# Drawing plot of costs
plt.figure(figsize=(10,6), dpi= 80)
sns.kdeplot(costs['costs'], shade=True, color="orange", label="costs", alpha=.7)
# Decoration
plt.title('Density Plot of Costs', fontsize=22)
plt.xlabel('Costs amount', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.legend()
plt.show()
Costs are not normally distributed: the distribution has a long right tail and many outliers, and one whisker is longer than the other. The middle 50% of costs lies between 22 and 170, with a standard deviation of 156.3. There are almost no costs over 1,000. The mean is 129.477 and the median is 77.295.
#statistical summary of the revenue data
orders['revenue'].describe()
orders['revenue'].median()
#statistical summary of the orders data
fig = px.box(orders, y="revenue", title='Boxplot of Revenue')
fig.show()
# Drawing plot of revenue
plt.figure(figsize=(8,5), dpi= 80)
sns.kdeplot(orders['revenue'], shade=True, color="purple", label="revenue", alpha=.7)
# Decoration
plt.title('Density Plot of Revenue', fontsize=22)
plt.xlabel('Revenue amount', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.legend()
plt.show()
Revenue is not normally distributed either: we can see a long right tail and many outliers, and one whisker is longer than the other. The middle 50% of revenue is between 1.22 and 4.9, with a standard deviation of 21.82. The mean is 4.999 whereas the median is 2.5.
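To put a number on the skew described above, we can compute the skewness coefficient with scipy.stats (imported above as st); a positive value confirms the long right tail. This is only a cross-check sketch.
#quantifying the right skew of costs and revenue (cross-check sketch)
print('Skewness of costs:', round(st.skew(costs['costs']), 2))
print('Skewness of revenue:', round(st.skew(orders['revenue']), 2))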
(visits.groupby('device')
.agg({'uid': 'count'})
.plot(y='uid', kind='bar', grid=True, figsize=(8,5), cmap='PiYG', title='Popularity of Different Device Types')
)
plt.show()
Afisha seems to be much more popular among users who prefer browsing on desktops.
#preparing data
costs_by_sources = costs.groupby('source_id')['costs'].sum().sort_values().reset_index()
#ploting chart
fig,ax=plt.subplots(figsize=(8,5))
ax.vlines(x=costs_by_sources.source_id, ymin=0, ymax=costs_by_sources.costs, color='purple', alpha=0.7, linewidth=2)
ax.scatter(x=costs_by_sources.source_id, y=costs_by_sources.costs, s=75, color='lightgreen', alpha=0.7)
ax.set_title('Costs Spent on Different Ad Sources', fontdict={'size':16})
ax.set_ylabel('Costs Sum')
ax.set_xticks(costs_by_sources.source_id)
ax.set_xticklabels(costs_by_sources.source_id,rotation=90, fontdict={'horizontalalignment':'right','size':14})
ax.set_xlabel('Ad Sources')
for row in costs_by_sources.itertuples():
    ax.text(row.source_id, row.costs+30, s=round(row.costs,2), rotation=0, fontdict={'horizontalalignment':'center','size':10})
Costs are not evenly distributed among ad sources; the company clearly prefers ad source no. 3 to the others.
Our initial data consists of 3 tables with information on Afisha site visits (data on 228,169 unique users), costs and orders (data on 36,523 unique orders).
The tables have no missing values. Values containing date and time information were converted to datetime format, and the "Device" column was set to the category type. Changing the object data types saved a significant amount of memory.
The distributions of revenue and costs are positively skewed with many outliers.
Afisha has 7 main ad sources; the main one is no. 3, and all the others are lagging behind.
Afisha's visitors clearly prefer desktops to touch devices.
In this part of the project we will calculate the following metrics:
- DAU — the number of daily active (unique) users
- WAU — the number of weekly active users
- MAU — the number of monthly active users
- weekly sticky factor = DAU/WAU
- monthly sticky factor = DAU/MAU
We will also look at how these metrics change over time.
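As a cross-check of the groupby calculations below, the same averages can be obtained by resampling on the visit timestamp. Note that resample uses calendar weeks and months, so the figures may differ slightly from the year/week grouping. A minimal sketch:
#cross-check sketch: average unique users per day/week/month via resample
visits_by_ts = visits.set_index('start_ts')
print('DAU (resample):', int(visits_by_ts.resample('D')['uid'].nunique().mean()))
print('WAU (resample):', int(visits_by_ts.resample('W')['uid'].nunique().mean()))
print('MAU (resample):', int(visits_by_ts.resample('M')['uid'].nunique().mean()))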
visits['visit_year'] = visits['start_ts'].dt.year
visits['visit_month'] = visits['start_ts'].dt.month
visits['visit_week'] = visits['start_ts'].dt.week
visits['visit_date'] = visits['start_ts'].dt.date
visits.head()
dau_total = visits.groupby('visit_date').agg({'uid': 'nunique'}).mean()
wau_total = visits.groupby(['visit_year', 'visit_week']).agg({'uid': 'nunique'}).mean()
mau_total = visits.groupby(['visit_year', 'visit_month']).agg({'uid': 'nunique'}).mean()
print('Average number of daily active (unique) users is', int(dau_total), '.\nAverage number of weekly active (unique) users is', int(wau_total), '. \nAverage number of monthly active users is', int(mau_total), '.')
Now we will have a look at how MAU and WAU changed over time.
#plotting MAU chart
mau = visits.groupby(['visit_year', 'visit_month']).agg({'uid': 'nunique'})
mau.plot(y='uid', kind='bar', grid=True, figsize=(8,5), cmap='PiYG', title='Number of Monthly Active Users')
plt.show()
MAU shows strong seasonal fluctuations: the low season is in summer, with August the least active month, and the high season is in fall-winter, with peaks starting in October and ending in December. That can be explained by the fact that theater tickets are barely sold during the summer months, as the theater season usually ends in May and begins again in September-October. We can assume that theater tickets are the main product that visitors buy on the Afisha site. Moreover, the population of large cities (Afisha's main audience) tends to spend the summer months outside the cities and is consequently less active.
#plotting WAU chart
(visits.groupby(['visit_year', 'visit_week'])
.agg({'uid': 'count'})
.plot(y='uid', kind='bar', grid=True, figsize=(20,5), cmap='PiYG', title='Number of Weekly Active Users')
)
plt.show()
The WAU metric also demonstrates seasonal fluctuations. We can also see peaks of activity in 2017 (weeks 23, 28, 40 and 47) and in 2018 (week 12). That can be explained by the beginning of school vacations and the opening of the theater season.
#calculating average values for the whole period 2017-2018.
dau_total = visits.groupby('visit_date').agg({'uid': 'nunique'}).mean()
wau_total = visits.groupby(['visit_year', 'visit_week']).agg({'uid': 'nunique'}).mean()
mau_total = visits.groupby(['visit_year', 'visit_month']).agg({'uid': 'nunique'}).mean()
print('Average number of daily active (unique) users is', int(dau_total), '.\nAverage number of weekly active (unique) users is', int(wau_total), '. \nAverage number of monthly active users is', int(mau_total), '.')
#calculating average values for 2017.
dau_2017 = visits.query('visit_year==2017').groupby('visit_date').agg({'uid': 'nunique'}).mean()
wau_2017 = visits.query('visit_year==2017').groupby(['visit_year', 'visit_week']).agg({'uid': 'nunique'}).mean()
mau_2017 = visits.query('visit_year==2017').groupby(['visit_year', 'visit_month']).agg({'uid': 'nunique'}).mean()
print('For 2017:\nAverage number of daily active (unique) users is', int(dau_2017), '.\nAverage number of weekly active (unique) users is', int(wau_2017), '. \nAverage number of monthly active users is', int(mau_2017), '.')
#calculating average values for 2018.
dau_2018 = visits.query('visit_year==2018').groupby('visit_date').agg({'uid': 'nunique'}).mean()
wau_2018 = visits.query('visit_year==2018').groupby(['visit_year', 'visit_week']).agg({'uid': 'nunique'}).mean()
mau_2018 = visits.query('visit_year==2018').groupby(['visit_year', 'visit_month']).agg({'uid': 'nunique'}).mean()
print('For 2018:\nAverage number of daily active (unique) users is', int(dau_2018), '.\nAverage number of weekly active (unique) users is', int(wau_2018), '. \nAverage number of monthly active users is', int(mau_2018), '.')
All 3 metrics show growth through 2017-2018, which means the marketing efforts are paying off.
Below we will calculate the sticky factor, which characterizes audience retention and shows the probability that a newly acquired user stays with the product, i.e. how well the product retains or "hooks" users.
#calculating sticky factors:
sticky_wau = dau_total/wau_total*100
print('Sticky factor for a week is', int(sticky_wau), '%.')
This means that for every new user brought to the site, Afisha has roughly a 15% chance of turning them into a daily user.
sticky_mau = dau_total/mau_total*100
print('Sticky factor for a month is', int(sticky_mau), '%.')
sticky_mau_2017 = dau_2017/mau_2017*100
print('Sticky factor for a month in 2017 is', int(sticky_mau_2017), '%.')
sticky_mau_2018 = dau_2018/mau_2018*100
print('Sticky factor for a month in 2018 is', int(sticky_mau_2018), '%.')
Although both DAU and WAU grew through the year, the sticky factor stayed at around 3%. This is close to the lowest value this metric can take, and it shows that users do not stay on the site: its retention rate is most likely low, and it does not have the user base required to generate income. Research shows a correlation between the sticky factor and income, which supports the idea that this metric influences a product's financial metrics.
In order to increase the sticky factor, the following can help:
- better targeting of the audience,
- valuable and up-to-date content that encourages users to come back,
- a convenient interface with useful instruments and features (better UX),
- a convenient app,
- reminders about the product and useful updates (e-mails or push notifications).
daily_sessions = visits.groupby('visit_date').agg({'start_ts': 'count'}).describe()
daily_sessions
daily_sessions["start_ts"].median()
fig = px.box(daily_sessions, y="start_ts", title='Boxplot of daily sessions')
fig.show()
daily_sessions = visits.groupby('visit_date').agg({'start_ts': 'count'}).reset_index()
daily_sessions
#visualizing the results
fig = px.histogram(daily_sessions, x="start_ts", title='Distribution of daily sessions number')
fig.show()
On average the site has 987 visits per day. The middle 50% is between 636 and 1,280 visits per day; the median is 811 visits, with a standard deviation of 419. The distribution is not normal: the whiskers are not equal and there is a right tail with some outliers. Most of the distribution falls between roughly 300-400 and 1,500-1,600 visits per day.
We will have a look at the visits data over time.
# Get the Peaks and Troughs
data = daily_sessions['start_ts'].values
doublediff = np.diff(np.sign(np.diff(data)))
peak_locations = np.where(doublediff == -2)[0] + 1
doublediff2 = np.diff(np.sign(np.diff(-1*data)))
trough_locations = np.where(doublediff2 == -2)[0] + 1
# Draw Plot
plt.figure(figsize=(16,10), dpi= 80)
plt.plot('visit_date', 'start_ts', data=daily_sessions, color='tab:blue', label='Daily Traffic')
plt.scatter(daily_sessions.visit_date[peak_locations], daily_sessions.start_ts[peak_locations], marker=mpl.markers.CARETUPBASE, color='tab:green', s=100, label='Peaks')
plt.scatter(daily_sessions.visit_date[trough_locations], daily_sessions.start_ts[trough_locations], marker=mpl.markers.CARETDOWNBASE, color='tab:red', s=100, label='Troughs')
# Annotate
for t, p in zip(trough_locations[1::5], peak_locations[::3]):
    plt.text(daily_sessions.visit_date[p], daily_sessions.start_ts[p]+300, daily_sessions.visit_date[p], horizontalalignment='center', color='darkgreen')
    plt.text(daily_sessions.visit_date[t], daily_sessions.start_ts[t]-300, daily_sessions.visit_date[t], horizontalalignment='center', color='darkred')
# Decoration
plt.ylim(1,4100)
#xtick_location = daily_sessions.index.tolist()[::2]
#xtick_labels = daily_sessions.visit_date.tolist()[::2]
#plt.xticks(ticks=xtick_location, labels=xtick_labels, rotation=90, fontsize=12, alpha=.7)
plt.title("Daily Visits 2017-2018", fontsize=22)
plt.yticks(fontsize=10, alpha=.7)
# Lighten borders
plt.gca().spines["top"].set_alpha(.0)
plt.gca().spines["bottom"].set_alpha(.3)
plt.gca().spines["right"].set_alpha(.0)
plt.gca().spines["left"].set_alpha(.3)
plt.legend(loc='upper right')
plt.grid(axis='y', alpha=.3)
plt.show()
The dynamics of daily sessions confirm that users are more active during the autumn and winter periods. Afisha should probably add some additional products and offers to raise summer activity on the site.
Below we will check the daily dynamics for 2 types of devices, desktop and touch.
#Grouping the data and creating a pivot table for plotting the chart.
daily_sessions_device = visits.groupby(['visit_date', 'device']).agg({'start_ts': 'count'}).reset_index()
daily_sessions_device = daily_sessions_device.rename(columns={'start_ts': 'user_visits'})
device_pivot = daily_sessions_device.pivot(index='visit_date',columns='device',values='user_visits')
device_pivot.head()
device_pivot['desktop'].mean()
device_pivot['touch'].mean()
plt.rcParams["figure.figsize"] = (18, 13)
ax = device_pivot.plot(title='Dynamics of daily sessions on different device types') #plotting both device columns on one axis
ax.set_ylabel('Number of sessions')
ax.set_xlabel('Dates')
ax.legend(ncol=2, loc='upper left')
plt.show()
plt.figure(figsize=(23,3))
sns.heatmap(device_pivot.T, cmap="RdBu_r")
plt.title("Heatmap of daily sessions activity on different divice types", fontsize =16)
plt.show()
The number of visits through touch devices is significantly lower than through desktops, as we saw previously. General trends (such as seasonal peaks and troughs) are common for both device types, although touch users show less fluctuation over the period.
print(device_pivot.query('touch==890'))
We checked the day with the maximum number of sessions (24.11.2017) but did not find any specific reason for such abnormal activity in the news archive. There may be many reasons, including a failure in the site logging system, a special ad campaign run by Afisha, or a mention somewhere prominent. As this data does not have any significant impact on the purpose of our analysis, we leave it out.
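One way to flag such spikes systematically is to compare each day with a rolling median; the sketch below marks a day as anomalous if it deviates from the 30-day rolling median by more than three rolling standard deviations (the window and threshold are assumptions, not values derived from the data).
#sketch: flagging days whose session count deviates strongly from a 30-day rolling median
daily_counts = daily_sessions.set_index('visit_date')['start_ts']
rolling_median = daily_counts.rolling(30, min_periods=7).median()
rolling_std = daily_counts.rolling(30, min_periods=7).std()
anomalous_days = daily_counts[(daily_counts - rolling_median).abs() > 3 * rolling_std]
print(anomalous_days)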
visits['session_length_sec'] = (visits['end_ts'] - visits['start_ts']).dt.seconds
visits['session_length_sec'].describe()
#calculating an average session length, ASL
print(visits['session_length_sec'].median())
visits['session_length_sec'].mode()
visits['session_length_sec'].hist(bins=30, log=True)
plt.title('Distribution of session length')
plt.show()
The middle 50% of sessions last from 120 to 840 sec. The majority of sessions are up to 2,000-3,000 sec long. Still, the histogram shows some outliers with sessions over 40,000 and even 80,000 sec.
As customer behavior differs between touch devices and desktops, we will check this parameter for the two device types separately.
visits_touch = visits.query('device=="touch"')
visits_touch['session_length_sec'].describe()
visits_desktop = visits.query('device=="desktop"')
visits_desktop['session_length_sec'].describe()
fig = px.box(visits, x="device", y="session_length_sec", color="device", title='Boxplot of sessions length for two device types')
fig.show()
The range of session length is quite wide: from 0 to 84,480 seconds per session. The average session length (ASL) regardless of device type is 643.5 sec; the median is 300 seconds. As the distribution is positively skewed with a long right tail, the median is more informative. The standard deviation is 1,016.33. The middle 50% of sessions last from 120 to 840 sec.
The median for touch devices is shorter by a third, and among desktop users there are far more who spend over 2,040 sec on the site (the upper fence). Either the app is more convenient (so tasks take less time) or the site is simply more engaging.
Most sessions are very short. There are almost no sessions over 5,000 sec (about 1.4 hours). We can assume that users come for specific information and that this time is enough to check the event they need and order tickets.
Presumably this means the UX is good. On the other hand, it can be a signal that the site or app does not hook users with additional info or offers.
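The upper fence mentioned above can be reproduced directly from the quartiles with the standard Tukey rule (upper fence = Q3 + 1.5*IQR). A short sketch per device type:
#sketch: Tukey upper fence (Q3 + 1.5*IQR) of session length per device type
for device_type in ['desktop', 'touch']:
    lengths = visits.loc[visits['device'] == device_type, 'session_length_sec']
    q1, q3 = lengths.quantile(0.25), lengths.quantile(0.75)
    print(device_type, 'upper fence:', q3 + 1.5 * (q3 - q1), 'sec')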
#defining first activity date per user:
visits['start_ts'] = pd.to_datetime(visits['start_ts'])
first_activity_date = visits.groupby(['uid'])['start_ts'].min()
first_activity_date.name = 'first_activity_date'
user_activity = visits.join(first_activity_date,on='uid')
user_activity['first_activity_date'] = pd.to_datetime(user_activity['first_activity_date'])
user_activity.head()
#defining cohort lifetime (months)
user_activity['start_ts'] = user_activity['start_ts'].astype('datetime64[M]')
user_activity['activity_month'] = pd.to_datetime(user_activity['start_ts'], unit='d') - pd.to_timedelta(user_activity['start_ts'].dt.dayofweek, unit='d')
user_activity['first_activity_month'] = pd.to_datetime(user_activity['first_activity_date'], unit='d') - pd.to_timedelta(user_activity['first_activity_date'].dt.dayofweek, unit='d')
user_activity['first_activity_month'] = user_activity['first_activity_month'].astype('datetime64[M]')
user_activity['cohort_lifetime'] = user_activity['activity_month'] - user_activity['first_activity_month']
user_activity['cohort_lifetime'] = user_activity['cohort_lifetime'] / np.timedelta64(1,'M')
user_activity['cohort_lifetime'] = user_activity['cohort_lifetime'].round().astype(int)
user_activity.head()
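A more direct equivalent of the month arithmetic above is to count whole calendar months between the first and the current activity month. The sketch below cross-checks the two approaches; the values may differ in some edge cases because the calculation above first snaps dates to week starts.
#sketch: cohort lifetime as whole calendar months between first and current activity
months_current = user_activity['start_ts'].dt.year * 12 + user_activity['start_ts'].dt.month
months_first = user_activity['first_activity_date'].dt.year * 12 + user_activity['first_activity_date'].dt.month
user_activity['cohort_lifetime_check'] = months_current - months_first
print((user_activity['cohort_lifetime_check'] == user_activity['cohort_lifetime']).mean()) #share of matching rows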
#checking calculations
negative_lifetime = user_activity.query('cohort_lifetime<0')
negative_lifetime.info()
#grouping in cohorts
cohorts = user_activity.groupby(['first_activity_month','cohort_lifetime']).agg({'uid':'nunique'}).reset_index()
initial_users_count = cohorts[cohorts['cohort_lifetime'] == 0][['first_activity_month','uid']]
initial_users_count = initial_users_count.rename(columns={'uid':'cohort_users'})
cohorts = cohorts.merge(initial_users_count,on='first_activity_month')
cohorts['retention'] = cohorts['uid']/cohorts['cohort_users']
cohorts.head()
cohorts_2017 = cohorts[cohorts['first_activity_month'].dt.year == 2017].reset_index()
#heatmap for 2017
retention_pivot_2017 = cohorts_2017.pivot_table(index='first_activity_month',columns='cohort_lifetime',values='retention',aggfunc='sum')
sns.set(style='white')
plt.figure(figsize=(13, 9))
plt.title('Cohorts: User Retention Rate for 2017')
sns.heatmap(retention_pivot_2017, annot=True, fmt='.1%', linewidths=1, linecolor='k');
The monthly retention rate shows that in 2017 Afisha retained between 6% and 23% of users one month after their first visit. The highest retention rates were among users that came in August and October (22% and 24% respectively). In the following months their retention rate fell dramatically (to 7% and 6%) and remained low. The lowest retention rate was among users that came in December (6%). Retention was at the level of 2-4% for all cohorts by the end of the period. These values are not high.
#for year 2018
cohorts_2018 = cohorts[cohorts['first_activity_month'].dt.year == 2018].reset_index()
retention_pivot_2018 = cohorts_2018.pivot_table(index='first_activity_month',columns='cohort_lifetime',values='retention',aggfunc='sum')
sns.set(style='white')
plt.figure(figsize=(13, 9))
plt.title('Cohorts: User Retention Rate for 2018')
sns.heatmap(retention_pivot_2018, annot=True, fmt='.1%', linewidths=1, linecolor='k');
#defining median value of retention rate (2017):
cohorts_2017['retention'].median()
#defining median value of retention rate (2018):
cohorts_2018['retention'].median()
In 2018 the highest retention rates were among users that came in January, February and April (21%, 21% and 19% respectively), followed by a sharp decline as well. The March cohort looks weak, with only 4% of visitors coming back in the first month. Such a situation (just like in December 2017) might be caused by a poorly targeted marketing campaign that brought an untargeted audience to the site.
Afisha's visitors show strong seasonal fluctuations, with a low season in summer and a high season in fall-winter, with peaks starting in October and ending in December. We can see that by analysing the MAU, DAU and WAU values.
We can also see that the number of Afisha's visits grows over time: MAU grew by 17% in 2018 compared to 2017. The sticky factor is relatively low, which shows that users do not stay on the site.
The typical number of sessions per day is 811 (median), with the middle 50% between 636 and 1,280 visits per day.
The number of visits via touch devices is significantly lower than via desktop ones: 266 vs 721.
General trends (such as seasonal peaks and troughs) are common for both device types, although touch users show less fluctuation over the period.
The average session length (ASL) regardless of device type is 643.5 sec. The median session length for touch devices is shorter than for desktop (180 sec vs 360 sec). The distribution of desktop sessions has a longer right tail with more outliers.
We can see that Afisha's customers prefer desktop devices and stay on the site longer than with the app.
The average retention rate is not high but grew from 5% in 2017 to 19% in 2018. In 2017 the highest retention rates were among users that came in August and October (every 5th user came back the next month). The lowest retention rate was among users that came in December (only 6% came back the next month).
In 2018 the most loyal were the January, February and April cohorts. Cohorts of both years show a retention rate of 2-3% by the end of their lifetime.
In general, almost every 5th visitor who comes to the Afisha site tends to come back within a month. However, in the following month the retention rate falls to 4-7%, which is quite a drastic drop.
Afisha might think of marketing campaigns and offers to keep visitors coming back over a longer period of time, as it is easier and cheaper to bring a visitor back than to acquire a new one.
We are going to join the data from two tables (visits and orders) to determine the time between the first visit and the first order, breaking the results down by ad source id and by cohort.
visits_for_merge = visits[['uid', 'start_ts', 'end_ts', 'source_id', 'device']]
visits_for_merge.head()
visits['uid'].nunique()
orders_for_merge = orders[['uid', 'buy_ts', 'revenue']]
orders_for_merge.head()
orders_for_merge['uid'].nunique()
#merging the tables and dropping duplicates
orders_visits = pd.merge(orders_for_merge, visits_for_merge.drop_duplicates(subset=['uid']),how='inner').query('buy_ts>start_ts')
orders_visits
#defining first order date
orders_visits['buy_ts']=pd.to_datetime(orders_visits['buy_ts'])
first_order_date_by_customers = orders_visits.groupby('uid')['buy_ts'].min()
first_order_date_by_customers.name = 'first_order_date_by_customers'
#first_order_date_by_customers
orders_visits = orders_visits.join(first_order_date_by_customers,on='uid')
orders_visits['first_order_date'] = orders_visits['first_order_date_by_customers'].values.astype('datetime64[D]')
#defining first visit date
orders_visits['start_ts']=pd.to_datetime(orders_visits['start_ts'])
first_visit_date = orders_visits.groupby('uid')['start_ts'].min()
first_activity_date = orders_visits.groupby(['uid'])['start_ts'].min()
first_activity_date.name = 'first_activity_date'
orders_visits = orders_visits.join(first_activity_date,on='uid')
orders_visits['first_activity_date'] = orders_visits['first_activity_date'].values.astype('datetime64[D]')
#calculating how many days it takes to order after the first visit
orders_visits['time_to_conversion'] = orders_visits['first_order_date'] - orders_visits['first_activity_date']
orders_visits['time_to_conversion'] = (orders_visits['time_to_conversion']/ np.timedelta64(1,'D')).astype(int)
orders_visits.head()
orders_visits['time_to_conversion'].describe()
orders_visits['time_to_conversion'].value_counts()
#adding a column with category of conversion
def conversion(days):
    if days == 0:
        return '1_conversion_0d'
    if days > 0 and days <= 7:
        return '2_conversion_7d'
    if days > 7 and days <= 14:
        return '3_conversion_14d'
    if days > 14 and days <= 21:
        return '4_conversion_21d'
    if days > 21 and days <= 30:
        return '5_conversion_30d'
    if days > 30 and days <= 60:
        return '6_conversion_2nd_m'
    return '7_over_2months'
orders_visits['conversion_category'] = orders_visits['time_to_conversion'].apply(conversion)
orders_visits['conversion_category'].value_counts()
In order to group by cohorts we will define cohort lifetime:
#defining cohort lifetime (months)
orders_visits['start_ts'] = orders_visits['start_ts'].astype('datetime64[M]')
orders_visits['activity_month'] = pd.to_datetime(orders_visits['start_ts'], unit='d') - pd.to_timedelta(orders_visits['start_ts'].dt.dayofweek, unit='d')
orders_visits['first_activity_month'] = pd.to_datetime(orders_visits['first_activity_date'], unit='d') - pd.to_timedelta(orders_visits['first_activity_date'].dt.dayofweek, unit='d')
orders_visits['first_activity_month'] = orders_visits['first_activity_month'].astype('datetime64[M]')
orders_visits['cohort_lifetime'] = orders_visits['activity_month'] - orders_visits['first_activity_month']
orders_visits['cohort_lifetime'] = orders_visits['cohort_lifetime'] / np.timedelta64(1,'M')
orders_visits['cohort_lifetime'] = orders_visits['cohort_lifetime'].round().astype(int)
#grouping in cohorts
cohorts_conversion = orders_visits.groupby(['first_activity_month','cohort_lifetime']).agg({'uid':'nunique', 'time_to_conversion':'mean'}).reset_index()
initial_users_count1 = cohorts_conversion[cohorts_conversion['cohort_lifetime'] == 0][['first_activity_month','uid']]
initial_users_count1 = initial_users_count1.rename(columns={'uid':'cohort_users'})
cohorts_conversion = cohorts_conversion.merge(initial_users_count1,on='first_activity_month')
cohorts_conversion.head()
pivot_conversion_general = orders_visits.pivot_table(index='first_activity_month',columns='conversion_category',values='uid',aggfunc='nunique')
sns.set(style='white')
plt.figure(figsize=(20, 15))
plt.title('Cohorts: User convertion time')
sns.heatmap(pivot_conversion_general, annot=True, fmt='g', linewidths=1, linecolor='grey');
Conversion on the day of the first visit is the most common scenario among visitors who buy something on the Afisha site. The largest number of 'd0'-converted visitors came in October, December and January.
Fewer people buy during the week after their first visit. The most active cohorts remained October, December and January, plus November.
The number of people who buy during the following weeks of the first month is much lower and keeps decreasing towards the end of the month. During the 2nd month visitors bought approximately as much as during the last two weeks of the 1st month. Beyond the 2nd month after the first visit, purchases came mostly from the June, July, September, October and November cohorts.
It looks like the ad campaigns that brought the October, November and December cohorts were the best targeted ones.
Next we will look at conversion time by ad source.
First we will analyse the average conversion time per ad source.
conversion_grouped_by_sourses = orders_visits.groupby(['source_id']).agg({'time_to_conversion':'mean'}).sort_values(by='time_to_conversion', ascending=True).reset_index()
conversion_grouped_by_sourses['source_id'] = conversion_grouped_by_sourses['source_id'].astype(str)
conversion_grouped_by_sourses.reset_index(drop=True)
#plotting a barchart for conversion time by ad sources, days
plt.figure(figsize=(12,8))
# plot barh chart with index as x values
ax = sns.barplot(conversion_grouped_by_sourses.index, conversion_grouped_by_sourses.time_to_conversion, palette="gist_heat")
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.set(xlabel="source_id", ylabel='time_to_conversion')
ax.set_xticklabels(conversion_grouped_by_sourses.source_id)
for item in ax.get_xticklabels(): item.set_rotation(90)
for i, v in enumerate(conversion_grouped_by_sourses["time_to_conversion"].iteritems()):
    ax.text(i, v[1], "{:,}".format(v[1]), color='k', va='bottom', rotation=45)
plt.tight_layout()
plt.show()
On average, visitors that came from ad sources no. 10, 1 and 5 make a purchase in 9-10 days; those from ad sources no. 2, 3 and 4 make a purchase in 12-14 days; and it takes the longest for those who came from ad source no. 9 (almost a month).
Now we will analyse how many visitors from different sources converted within certain time periods.
#grouping by ad source
pivot_conversion_source = orders_visits.pivot_table(index='source_id',columns='conversion_category',values='uid',aggfunc='nunique')
sns.set(style='white')
plt.figure(figsize=(20, 9))
plt.title('Convertion time by different ad sources')
sns.heatmap(pivot_conversion_source, annot=True, fmt='g', linewidths=1, linecolor='grey');
Ad sources no. 4, 3 and 5 brought the most visitors that made orders on the day of their first visit. A large share of these visitors kept converting during the following week and were still active even two months after their first visit.
Ad sources no. 1 and 2 look equally effective. Visitors from these sources also keep buying during the week after their first visit. After the 2nd month, visitors from source no. 2 were twice as active as those from source no. 1.
The least effective ad source is no. 9, though a relatively big share of visitors from this source kept converting during the following periods.
Ad source no. 7 brought only 1 conversion.
Next we will check how soon the users make an order during one session.
#merging the tables to get info only on visits with orders to look at them in details.
merged = orders_for_merge.merge(visits_for_merge, on='uid', how='inner').query('start_ts<=buy_ts<=end_ts')
merged
We lost 596 entries during the merge compared with the size of the 'orders_for_merge' dataframe. That might mean that either the visits table does not contain all the data on visits with orders, or some rows had entry errors and the query condition was not met. We will perform the calculations with the valid data we have.
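Before leaving the dropped orders out, they can be inspected with an anti-join against the merged table; the indicator column marks the orders that found no matching session. A sketch:
#sketch: inspecting orders that did not survive the merge above (anti-join on uid and buy_ts)
lost_orders = orders_for_merge.merge(merged[['uid', 'buy_ts']].drop_duplicates(),
                                     on=['uid', 'buy_ts'], how='left', indicator=True)
lost_orders = lost_orders[lost_orders['_merge'] == 'left_only']
print(len(lost_orders), 'orders have no matching session')
lost_orders.head()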
#defining the date of the first activity
merged['start_ts'] = pd.to_datetime(merged['start_ts'])
merged['buy_ts'] = pd.to_datetime(merged['buy_ts'])
merged['end_ts'] = pd.to_datetime(merged['end_ts'])
first_activity_date1 = merged.groupby(['uid'])['start_ts'].min()
first_activity_date1.name = 'first_activity_date'
user_activity1 = merged.join(first_activity_date1,on='uid')
user_activity1['first_activity_date'] = pd.to_datetime(user_activity1['first_activity_date'])
user_activity1.head()
#defining date and time of the first order
first_order_date = merged.groupby(['uid'])['buy_ts'].min()
first_order_date.name = 'first_order_date'
user_activity2 = user_activity1.join(first_order_date,on='uid')
user_activity2['first_order_date'] = pd.to_datetime(user_activity2['first_order_date'])
user_activity2.head()
#defining time to conversion, minutes
user_activity2['time_to_conversion'] = user_activity2['first_order_date'] - user_activity2['first_activity_date']
user_activity2['time_to_conversion'] = (user_activity2['time_to_conversion']/ np.timedelta64(1,'m')).astype(int)
user_activity2.head()
fig = px.box(user_activity2, y="time_to_conversion", title='Boxplot of time to conversion, min.')
fig.show()
user_activity2['time_to_conversion'].describe()
On average, users buy something on the site within about 5 minutes (the median is 5 minutes). The standard deviation is 14.9. Most of the distribution lies within the fences of 0-29 minutes; the middle 50% of users take between 2 and 13 minutes to convert. The distribution is not normal: it is positively skewed with quite a lot of outliers.
coversion_time_by_sources = user_activity2.groupby(['source_id']).agg({'time_to_conversion':'mean'}).sort_values(by='time_to_conversion', ascending=True).reset_index()
coversion_time_by_sources
#plotting a barchart for conversion time by ad sources, minutes
plt.figure(figsize=(12,8))
# plot barh chart with index as x values
ax = sns.barplot(coversion_time_by_sources.index, coversion_time_by_sources.time_to_conversion, palette="gist_heat")
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.set(xlabel="source_id", ylabel='time_to_conversion')
ax.set_xticklabels(coversion_time_by_sources.source_id)
for item in ax.get_xticklabels(): item.set_rotation(90)
for i, v in enumerate(coversion_time_by_sources["time_to_conversion"].iteritems()):
    ax.text(i, v[1], "{:,}".format(v[1]), color='k', va='bottom', rotation=45)
plt.tight_layout()
plt.title('Time to conversion by ad sources, minutes')
plt.show()
Users coming from the different sources usually make a purchase within 8-12 minutes. Only the user who came from source no. 7 took 52 minutes to convert (we have already seen that there was only one visitor from this ad source). The marketing department should pay attention to this ad source.
Now we will check ad source efficiency for the different device types.
#Creating a table with orders split by devices and ad sources.
device_sources = orders_visits.groupby(['source_id','device']).agg({'buy_ts':'count'}).reset_index()
device_sources = device_sources.rename(columns={'buy_ts': 'user_orders'})
device_sources_pivot = device_sources.pivot(index='source_id',columns='device',values='user_orders')
device_sources_pivot.head()
#pie charts for source id shares
from plotly.subplots import make_subplots
import plotly.graph_objects as go
labels = ['1', '2', '3', '4', '5', '7', '9', '10']
value1 = device_sources_pivot['desktop']
value2 = device_sources_pivot['touch']
fig = make_subplots(2, 2, specs=[[{'type':'domain'}, {'type':'domain'}],
[{'type':'domain'}, {'type':'domain'}]],
subplot_titles=['desktop', 'touch'])
fig.add_trace(go.Pie(labels=labels, values=value1, scalegroup='one',
name="desktop"), 1, 1)
fig.add_trace(go.Pie(labels=labels, values=value2, scalegroup='one',
name="touch"), 1, 2)
fig.update_layout(title_text='Shares of Ad Sources by Orders')
fig.show()
For desktops, ad sources no. 3 and 4 are equally effective and together bring half of all converted visitors. Source no. 5 is the 3rd most popular and brought 19% of orders.
For touch devices, source no. 4 turned out to be the most effective: it brought every 3rd order to the site. The next most popular and effective source accounts for 20% of orders. 16% and 15% of orders came from sources no. 5 and no. 2 respectively. Source no. 1 accounts for 13% of orders, almost the same share as it has on desktops.
Generally the most effective ad sources work equally well for both device types, and the least effective ones (no. 10, 9 and 7) perform poorly for both desktop and touch.
#calculating number of orders by ad source
pivot_orders_source = orders_visits.pivot_table(index='source_id',columns='conversion_category',values='buy_ts',aggfunc='count')
sns.set(style='white')
plt.figure(figsize=(20, 9))
plt.title('Number of orders by ad source')
sns.heatmap(pivot_orders_source, annot=True, fmt='g', linewidths=1, linecolor='grey');
Sources no. 3 and 4 are the most effective, with the largest number of orders made on the 1st day after the site visit (6,992 and 6,584 orders). Source no. 5 is also among the leaders with 5,394 first-day orders.
We will calculate the average revenue per order for each cohort and for each ad source. We will also calculate the total revenue per ad source to check their efficiency.
#calculating an average revenue per order per each cohort
pivot_revenue_cohort = orders_visits.pivot_table(index='first_activity_month',columns='conversion_category',values='revenue',aggfunc='mean')
sns.set(style='white')
plt.figure(figsize=(20, 9))
plt.title('Average purchase size by cohorts')
sns.heatmap(pivot_revenue_cohort, annot=True, fmt='g', linewidths=1, linecolor='grey');
The figures show that visitors do not tend to place big orders immediately: purchase sizes on the first day of the visit are quite moderate. Orders get bigger with time and reach their maximum on the 3rd and 4th weeks after the first visit. The biggest average revenue came from the May cohort on the 3rd week and from orders made after the 2nd month. The June cohort is also among the leaders with orders made on the 3rd and 4th weeks.
#calculating an average revenue per order per each ad source
pivot_revenue_source = orders_visits.pivot_table(index='source_id',columns='conversion_category',values='revenue',aggfunc='mean')
sns.set(style='white')
plt.figure(figsize=(20, 9))
plt.title('Average purchase size by ad source')
sns.heatmap(pivot_revenue_source, annot=True, fmt='g', linewidths=1, linecolor='grey');
Here the figures show that sources no. 1, 2, 4 and 5 bring the highest average revenue per order. The numbers also reach their maximum by the end of the 1st month, where sources no. 1 and 4 are the leaders. Source no. 9 brought good orders on the 2nd week. Still, the biggest average revenue was brought by source no. 1.
#combining the split by cohorts and by sources
pivot_revenue_source_cohort = orders_visits.pivot_table(index='first_activity_month',columns='source_id',values='revenue',aggfunc='mean')
sns.set(style='white')
plt.figure(figsize=(20, 9))
plt.title('Average purchase size by ad source and by cohort')
sns.heatmap(pivot_revenue_source_cohort, annot=True, fmt='g', linewidths=1, linecolor='grey');
Combining the two approaches, the table above shows that source no. 2 brought the biggest average revenue from the May and March cohorts. This source was also effective for the July and October cohorts.
Source no. 1 did not have such bright numbers but worked well for several cohorts: August, November, June and May. For the June cohort, ad sources no. 4 and 1 also worked effectively.
Ad source no. 9 brought good numbers for the September and April cohorts.
We are going to calculate the total revenue split by ad source and then, based on these figures, estimate revenue per customer (LTV).
#calculating total revenue per ad source
pivot_revenue_source = orders_visits.pivot_table(index='source_id',columns='conversion_category',values='revenue',aggfunc='sum')
sns.set(style='white')
plt.figure(figsize=(20, 9))
plt.title('Total revenue by ad source')
sns.heatmap(pivot_revenue_source, annot=True, fmt='g', linewidths=1, linecolor='grey');
The most effective sources in terms of total revenue were no. 4, 3 and 5; they brought the most revenue during the 1st day. During the 1st week after conversion the most effective were sources no. 4 and 1. Source no. 4 kept being the most effective even after the 2nd month.
We will also add a table with revenue per month per source to use in further calculations.
#combining the split by cohorts and by sources
pivot_revenue_source_cohort = orders_visits.pivot_table(index='first_activity_month',columns='source_id',values='revenue',aggfunc='sum')
sns.set(style='white')
plt.figure(figsize=(20, 9))
plt.title('Total revenue by ad source and by cohort')
sns.heatmap(pivot_revenue_source_cohort, annot=True, fmt='g', linewidths=1, linecolor='grey');
Here the most profitable turned out to be sources no. 4 (cohorts of October, November, December and February) and no. 3 (cohorts of October, November and December).
orders_visits['order_month'] = orders_visits['buy_ts'].astype('datetime64[M]') #column with order month
pivot_revenue_source_monthly = orders_visits.pivot_table(index='source_id',columns='order_month',values='revenue',aggfunc='sum')
sns.set(style='white')
plt.figure(figsize=(20, 9))
plt.title('Total revenue by ad source monthly')
sns.heatmap(pivot_revenue_source_monthly, annot=True, fmt='g', linewidths=1, linecolor='grey');
The most lucrative sources were no. 4 and 3, while the least effective were no. 9 and 10. As for the months, March and October brought the largest amount of revenue.
#Creating a table with revenue split by devices and ad sources.
device_revenue = orders_visits.groupby(['source_id','device']).agg({'revenue':'sum'}).reset_index()
device_revenue['revenue'] = device_revenue['revenue'].fillna(0)
device_revenue_pivot = device_revenue.pivot(index='source_id',columns='device',values='revenue')
device_revenue_pivot.head()
#pie charts for source id shares
labels = ['1', '2', '3', '4', '5', '7', '9', '10']
value1 = device_revenue_pivot['desktop']
value2 = device_revenue_pivot['touch']
fig = make_subplots(2, 2, specs=[[{'type':'domain'}, {'type':'domain'}],
[{'type':'domain'}, {'type':'domain'}]],
subplot_titles=['desktop', 'touch'])
fig.add_trace(go.Pie(labels=labels, values=value1, scalegroup='one',
name="desktop"), 1, 1)
fig.add_trace(go.Pie(labels=labels, values=value2, scalegroup='one',
name="touch"), 1, 2)
fig.update_layout(title_text='Shares of Ad Sources by Revenue')
fig.show()
For desktops, the most lucrative ad sources are no. 4 and 3; together they account for 45% of revenue. Sources no. 5 and 2 are slightly less effective but still brought 37% of revenue.
For touch devices, source no. 4 is the most effective: it brought about a third of the revenue. Sources no. 2 and 3 together brought 36% of revenue.
The least effective sources for both device types are no. 7, 9 and 10.
#Calculating LTV by dividing revenue by the number of new customers for each conversion group:
LTV = pivot_revenue_source / pivot_conversion_source #pivot_conversion_source = number of unique visitors
#Calculating LTV by dividing revenue by the number of new customers per month
pivot_conversion_source_monthly = orders_visits.pivot_table(index='source_id',columns='order_month',values='uid',aggfunc='nunique')
LTV_monthly = pivot_revenue_source_monthly / pivot_conversion_source_monthly
sns.set(style='white')
plt.figure(figsize=(20, 9))
plt.title('LTV by ad source monthly')
sns.heatmap(LTV_monthly, annot=True, fmt='g', linewidths=1, linecolor='grey');
The calculations show that ad sources no. 1 and 2 bring the customers with the most revenue; their customers bring the most revenue per capita during all the spring months. For source no. 2 the autumn months are also very effective, especially October. Sources no. 1, 2 and 4 are more lucrative than the rest but still fare worse than no. 9. Source no. 10 brings the least revenue per customer.
Having merged the visits table with the orders table, we found that it takes a visitor from 0 to 295 days to convert after their first visit. If we split the visitors by cohorts, we can see that most of them convert on the same day they visit the site. The largest number of 'd0'-converted visitors came in October, December and January; these cohorts look like the best targeted ones.
Ad sources no. 4, 3 and 5 brought the largest number of visitors that converted on the day of their first visit and during the 1st week after it. The same can be said about the number of orders.
The average time to convert within a single session is 8.5-12 minutes.
As for device types, ad sources no. 3 and 4 bring the largest number of converted visitors for both desktop and touch. The least effective are also the same: no. 9 and 10.
Analysing the average purchase size, the May, June and October cohorts show the biggest numbers.
If we split it by ad source, sources no. 1, 2, 4 and 5 bring the highest average revenue per order.
Combining the two approaches, source no. 2 turns out the most effective for the May, July, October and March cohorts. The next leading ad source is no. 1, which worked well for the August, November, June and May cohorts.
As for the total revenue, the most profitable sources are no. 4 (cohorts of October, November, December and February) and no. 3 (cohorts of October, November and December).
As for device type, source no. 4 is the most lucrative both for desktops and for touch devices. In second place for desktops is source no. 3, and for touch devices source no. 2.
The LTV figures show that ad source no. 2 is the most effective one.
We are going to analyse the overall costs, costs per source and their dynamics.
#calculating total costs
total_costs = costs['costs'].sum()
print('Value of overall costs for the whole period is ', total_costs, '.')
#Calculating costs per ad source
costs_per_source = costs.groupby('source_id')['costs'].sum().sort_values().reset_index()
costs_per_source
#plotting a barchart for costs by ad sources
plt.figure(figsize=(12,8))
# plot barh chart with index as x values
ax = sns.barplot(costs_per_source.index, costs_per_source.costs, palette="gist_heat")
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.set(xlabel="source_id", ylabel='costs')
ax.set_xticklabels(costs_per_source.source_id)
for item in ax.get_xticklabels(): item.set_rotation(90)
for i, v in enumerate(costs_per_source["costs"].iteritems()):
    ax.text(i, v[1], "{:,}".format(v[1]), color='k', va='bottom', rotation=45)
plt.tight_layout()
plt.show()
The most expensive ad source is no. 3, and the least expensive ones are no. 9 and 10.
#getting the costs structure
fig = px.pie(costs_per_source, values='costs', names='source_id', title="Structure of costs", color_discrete_sequence=px.colors.sequential.RdBu)
fig.show()
The total costs for the whole period are 329,131.62. The company spends 43% of them on source no. 3; sources no. 4 and 5 receive 18% and 15% of all costs respectively.
#Calculating dynamic of costs
costs_per_date = costs.groupby('dt')['costs'].sum().reset_index()
costs_per_date
sns.lineplot(x="dt", y="costs", data=costs_per_date, markers=True, dashes=True)
plt.xticks(rotation=90)
plt.title('Costs dynamic 2017-2018')
plt.show()
Generally, the costs dynamics correspond with the dynamics of user activity, with peak periods from October to December and a low season in summer.
#to see the bigger picture we will group the costs per week
costs['costs_year'] = costs['dt'].dt.year
costs['costs_month'] = costs['dt'].dt.month
costs['costs_week'] = costs['dt'].dt.week
#plotting weekly costs chart
(costs.groupby(['costs_year', 'costs_week'])
.agg({'costs': 'sum'})
.plot(y='costs', kind='bar', grid=True, figsize=(20,5), cmap='PiYG', title='Number of Weekly Costs')
)
plt.show()
It will be interesting to compare this with the dynamics of weekly visits:
# WAU chart
(visits.groupby(['visit_year', 'visit_week'])
.agg({'uid': 'count'})
.plot(y='uid', kind='bar', grid=True, figsize=(20,5), cmap='PiYG', title='Number of Weekly Active Users')
)
plt.show()
We can see a correlation between ad spending and user activity: the bigger the costs, the more visits the site gets. The rise of costs on week 23/2017 corresponds with a rise of visits on the same week, and we see the same dynamic on weeks 28/2017, 40/2017, 47/2017, 12/2018 and 20/2018. The rise of spending on weeks 43/2017, 50/2017 and 10/2018 did not result in a rise of visits; the ad campaigns of those weeks should be analysed more thoroughly.
The dynamics of costs generally correspond with the dynamics of user activity, which might mean that almost all marketing activities generate user flow.
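The visual correspondence can be backed by a correlation coefficient between weekly ad spend and weekly visits; the sketch below reuses the year/week columns created above (a value close to 1 would indicate a strong positive relationship).
#sketch: Pearson correlation between weekly ad spend and weekly visit counts
weekly_costs = costs.groupby(['costs_year', 'costs_week'])['costs'].sum()
weekly_costs.index.names = ['year', 'week']
weekly_visits = visits.groupby(['visit_year', 'visit_week'])['uid'].count()
weekly_visits.index.names = ['year', 'week']
weekly = pd.concat([weekly_costs, weekly_visits], axis=1).dropna()
print(weekly.corr())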
We are going to calculate the cost of acquisition (CAC per user per source) by calculating the cost of marketing per source and then dividing it by the number of users that came from that source.
#Costs per ad source per month
#adding column with month of costs
costs['dt'] = costs['dt'].astype('datetime64[M]')
#grouping data in pivot table
pivot_costs_source_monthly = costs.pivot_table(index='source_id',columns='dt',values='costs',aggfunc='sum')
#plotting heatmap for the pivot table
sns.set(style='white')
plt.figure(figsize=(20, 9))
plt.title('Costs per ad source per month')
sns.heatmap(pivot_costs_source_monthly, annot=True, fmt='g', linewidths=1, linecolor='grey');
Now we will get all the converted users per month per ad source into one table.
#calculating number of converted users per month per ad source
pivot_users_source_monthly = orders_visits.pivot_table(index='source_id',columns='order_month',values='uid',aggfunc='nunique')
#plotting heatmap for the pivot table
sns.set(style='white')
plt.figure(figsize=(20, 9))
plt.title('Unique customers per ad source per month')
sns.heatmap(pivot_users_source_monthly, annot=True, fmt='g', linewidths=1, linecolor='grey');
The new customers are mainly attracted by sources no. 3, 4 and 5. And the months with the biggest number of new converted users are December, February and March.
#calculating CAC
CAC = pivot_costs_source_monthly / pivot_users_source_monthly
#plotting heatmap for the CAC
sns.set(style='white')
plt.figure(figsize=(20, 9))
plt.title('CAC per ad source per month')
sns.heatmap(CAC, annot=True, fmt='g', linewidths=1, linecolor='grey');
As the calculations show, despite the fact that the company spent most of its marketing budget on source no. 3, sources no. 1, 5, 4, 9 and 10 bring the biggest number of converted users at the lowest cost.
Thus the most expensive users come from sources no. 2 and 3. The most expensive customers were brought by source no. 3 in August (26.03 per customer) and in July (23.11 per customer).
For source no. 4 the least effective campaigns were in June and August, with the most expensive converted users (12.6 per user).
For source no. 5 the least effective campaign was in November, with the most expensive converted users (12.6 per user).
Source no. 9 has an average conversion price, but its July campaign was the least effective, with a very high cost per conversion.
Sources no. 1 and 10 mainly bring cheap conversions.
Now we will determine which ad sources are profitable and which are not (a source is unprofitable if LTV < CAC).
Sources_profitability = LTV_monthly - CAC
#plotting heatmap for the sources profitability
sns.set(style='white')
plt.figure(figsize=(20, 9))
plt.title('Profitability of ad sources')
sns.heatmap(Sources_profitability, annot=True, fmt='g', linewidths=1, linecolor='grey');
The heatmap clearly illustrates that source no. 3 performs the worst, generating the heaviest losses (especially in July and August). Source no. 1 shows the best performance, with the largest number of months where LTV > CAC. May had the largest number of ad sources with LTV > CAC.
We will determine ROMI (return on marketing investment) by dividing LTV by CAC.
#creating a general table for calculations
report = orders_visits[['uid', 'buy_ts', 'first_order_date', 'revenue']].copy() #.copy() avoids SettingWithCopyWarning on the assignments below
report['buy_ts'] = report['buy_ts'].astype('datetime64[M]')
report['first_order_date'] = report['first_order_date'].astype('datetime64[M]')
report = report.rename(columns={'buy_ts': 'order_month', 'first_order_date':'first_order_month'})
report
cohort_sizes = report.groupby('first_order_month').agg({'uid': 'nunique'}).reset_index()
cohort_sizes.columns = ['first_order_month', 'n_buyers']
cohort_sizes.head()
cohorts = report.groupby(['first_order_month','order_month']).agg({'revenue': 'sum'}).reset_index()
cohorts.head()
cohorts_general = pd.merge(cohort_sizes, cohorts, on='first_order_month')
cohorts_general.head()
cohorts_general['cohort_age'] = (cohorts_general['order_month'] - cohorts_general['first_order_month']) / np.timedelta64(1, 'M')
cohorts_general['cohort_age'] = cohorts_general['cohort_age'].round().astype('int')
cohorts_general.head()
ppp = cohorts_general.pivot_table(
index='first_order_month',
columns='cohort_age',
values='n_buyers').round().fillna('')
ppp
cohorts_general['ltv'] = cohorts_general['revenue'] / cohorts_general['n_buyers']
report_pivot = cohorts_general.pivot_table(
index='first_order_month',
columns='cohort_age',
values='ltv',
aggfunc='mean').round()
report_pivot.fillna('')
costs_ = costs[['dt', 'costs']].copy()
costs_['dt'] = costs_['dt'].astype('datetime64[M]')
costs_
costs_m = costs_.groupby(['dt']).agg({'costs': 'sum'}).reset_index()
costs_m
report_general = pd.merge(cohorts_general, costs_m, left_on='order_month', right_on='dt')
report_general['cac'] = report_general['costs'] / report_general['n_buyers']
report_general.head()
report_general['romi'] = report_general['ltv'] / report_general['cac']
romi_pivot = report_general.pivot_table(
index='first_order_month',
columns='cohort_age',
values='romi',
aggfunc='mean')
romi_pivot.cumsum(axis=1).round(2).fillna('')
None of the cohorts paid off during their lifetime (ROMI < 1). The March 2018 cohort came closest to paying off by its 3rd month (ROMI 0.72). The furthest from the payback stage is the January cohort, with ROMI below 50% by the end of its lifetime.
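To make the 'none paid off' statement explicit, we can look for the first cohort age at which cumulative ROMI reaches 1; NaN means the cohort never paid back within the observed window. A sketch based on the pivot above:
#sketch: first cohort age (months) at which cumulative ROMI reaches 1
romi_cum = romi_pivot.cumsum(axis=1)
payback_month = (romi_cum >= 1).idxmax(axis=1).where(romi_cum.max(axis=1) >= 1)
print(payback_month)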
To define ROMI per ad source we will divide LTV per source by CAC per source:
ROMI = LTV_monthly/CAC
#plotting heatmap for ROMI
sns.set(style='white')
plt.figure(figsize=(20, 9))
plt.title('ROMI per ad sources')
sns.heatmap(ROMI, annot=True, fmt='g', linewidths=1, linecolor='grey');
The ROMI calculations confirm that source no. 3 does not pay off during the whole period. The best source from an investment point of view is source no. 1: it has the largest number of months with LTV > CAC. The second best are sources no. 2 and 9.
Total costs for the whole period are 329,131.62. Analysing the company's costs, we can see that 43% of the marketing budget is spent on ad source no. 3; source no. 4 is in 2nd place, and source no. 9 got the least investment.
We can also see a correlation between ad spending and user activity: the bigger the costs, the more visits the site gets.
The costs dynamics also show the same seasonal fluctuations, with the peak season starting in September and ending in March.
Splitting costs by ad source and by month, we can see that during the autumn-winter high season marketing spending rises for all channels, but the leading one is source no. 3, followed by sources no. 4, 5 and 2.
However, if we take into account the number of converted customers and calculate CAC, we can clearly see that source no. 3 is the least effective, as it has the highest cost per converted user. The most effective channels in this respect are sources no. 1, 5, 4, 9 and 10.
The ROMI values show that none of the cohorts reached the payback stage, though the March cohort got closest to it (ROMI = 0.72).
If we split it by ad channel, it becomes clear that source no. 1 is the best one from an investment point of view, as it has the largest number of months with LTV > CAC (from December 2017 to May 2018).
Then come sources no. 2 and 9, with two positive months each.
The company should prioritise sources no. 1, 2 and 9 when forming its next marketing budgets.
In the course of this work we have studied Afisha's product itself, its marketing channels and its sales efficiency.
Afisha's product shows strong seasonal fluctuations, with a low season in summer and a high season in fall-winter. In general, activity on the site grew through 2017-2018 (the DAU, WAU and MAU values all grew).
The number of visits via touch devices is significantly lower than via desktop: 266 vs 721 per day on average. Touch users show less fluctuation over the period.
The median session length for touch devices is shorter than for desktop (180 sec vs 360 sec).
We can see that Afisha's customers prefer to use desktop devices and stay on the site longer than with the app.
The average retention rate is not high but grew from 5% in 2017 to 19% in 2018. In 2017 the highest retention rates were among users that came in August and October (every 5th user came back the next month). This might mean that the ad campaigns of those periods were the most effectively targeted; when planning the next campaigns, the marketing department could use the positive experience of the August and October campaigns.
Most converted users tend to make orders during their first visit, although order size gets bigger among those who come back later. This should be taken into account when building the funnel.
As far as revenue and costs are concerned, in 2017-2018 marketing costs exceeded revenue (329,131.62 vs 252,057.2).
Desktop devices brought more revenue than touch ones.
The biggest ad sources for both types of devices were no. 3 and 4, together they brought 50% of orders. Sources no. 9 and 10 account only for 6% of orders.
However, if we take into account not only the number of visitors and orders but also their acquisition costs, we will see that source no. 3 is the least effective one. CAC and ROMI show that this channel has the highest cost of user acquisition, which is not covered by the revenue received. Moreover, the source shows no positive trend: in every month CAC is 18-35% higher than LTV.
Judging by ROMI, the most effective sources are no. 1, 2 and 9. They account for the smallest number of orders (for both device types), but the share of revenue brought by sources 1 and 2 is much bigger (see the pie charts 'Shares of Ad Sources by Revenue' and 'Shares of Ad Sources by Orders'). For touch devices source no. 2 performs even better than for desktops.
Based on the calculations above, we suggest restructuring marketing investments in the following way.
Sources no. 2, 1 and 9 should get bigger investments and a larger share of the marketing budget. Source no. 3 should either not be used at all or its campaigns should be retargeted: this channel brings the biggest number of visitors, but their orders do not cover their acquisition costs, which might mean the channel attracts a poorly targeted audience.
As Afisha users prefer desktop devices, marketing campaigns should be planned accordingly, with more attention to site content and UX.
Since Afisha visitors use the app less than the site, we would suggest, first, making the app more convenient with better UX and, second, testing additional channels for touch devices to get more app downloads.
We have seen seasonal fluctuations in user activity and advise Afisha to smooth them by:
- adding seasonal offers and coupons to the site,
- pre-selling tickets before the start of the high season,
- keeping users active on the site by sending emails.
Afisha might also try diversifying its sources and test other ad channels.