The aim of the project is to identify patterns that determine whether a game succeeds or not. This allows us to spot potential big winners and plan advertising campaigns.
The data refers to 2016.
We will test the following hypotheses:
Part 1. Data preprocessing
Part 2. Exploratory data analysis (EDA)
Defining platforms leading in sales
Platforms with the greatest total sales and distribution based on data for each year
Platforms that used to be popular but now have zero sales
Defining significant period for the data
Defining potentially profitable platforms
Box plot for the global sales of all games, broken down by platform
Correlation between reviews and sales
General distribution of games by genre
Creation of a user profile for each region
Top 5 genres by regions
Part 3. Hypotheses testing
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
from functools import reduce
from scipy import stats as st
import plotly
import squarify
import seaborn as sns
df = pd.read_csv('./games.csv', sep=',')
df.head()
df.sample(6)
# Map the lowering function to all column names
df.columns = map(str.lower, df.columns)
df.head()
df.info()
#statistical summary of the data
df.describe()
#Checking data for zeros:
for i in df.columns:
    print(i, len(df[df[i]==0]))
#statistical summary for categorical variables
df.describe(include=['object'])
df['platform'].value_counts()
Our initial dataset has 16715 entries and 11 columns.
The data covers the period 1980-2016, with 11559 unique game names across 12 genres and 31 platforms.
The most frequently mentioned game is 'Need for Speed', platform - PS2, genre - Action.
We do not have missing values in the 4 sales columns. 'critic_score', 'user_score' and 'rating' have the highest percentage of missing values, over 40%.
As far as the zeros in the data are concerned, we can see that some of the games were sold only in NA, only in Europe or only in Japan.
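As a quick sanity check of this observation, we can count the games that have positive sales in exactly one of the three main regions (a small sketch, not part of the main pipeline):
#counting games sold in exactly one of NA, EU and JP
region_cols = ['na_sales', 'eu_sales', 'jp_sales']
single_region = (df[region_cols] > 0).sum(axis=1) == 1
print('Games sold in exactly one of NA, EU or JP:', single_region.sum())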
# defining columns with missing values
missing_list=[]
for x in df:
    if len(df[df[x].isnull()])>0:
        missing_list.append(x)
print('The following columns have missing values: ', missing_list)
# calculating percentage of missing values in every column
missing_percentage=[]
for x in missing_list:
    missing_percentage.append([x,(len(df[df[x].isnull()])/len(df))])
missing_percentage=pd.DataFrame(missing_percentage, columns=['column','missing_values_share'])
missing_percentage.sort_values(by=['missing_values_share'], ascending=False)
df.notnull().sum()
df.isnull().any()
#rows with at least one missing value in any column
df[df.isnull().any(axis=1)]
#rows that have at least one non-missing value (effectively the whole dataset)
df[df.notnull().any(axis=1)]
#rows that have no missing values at all
df[df.notnull().all(axis=1)]
As we have only a few missing values in the columns 'year_of_release', 'name' and 'genre' (all these rows together make up <2% of the dataset), we can drop them and go further with our calculations; this will not influence our study results. As 'name' and 'genre' have the same number of missing values, these gaps might be a data entry error.
df = df.dropna(axis=0, subset=['name', 'year_of_release', 'genre'])
df.head()
#changing value type to integer to get rid of decimals in the column 'year_of_release'
df['year_of_release'] = df['year_of_release'].astype(int)
df['year_of_release'].head()
df.groupby(['year_of_release']).size().plot(
kind='bar',
grid=True,
title='Number of data without NaNs',
figsize=(10,5)
)
plt.show()
As we can see, most of the missing data falls into the period 1980-1994, and there is also some fluctuation across the entire sample period. To mitigate the effect of missing data, we will use grouping and the transform method where applicable.
We have over 40% of missing values in 'critic_score' and 'user_score'.
We will fill NaNs for multiplatform games based on the scores these games received on other platforms.
df['critic_score'] = df['critic_score'].fillna(df.groupby(['name'])['critic_score'].transform('first'))
crit_score000 = df['critic_score'].isna().sum()/len(df['critic_score'])
print('Percentage of missing values is {: .2%}'.format(crit_score000))
The percentage of missing values has slightly dropped.
'user_score' column
As the data type here is object, we cannot apply the same method of filling NaNs with a numeric statistic such as the median straight away.
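One possible workaround, sketched here only for illustration (the cells below instead replace 'tbd' explicitly before converting the type), would be to coerce the column to numeric and fill the gaps with a per-game median:
#a sketch only: coerce non-numeric values such as 'tbd' to NaN, then fill with the median score of the same game on other platforms
user_score_numeric = pd.to_numeric(df['user_score'], errors='coerce')
user_score_filled = user_score_numeric.fillna(user_score_numeric.groupby(df['name']).transform('median'))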
df['user_score'].value_counts()
#checking the data types in this column
df.user_score.apply(type).unique()
Presumably 'tbd' is a string and the rest of the data here is float. We will check the share of 'tbd' values and of the rest.
#defining share of every value
df['user_score'].value_counts()/len(df)
We can see that about 15% of the values are marked 'tbd'. We will have a look at what information the rows with this value contain.
df[df.user_score=='tbd']
df[df.user_score=='tbd']['year_of_release'].value_counts()
There is no clear pattern in this data: it refers to different years, platforms, genres, etc. This data obviously cannot be restored. We will replace 'tbd' with NaN and then restore what we can for multiplatform games. Afterwards we will be able to change the data type.
#replacing tbd with nan
df['user_score'] = df['user_score'].replace('tbd', np.nan)
user_score_nan = df['user_score'].isna().sum()/len(df['user_score'])
print('Percentage of missing values is {: .2%}'.format(user_score_nan))
The percentage of NaNs has risen to 54%.
Now we fill NaNs for multiplatform games based on the user scores these games received on other platforms.
df['user_score'] = df['user_score'].fillna(df.groupby(['name'])['user_score'].transform('first'))
user_score000 = df['user_score'].isna().sum()/len(df['user_score'])
print('Percentage of missing values is {: .2%}'.format(user_score000))
#filling NaNs with 0 to be able to change the data type later on.
df['user_score'] = df['user_score'].fillna(0)
#changing the data type to float so the scores can be treated as numbers.
df['user_score'] = df['user_score'].astype(float)
df['user_score'].value_counts()
'rating' column
df['rating'].value_counts()
df['rating'].unique()
We have over 40% of missing values here. We will leave the data as is.
#The data after treating the missing values
df.groupby(['year_of_release']).size().plot(
kind='bar',
title='Number of data without NaNs',
figsize=(10,5)
)
plt.show()
#calculating the sales across the different regions
df['total_sales']= df[['na_sales', 'eu_sales', 'jp_sales', 'other_sales']].sum(axis=1)
grouped_sales = df.groupby('year_of_release')['total_sales'].max().sort_values(ascending=False).reset_index()
grouped_sales
The top-selling game of a single year grew steadily through the years, from USD 3.5 mln in 1985 to USD 82.54 mln in 2006. Obviously the year 2006 was the most lucrative one.
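The table above takes the maximum of 'total_sales' per year, i.e. the single best-selling game of each year. For comparison, the total market volume per year could be obtained like this (a sketch, not used further below):
#total sales of all games per year, as opposed to the single best-selling game of a year
yearly_totals = df.groupby('year_of_release')['total_sales'].sum().sort_values(ascending=False)
yearly_totals.head()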
#number of games released yearly
release_early = df.groupby('year_of_release')['name'].count().sort_values().reset_index()
release_early.head()
fig,ax=plt.subplots(figsize=(13,10))
ax.vlines(x=release_early.year_of_release, ymin=0, ymax=release_early.name, color='purple', alpha=0.7, linewidth=2)
ax.scatter(x=release_early.year_of_release, y=release_early.name, s=75, color='lightgreen', alpha=0.7)
ax.set_title('Lollipop Chart for Released Games', fontdict={'size':16})
ax.set_ylabel('Number of game releases')
ax.set_xticks(release_early.year_of_release)
ax.set_xticklabels(release_early.year_of_release,rotation=90, fontdict={'horizontalalignment':'right','size':14})
for row in release_early.itertuples():
    ax.text(row.year_of_release, row.name+30, s=round(row.name,2), rotation=90, fontdict={'horizontalalignment':'center','size':10})
Game releases boomed from 1994 and reached their peak at slightly over 1400 games per year in 2008-2009; during the following years the number of releases decreased. The strongest fall was in 2012 (from over 1100 to slightly less than 700), and since then the number of releases has stayed stable at 500 to 600 per year. Presumably the data from the first years of game production (1985-1994) is fragmentary, and up to 2001 it is not significant due to the minor number of games released.
Selecting the platforms with the largest total sales.
platform_sales = df[['platform', 'total_sales']].groupby('platform').sum().sort_values(by='total_sales', ascending=False).reset_index()
platform_sales
#visualizing total sales per platform
(df.groupby('platform')
.agg({'total_sales': sum})
.plot(y='total_sales', kind='bar', grid=True, figsize=(10,5), cmap='PiYG')
)
plt.show()
We will define how successful a platform is in relation to the mean value of sales. To do that we will calculate a z-score for the distribution as a numerical measurement that describes a value's relationship to the mean of a group of values. If a z-score is 0, it indicates that the total sales of a platform are identical to the mean.
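In formula form, the z-score computed in the next cell is z = (x − mean(x)) / std(x), where x is a platform's total sales and std is the sample standard deviation (ddof=1, pandas' default).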
#counting z-score for sales to define distribution
platform_sales['sales_z']=(platform_sales['total_sales']-platform_sales['total_sales'].mean())/platform_sales['total_sales'].std()
platform_sales.head()
#distinguishing z-score by colors: red for negative values, blue for positive ones
platform_sales['colors'] = ['red' if x<0 else 'purple' for x in platform_sales['sales_z']]
platform_sales['colors']
plt.figure(figsize=(14,10))
plt.hlines(y=platform_sales.platform, xmin=0, xmax=platform_sales.sales_z, colors=platform_sales.colors, alpha=0.4, linewidth=10)
The graph clearly shows that the most successful platforms, with sales above average, are to the right (in purple) and the less successful ones are to the left (in red) of the chart.
As the last graph and the table show, the top 6 leading platforms by total sales are PS2, X360, PS3, Wii, DS and PS, each with total sales over USD 700 mln. The rest are far behind the leaders. Distributions will be built for these platforms.
#nlargest method for defining top 6 successful games
df.nlargest(6,['total_sales'])
Analysis of the most successful games confirms our choice of the most successful platforms.
#nsmallest method for defining 5 least popular games
df.nsmallest(5,['total_sales'])
# Drawing plot of the distribution of sales for the leading platforms
plt.figure(figsize=(16,10), dpi= 80)
sns.kdeplot(df.loc[df['platform'] == "PS2", 'year_of_release'], shade=True, color="g", label="PS2", alpha=.7)
sns.kdeplot(df.loc[df['platform'] == "X360", 'year_of_release'], shade=True, color="deeppink", label="X360", alpha=.7)
sns.kdeplot(df.loc[df['platform'] == "PS3", 'year_of_release'], shade=True, color="dodgerblue", label="PS3", alpha=.7)
sns.kdeplot(df.loc[df['platform'] == "Wii", 'year_of_release'], shade=True, color="red", label="Wii", alpha=.7)
sns.kdeplot(df.loc[df['platform'] == "DS", 'year_of_release'], shade=True, color="orange", label="DS", alpha=.7)
sns.kdeplot(df.loc[df['platform'] == "PS", 'year_of_release'], shade=True, color="yellow", label="PS", alpha=.7)
# Decoration
plt.title('Distribution of total sales per year for 6 top platforms', fontsize=22)
plt.legend()
plt.show()
Below we will illustrate this three-dimensional data with a heatmap chart.
df_select_platforms = df.query('platform == "PS2" or platform == "X360" or platform == "PS3" or platform == "Wii" or platform == "DS" or platform == "PS"')
df_heatmap_platforms=pd.pivot_table(df_select_platforms, index='year_of_release', columns='platform', values='total_sales', aggfunc=sum, fill_value=0)
df_heatmap_platforms
plt.figure(figsize=(13,9))
sns.heatmap(df_heatmap_platforms.T, cmap="RdBu_r")
The heatmap gives us another angle of view on the data and a deeper comprehension of the processes.
Building a distribution chart for each platform separately.
#1st platform
top_6_platforms_PS2 = df.query('platform=="PS2"').groupby('year_of_release').sum().sort_values(by='total_sales').reset_index()
top_6_platforms_PS2.head(3)
#chart
sns.distplot(top_6_platforms_PS2['year_of_release'], label = 'PS2', color = 'g', bins=12)
#2nd platform
top_6_platforms_X360 = df.query('platform=="X360"').groupby('year_of_release').sum().sort_values(by='total_sales').reset_index()
sns.distplot(top_6_platforms_X360['year_of_release'], label = 'X360', color = 'deeppink', bins=10)
#3rd platform
top_6_platforms_PS3 = df.query('platform=="PS3"').groupby('year_of_release').sum().sort_values(by='total_sales').reset_index()
sns.distplot(top_6_platforms_PS3['year_of_release'], label = 'PS3', color = 'b')
#4th platform
top_6_platforms_Wii = df.query('platform=="Wii"').groupby('year_of_release').sum().sort_values(by='total_sales').reset_index()
sns.distplot(top_6_platforms_Wii['year_of_release'], label = 'Wii', color = 'r')
#5th platform
top_6_platforms_DS = df.query('platform=="DS"').groupby('year_of_release').sum().sort_values(by='total_sales').reset_index()
sns.distplot(top_6_platforms_DS['year_of_release'], label = 'DS', color = 'orange', bins=10)
#searching for the outlier
DS_outlier = df.query('platform=="DS" and year_of_release <1990')
DS_outlier
A quick search tells us that development of the Nintendo DS began around mid-2002. Thus we can drop the row with the game "Shogi DS", as it is obviously not connected to the popular game platform.
df=df.drop(index=15957)
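An index-independent alternative (a sketch that removes the same row by condition rather than by the hard-coded label 15957):
#dropping the pre-1990 "DS" entry by condition instead of by index label
df = df[~((df['platform'] == 'DS') & (df['year_of_release'] < 1990))]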
top_6_platforms_DS_no_outlier = df.query('platform=="DS"').groupby('year_of_release').sum().sort_values(by='total_sales').reset_index()
sns.distplot(top_6_platforms_DS_no_outlier['year_of_release'], label = 'DS', color = 'orange', bins=8)
plt.figure(figsize=(6,4), dpi= 80)
sns.kdeplot(df.loc[df['platform'] == "DS", 'year_of_release'], shade=True, color="orange", label="DS", alpha=.7)
# Decoration
plt.title('Distribution of total sales per year for the DS platform', fontsize=22)
plt.legend()
plt.show()
#6th platform
top_6_platforms_PS = df.query('platform=="PS"').groupby('year_of_release').sum().sort_values(by='total_sales').reset_index()
sns.distplot(top_6_platforms_PS['year_of_release'], label = 'PS', color = 'yellow', bins=4)
The charts illustrate that every platform had a life cycle that lasted approximately 10 years. Some platforms have several peaks in the middle, corresponding to the launch of new games.
We are going to build a treemap to get a general picture of the distribution of the market.
df_tree = df[['platform', 'total_sales']].groupby('platform').sum().sort_values(by='total_sales', ascending=False).reset_index()
df_tree=df_tree[df_tree['total_sales']>=15]
sizes=df_tree.total_sales.values.tolist()
labels = df_tree.apply(lambda x: str(x[0]) + "\n" + "mln $" + str(round(x[1])) , axis=1)
colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]
# Draw Plot
plt.figure(figsize=(15,9), dpi= 80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)
# Decorate
plt.title('Treemap of Platforms Total Sales')
#plt.axis('off')
plt.show()
Our 6 market leaders are all in shades of purple in the left and bottom parts of the chart.
All the most popular platforms with the biggest sales show a drop in sales by 2016, and according to the graphs their life cycle has either ended or is heading to an end. Judging by the market leaders, the period from the launch of a platform to its end is about 9-10 years.
The PS platform, the first of the leaders, emerged in 1993 and had completely stopped selling by 2004. Peak sales were reached by 1998. The decline in sales started right after 1998, which corresponds to the launch of the new future leader (PS2) and the boom of its sales. All in all it took 10 years for the platform to go from its first sale to its last.
PS2, the top-selling platform, did not have such sharp peaks as its predecessor, but it took only about 3-4 years to reach the top of its sales, and then for 4 years its sales were steady and high without any declines. PS2 never reached the highest peaks of PS sales, but in the length of its stable peak it had no rivals. The decline in sales started in 2005, as 4 other future market leaders were gaining popularity. According to Wikipedia, ["On November 29, 2005, the PS2 became the fastest game console to reach 100 million units shipped, accomplishing the feat within 5 years and 9 months from its launch. This achievement occurred faster than its predecessor, the PlayStation, which took 9 years and 6 months since launch" to reach the same figure.](https://en.wikipedia.org/wiki/PlayStation#PlayStation). PS2 production ended in 2012, so the platform was active for 12 years. By 2015 it still held the rank of the best-selling console [of all time](https://www.theguardian.com/technology/2013/jan/04/playstation-2-manufacture-ends-years?INTCMP=SRCH).
The DS was released globally across 2004-2005; its production ended in 2014.
PS3 was released in 2006 and competed with X360 and Wii, launched a year earlier. It never reached the sales volumes of its main rivals. Shipments of new units ended in 2016. The platform lived for 10 years.
Wii was launched in 2006. From 2013 its Internet services were gradually discontinued, and the Wii could no longer be purchased after 2018. The platform lived for 12 years.
X360 was launched across 2005-2006 and its production ended in 2016: 10 years of sales.
The leading platforms completed their launch in the key markets within one year. After a growth in sales they reached their peak, and after the emergence of new successful rivals the older ones headed into a decline that lasted about 4-5 years.
Next we will narrow the period of analysis to investigate the life cycle of the platforms in more detail.
Taking into account the quality of the data and the amount of sales in the first years of game production, we should not use data from before 2001. As the life cycle of a platform is 10 years on average, data starting from 2001 also suits the purposes of the analysis because this period covers about 1.5 platform life cycles. This will also be sufficient for building a forecast for 2017.
Building a shifted sales table for the platforms, taking into account only games released after 2000.
df_after_2000 = df[df.year_of_release>2000]
df_after_2000.head()
df1=pd.pivot_table(df_after_2000, index='year_of_release', columns='platform', values='total_sales', aggfunc=sum, fill_value=0)
df1
df1.shift(+1).tail()
#calculating the dynamics of sales
sales_dynamics = df1-df1.shift(+1)
sales_dynamics.tail()
#illustrating the dynamics
plt.figure(figsize=(13,9))
sns.heatmap(sales_dynamics.T, cmap="RdBu_r")
plt.show()
According to the above calculations, all the platforms show a drop in sales in 2016 in comparison to 2015. The life cycle of most platforms has ended by 2016: the companies have either already ended production of new hardware or their sales are dropping. Based on the current data we cannot single out any potentially profitable platforms.
grouped = df_after_2000.groupby(['platform', 'year_of_release'])['total_sales'].sum().reset_index()
grouped
ordered=grouped.groupby(['platform'])['total_sales'].sum().sort_values().reset_index()['platform']
ordered
plt.figure(figsize=(13,10))
sns.boxplot(x='platform', y='total_sales', data=grouped, order=ordered)
The box plot clearly illustrates that platforms perform dramatically differently in terms of total sales. To the right we have almost the same group of 6 market leaders. As we have chosen only platforms whose releases came after 2000, the PS platform has not made it into the group of leaders (most of its games were released well before), but PS4 has.
We will look into the leaders' performance in more detail.
#picking leaders
df_after_2000_leaders = df_after_2000[df_after_2000.platform.isin(("PS2", "X360", "PS3", "Wii", "DS"))]
df_after_2000_leaders.head()
grouped_leaders = df_after_2000_leaders.groupby(['platform', 'year_of_release'])['total_sales'].sum().reset_index()
grouped_leaders.head()
import plotly.express as px
fig = px.box(grouped_leaders, x="platform", y="total_sales", color="platform")
fig.update_traces(quartilemethod="exclusive")
fig.show()
Median values for the leaders also vary greatly: from USD 59.65 mln for Wii to 103.42 mln for PS2.<br/>
PS2 is the leader with the highest median and the biggest maximum value; the middle 50% of its sales lie between 26.4 mln and 184.31 mln.
The Wii platform has the lowest median among the leaders; the distribution of its sales has a long upper whisker and is skewed to the right. The middle 50% of its sales were between 3.75 and 152.77 mln.
The least successful among these leaders is the DS platform, with a median of 102.28 mln, a maximum of 146.94 mln and the middle 50% between 17.27 mln and 130.14 mln.
In general the difference in sales is significant even between the leaders.
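The quoted medians and quartiles can be reproduced directly from the grouped table (a sketch; plotly's 'exclusive' quartile method may give slightly different values than pandas' default interpolation):
#median, quartiles, min and max of yearly total sales for each leading platform
grouped_leaders.groupby('platform')['total_sales'].describe()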
We will look at the data for PS2, the most popular platform overall.
review_correlation_PS2 = df.query('platform=="PS2"').groupby('user_score').sum().sort_values(by='total_sales').reset_index()
review_correlation_PS2.head()
grouped_correlation_PS2 = review_correlation_PS2.groupby(['user_score', 'critic_score'])['total_sales'].sum().reset_index()
grouped_correlation_PS2.head()
#building scatter plots in pairs for the chosen parameters
pd.plotting.scatter_matrix(grouped_correlation_PS2, figsize=(11, 11))
plt.show()
corrMatrix = grouped_correlation_PS2.corr()
sns.heatmap(corrMatrix, annot=True)
plt.show()
Both the scatter plot and the matrix show that there is a strong linear relationship between critic scores and total sales of the platform's games. User score and total sales have a weak negative relationship.
#we will calculate correlation for the least successful of the market leaders, DS
review_correlation_DS = df.query('platform=="DS"').groupby('user_score').sum().sort_values(by='total_sales').reset_index()
grouped_correlation_DS = review_correlation_DS.groupby(['user_score', 'critic_score'])['total_sales'].sum().reset_index()
corrMatrix = grouped_correlation_DS.corr()
sns.heatmap(corrMatrix, annot=True)
plt.show()
The second matrix shows the same pattern: there is a strong positive correlation between critic scores and total sales of the platform's games and a weak negative relationship between sales and user scores. And there is almost no correlation between user and critic scores.
#corr for a middle outsider, WiiU
review_correlation_WiiU = df.query('platform=="WiiU"').groupby('user_score').sum().sort_values(by='total_sales').reset_index()
grouped_correlation_WiiU = review_correlation_WiiU.groupby(['user_score', 'critic_score'])['total_sales'].sum().reset_index()
corrMatrix = grouped_correlation_WiiU.corr()
sns.heatmap(corrMatrix, annot=True)
plt.show()
Here we can see that the correlation between critic scores and total sales of the platform's games is much weaker, almost as weak as for the user score. User and critic scores have a strong relationship here.
#corr for one of the least successful platforms, DC
review_correlation_DC = df.query('platform=="DC"').groupby('user_score').sum().sort_values(by='total_sales').reset_index()
grouped_correlation_DC = review_correlation_DC.groupby(['user_score', 'critic_score'])['total_sales'].sum().reset_index()
corrMatrix = grouped_correlation_DC.corr()
sns.heatmap(corrMatrix, annot=True)
plt.show()
For an outsider platform the correlation between critic scores and total sales is weak, while user scores have a strong negative correlation with sales. User and critic scores also have a strong positive correlation.
All in all, across the different groups of platforms we can see that the linear correlation between the 3 factors is not stable from platform to platform.
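A compact way to see this instability is to compute the same correlations per platform on the raw game-level data (a sketch; note that the zeros used earlier to fill missing user scores may distort these figures):
#Pearson correlation of scores with total sales, per platform, on game-level data
for platform in ['PS2', 'DS', 'WiiU', 'DC']:
    subset = df[df['platform'] == platform]
    print(platform,
          '| critic_score vs total_sales:', round(subset['critic_score'].corr(subset['total_sales']), 2),
          '| user_score vs total_sales:', round(subset['user_score'].corr(subset['total_sales']), 2))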
df['genre'].value_counts()
# As genre is a categorical parameter, distribution will be presented by a bar chart.
(df.groupby('genre')
.agg({'genre': ['count']})
.plot(y='genre', kind='bar', grid=True, figsize=(10,5), cmap='PiYG')
)
plt.show()
# Distribution of total sales by genre
(df.groupby('genre')
.agg({'total_sales': sum})
.plot(y='total_sales', kind='bar', grid=True, figsize=(10,5), cmap='PiYG')
)
plt.show()
grouped_genres = df.groupby(['genre', 'year_of_release'])['total_sales'].sum().reset_index()
grouped_genres
# General market view from the total sales point of view
df_genres_tree = df[['genre', 'total_sales']].groupby('genre').sum().sort_values(by='total_sales', ascending=False).reset_index()
df_genres_tree=df_genres_tree[df_genres_tree['total_sales']!=0]
sizes=df_genres_tree.total_sales.values.tolist()
labels = df_genres_tree.apply(lambda x: str(x[0]) + "\n" + "mln $" + str(round(x[1])) , axis=1)
colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]
# Draw Plot
plt.figure(figsize=(15,9), dpi= 80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)
# Decorate
plt.title('Treemap of Genres Total Sales')
plt.show()
Judging by total sales across all the years of the analysis, the most lucrative genres are action, sports and shooter. Puzzle and strategy have brought in the least amount of money.
#Distribution of genres through the years
# Drawing plot of the distribution of games by genre
plt.figure(figsize=(16,10), dpi= 80)
sns.set_palette("Set2")
sns.kdeplot(df.loc[df['genre'] == "Action", 'year_of_release'], shade=True, label="Action", alpha=.7)
sns.kdeplot(df.loc[df['genre'] == "Sports", 'year_of_release'], shade=True, label="Sports", alpha=.7)
sns.kdeplot(df.loc[df['genre'] == "Shooter", 'year_of_release'], shade=True, label="Shooter", alpha=.7)
sns.kdeplot(df.loc[df['genre'] == "Racing", 'year_of_release'], shade=True, label="Racing", alpha=.7)
sns.kdeplot(df.loc[df['genre'] == "Simulation", 'year_of_release'], shade=True, label="Simulation", alpha=.7)
sns.kdeplot(df.loc[df['genre'] == "Fighting", 'year_of_release'], shade=True, label="Fighting", alpha=.7)
sns.kdeplot(df.loc[df['genre'] == "Platform", 'year_of_release'], shade=True, label="Platform", alpha=.7)
sns.kdeplot(df.loc[df['genre'] == "Strategy", 'year_of_release'], shade=True, label="Strategy", alpha=.7)
sns.kdeplot(df.loc[df['genre'] == "Puzzle", 'year_of_release'], shade=True, label="Puzzle", alpha=.5)
# Decoration
plt.title('Distribution of Games by Genre through 1985-2016', fontsize=22)
plt.legend()
plt.show()
#Having a look at the 3 most profitable and the 2 least profitable genres
plt.figure(figsize=(16,10), dpi= 80)
sns.set_palette("Set2")
sns.kdeplot(df.loc[df['genre'] == "Action", 'year_of_release'], shade=True, label="Action", alpha=.7)
sns.kdeplot(df.loc[df['genre'] == "Sports", 'year_of_release'], shade=True, label="Sports", alpha=.7)
sns.kdeplot(df.loc[df['genre'] == "Shooter", 'year_of_release'], shade=True, label="Shooter", alpha=.7)
sns.kdeplot(df.loc[df['genre'] == "Strategy", 'year_of_release'], shade=True, label="Strategy", alpha=.7)
sns.kdeplot(df.loc[df['genre'] == "Puzzle", 'year_of_release'], shade=True, label="Puzzle", alpha=.7)
# Decoration
plt.title('Distribution of Games by Genre: sales leaders and losers', fontsize=22)
plt.legend()
plt.show()
Below the same data is presented as a heatmap.
df_select_genre = df.query('genre == "Action" or genre == "Sports" or genre == "Shooter" or genre == "Strategy" or genre == "Puzzle"')
df_heatmap_genres=pd.pivot_table(df_select_genre, index='year_of_release', columns='genre', values='total_sales', aggfunc=sum, fill_value=0)
df_heatmap_genres
plt.figure(figsize=(13,9))
sns.heatmap(df_heatmap_genres.T, cmap="RdBu_r")
plt.show()
None of these genres is new; almost all of them emerged in the early 1990s.<br/>
If we look at the distribution of the genres through the years, we see that strategy games emerged at the very beginning of the 1990s together with fighting. Puzzles were at the peak of their popularity in 2007-2008 but then quickly lost their leading position. Misc and adventure stepped in by 2010.
Adventure slowly but steadily gained popularity starting from 1995. This genre also reached its peak by 2008. Although it then became almost twice less popular, the number of its fans stayed stable up to the year of analysis, 2016.
As for the sales leader, Action, it boomed by 2008 and kept this level of popularity until 2012. That corresponds to the general market trend, with the games boom in 2008-2009 and the market fall in 2012. After a short decrease it seems to be gaining popularity again, and in 2015 it is the undisputed leader among the genres.
The Shooter genre has a longer history of success: a first peak in 1996 with a slight decrease in 1998, then a second impressive peak in 2004 that lasted till 2009. And now, in 2015, this genre is the 2nd most popular one despite its steady fall since 2009-2010.
The Sports genre had two peaks of popularity, 2001-2003 and 2008-2009; by 2015 interest in this genre seems to have got a fresh impulse.
All of the market leaders' distributions have thicker tails to the right.
platform_sales_na = df[['platform', 'na_sales']].groupby('platform').sum().sort_values(by='na_sales', ascending=False).reset_index()
platform_sales_na
#calculating market share for every platform
platform_sales_na['percentage_na'] = (platform_sales_na['na_sales']/platform_sales_na['na_sales'].sum())*100
platform_sales_na.head()
# NA market view for platforms
platform_sales_na=platform_sales_na[platform_sales_na['na_sales']>=6]
sizes=platform_sales_na.na_sales.values.tolist()
labels = platform_sales_na.apply(lambda x: str(x[0]) + "\n" + "mln $" + str(round(x[1])) , axis=1)
colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]
# Draw Plot
plt.figure(figsize=(15,9), dpi= 80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)
# Decorate
plt.title('Platforms Market Shares in NA')
#plt.axis('off')
plt.show()
Five top selling platforms for North America are: X360, PS2, Wii, PS3, DS.
#creating a table
platform_sales_eu = df[['platform', 'eu_sales']].groupby('platform').sum().sort_values(by='eu_sales', ascending=False).reset_index()
#calculating market share for every platform
platform_sales_eu['percentage_eu'] = (platform_sales_eu['eu_sales']/platform_sales_eu['eu_sales'].sum())*100
platform_sales_eu.head()
# EU market view for platforms
platform_sales_eu=platform_sales_eu[platform_sales_eu['eu_sales']>=6]
sizes=platform_sales_eu.eu_sales.values.tolist()
labels = platform_sales_eu.apply(lambda x: str(x[0]) + "\n" + "mln $" + str(round(x[1])) , axis=1)
colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]
# Draw Plot
plt.figure(figsize=(15,9), dpi= 80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)
# Decorate
plt.title('Platforms Market Shares in EU')
#plt.axis('off')
plt.show()
Five top selling platforms for Europe are: PS2, PS3, X360, Wii, PS.
#creating a table
platform_sales_jp = df[['platform', 'jp_sales']].groupby('platform').sum().sort_values(by='jp_sales', ascending=False).reset_index()
#calculating market share for every platform
platform_sales_jp['percentage_jp'] = (platform_sales_jp['jp_sales']/platform_sales_jp['jp_sales'].sum())*100
platform_sales_jp.head()
# Japan market view for platforms
platform_sales_jp=platform_sales_jp[platform_sales_jp['jp_sales']>=3]
sizes=platform_sales_jp.jp_sales.values.tolist()
labels = platform_sales_jp.apply(lambda x: str(x[0]) + "\n" + "mln $" + str(round(x[1])) , axis=1)
colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]
# Draw Plot
plt.figure(figsize=(15,9), dpi= 80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)
# Decorate
plt.title('Platforms Market Shares in Japan')
#plt.axis('off')
plt.show()
Five top selling platforms in Japan are: DS, PS, PS2, 3DS, PS3.
All in all we have the following leaders for the 3 regions: X360, PS2, Wii, PS3, DS, PS, 3DS. We will build a stacked bar chart to illustrate the difference in sales more clearly.
#creating a separate column with the names of leading platforms and "others"
top_platforms = ['X360', 'PS2', 'Wii', 'PS3', 'DS', 'PS', '3DS'] #list of leaders
df['platform_name'] = [x if x in top_platforms else 'other' for x in df['platform']]
df['platform_name'].value_counts()
#Creating a table with platform sales split through the markets for the stacked bar chart
platform_sales_all = df[['platform_name', 'na_sales', 'eu_sales', 'jp_sales']].groupby('platform_name').sum().reset_index()
platform_sales_all.head()
#pie chart for market shares
from plotly.subplots import make_subplots
import plotly.graph_objects as go
labels = ['X360', 'PS2', 'Wii', 'PS3', 'DS', 'PS', '3DS', 'other']
value1 = platform_sales_all['na_sales']
value2 = platform_sales_all['eu_sales']
value3 = platform_sales_all['jp_sales']
fig = make_subplots(2, 2, specs=[[{'type':'domain'}, {'type':'domain'}],
[{'type':'domain'}, {'type':'domain'}]],
subplot_titles=['NA', 'EU', 'JP'])
fig.add_trace(go.Pie(labels=labels, values=value1, scalegroup='one',
name="NA"), 1, 1)
fig.add_trace(go.Pie(labels=labels, values=value2, scalegroup='one',
name="EU"), 1, 2)
fig.add_trace(go.Pie(labels=labels, values=value3, scalegroup='one',
name="JP"), 2, 1)
fig.update_layout(title_text='Markets Shares of Platforms')
fig.show()
import plotly.express as px
fig = px.bar(platform_sales_all, x="platform_name", y=["na_sales", "eu_sales", "jp_sales"], title="Stacked bar chart of platform sales in 3 regions")
fig.show()
The ranking shows the leading gaming platforms on different world markets by gaming revenue in the measured period. In every region the platforms perform differently, but EU and NA have much more in common with each other than with the Japanese market. For example, 3DS has a significant market share in Japan, whereas in EU and NA it is far from being a leader; X360, the undisputed leader in NA and no. 3 in EU, has barely 1% of the market in Japan.
Shares of the market leaders vary from 15% to 8% in all 3 regions.
In absolute sales North America is the largest market and Japan is the smallest among these 3 regions. The sizes of the pie charts reflect these proportions.
Thus we can conclude that there is no absolute leader that would be the no. 1 platform in every region. Only PS2 managed to stay among the top 3 platforms in all 3 markets.
genres_sales_na = df[['genre', 'na_sales']].groupby('genre').sum().sort_values(by='na_sales', ascending=False).reset_index()
genres_sales_na
#calculating market share for every genre
genres_sales_na['percentage_na'] = (genres_sales_na['na_sales']/genres_sales_na['na_sales'].sum())*100
genres_sales_na.head()
# NA market view for genres
genres_sales_na=genres_sales_na[genres_sales_na['na_sales']!=0]
sizes=genres_sales_na.na_sales.values.tolist()
labels = genres_sales_na.apply(lambda x: str(x[0]) + "\n" + "mln $" + str(round(x[1])) , axis=1)
colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]
# Draw Plot
plt.figure(figsize=(15,9), dpi= 80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)
# Decorate
plt.title('Genres Market Shares in NA')
#plt.axis('off')
plt.show()
Five top selling genres for North America are: action, sports, shooter, misc and racing.
genres_sales_eu = df[['genre', 'eu_sales']].groupby('genre').sum().sort_values(by='eu_sales', ascending=False).reset_index()
genres_sales_eu
#calculating market share for every genre
genres_sales_eu['percentage_eu'] = (genres_sales_eu['eu_sales']/genres_sales_eu['eu_sales'].sum())*100
genres_sales_eu.head()
# eu market view for genres
genres_sales_eu=genres_sales_eu[genres_sales_eu['eu_sales']!=0]
sizes=genres_sales_eu.eu_sales.values.tolist()
labels = genres_sales_eu.apply(lambda x: str(x[0]) + "\n" + "mln $" + str(round(x[1])) , axis=1)
colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]
# Draw Plot
plt.figure(figsize=(15,9), dpi= 80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)
# Decorate
plt.title('Genres Market Shares in Europe')
#plt.axis('off')
plt.show()
genres_sales_jp = df[['genre', 'jp_sales']].groupby('genre').sum().sort_values(by='jp_sales', ascending=False).reset_index()
genres_sales_jp
#calculating market share for every genre
genres_sales_jp['percentage_jp'] = (genres_sales_jp['jp_sales']/genres_sales_jp['jp_sales'].sum())*100
genres_sales_jp.head()
# Japan market view for genres
genres_sales_jp=genres_sales_jp[genres_sales_jp['jp_sales']!=0]
sizes=genres_sales_jp.jp_sales.values.tolist()
labels = genres_sales_jp.apply(lambda x: str(x[0]) + "\n" + "mln $" + str(round(x[1])) , axis=1)
colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]
# Draw Plot
plt.figure(figsize=(15,9), dpi= 80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)
# Decorate
plt.title('Genres Market Shares in Japan')
#plt.axis('off')
plt.show()
Five top selling genres for Japan are: role-playing, action, sports, misc and platform.
All in all we have the following leaders for the 3 regions: Action, Sports, Shooter, Misc, Racing, Platform and Role-playing. We will build a stacked bar chart to see the difference in sales more clearly.
#creating a separate column with the names of leading genres and "others"
top_genres = ['Action', 'Sports', 'Shooter', 'Misc','Racing', 'Platform', 'Role-Playing'] #list of genre leaders
df['genre_name'] = [x if x in top_genres else 'other' for x in df['genre']]
df['genre_name'].value_counts()
#Creating a table with platform sales split through the markets for the stacked bar chart
genres_sales_all = df[['genre_name', 'na_sales', 'eu_sales', 'jp_sales']].groupby('genre_name').sum().reset_index()
genres_sales_all
#pie chart for market shares
from plotly.subplots import make_subplots
import plotly.graph_objects as go
#labels = ['Action', 'Sports', 'Shooter', 'Misc', 'Racing', 'Platform', 'Role-Playing', 'other']
labels = genres_sales_all['genre_name']
value1 = genres_sales_all['na_sales']
value2 = genres_sales_all['eu_sales']
value3 = genres_sales_all['jp_sales']
fig = make_subplots(2, 2, specs=[[{'type':'domain'}, {'type':'domain'}],
[{'type':'domain'}, {'type':'domain'}]],
subplot_titles=['NA', 'EU', 'JP'])
fig.add_trace(go.Pie(labels=labels, values=value1, scalegroup='one',
name="NA"), 1, 1)
fig.add_trace(go.Pie(labels=labels, values=value2, scalegroup='one',
name="EU"), 1, 2)
fig.add_trace(go.Pie(labels=labels, values=value3, scalegroup='one',
name="JP"), 2, 1)
fig.update_layout(title_text='Markets Shares of genres')
fig.show()
import plotly.express as px
fig = px.bar(genres_sales_all, x="genre_name", y=["na_sales", "eu_sales", "jp_sales"], title="Stacked bar chart of genres sales in 3 regions")
fig.show()
The charts show the leading genres across the 3 world markets. The leading genres for NA and EU are almost the same, with just a swap of "Racing" and "Misc" in 4th and 5th places. "Action" is the undisputed leader in both markets with a market share of over 20%.
In Japan almost a third of the games sold are "Role-Playing"; in Europe and North America this genre is in 6th place with a market share of 8%, although in absolute figures its sales in Japan and North America are almost the same.
"Misc" and "Platform" have similar market shares in all 3 markets (7-9%), and "Platform" even made it into the top 5 genres in Japan.
The dramatic difference in genre preferences is also illustrated by "Shooter" and "Racing": they are not at all popular in Japan, unlike in NA and EU.
# As ESRB ratings were established only in 1994, we will select only data from 1994 onwards.
df_after_1994 = df[df.year_of_release>=1994]
rating_sales_na = df_after_1994[['rating', 'na_sales']].groupby('rating').sum().sort_values(by='na_sales', ascending=False).reset_index()
rating_sales_na
#calculating share for every rating type
rating_sales_na['percentage_na'] = (rating_sales_na['na_sales']/rating_sales_na['na_sales'].sum())*100
rating_sales_na.head()
# rating shares in NA
rating_sales_na=rating_sales_na[rating_sales_na['na_sales']!=0]
sizes=rating_sales_na.na_sales.values.tolist()
labels = rating_sales_na.apply(lambda x: str(x[0]) + "\n" + "mln $" + str(round(x[1])) , axis=1)
colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]
# Draw Plot
plt.figure(figsize=(15,9), dpi= 80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)
# Decorate
plt.title('Rating Shares in North America')
#plt.axis('off')
plt.show()
The best-selling games have ratings E, T, M and E10 (the last of which is notable, as this category was established only 10 years later than the rest).
rating_sales_eu = df_after_1994[['rating', 'eu_sales']].groupby('rating').sum().sort_values(by='eu_sales', ascending=False).reset_index()
rating_sales_eu
#calculating market share for every rating
rating_sales_eu['percentage_eu'] = (rating_sales_eu['eu_sales']/rating_sales_eu['eu_sales'].sum())*100
rating_sales_eu.head()
# ratings for eu market
rating_sales_eu=rating_sales_eu[rating_sales_eu['eu_sales']>1]
sizes=rating_sales_eu.eu_sales.values.tolist()
labels = rating_sales_eu.apply(lambda x: str(x[0]) + "\n" + "mln $" + str(round(x[1])) , axis=1)
colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]
# Draw Plot
plt.figure(figsize=(15,9), dpi= 80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)
# Decorate
plt.title('Rating Market Shares in Europe')
#plt.axis('off')
plt.show()
rating_sales_jp = df_after_1994[['rating', 'jp_sales']].groupby('rating').sum().sort_values(by='jp_sales', ascending=False).reset_index()
rating_sales_jp
#calculating market share for every genre
rating_sales_jp['percentage_jp'] = (rating_sales_jp['jp_sales']/rating_sales_jp['jp_sales'].sum())*100
rating_sales_jp.head()
# Japan market view for ratings
rating_sales_jp=rating_sales_jp[rating_sales_jp['jp_sales']!=0]
sizes=rating_sales_jp.jp_sales.values.tolist()
labels = rating_sales_jp.apply(lambda x: str(x[0]) + "\n" + "mln $" + str(round(x[1])) , axis=1)
colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]
# Draw Plot
plt.figure(figsize=(15,9), dpi= 80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)
# Decorate
plt.title('Ratings and their Market Shares in Japan')
#plt.axis('off')
plt.show()
The most popular games in Japan have the E rating; games rated T, M and E10 are also among the market leaders.
#Creating a table with rating sales split through the markets for the stacked bar chart
rating_sales_all = df_after_1994[['rating', 'na_sales', 'eu_sales', 'jp_sales']].groupby('rating').sum().reset_index()
rating_sales_all
#pie chart for game ratings
labels = rating_sales_all['rating']
value1 = rating_sales_all['na_sales']
value2 = rating_sales_all['eu_sales']
value3 = rating_sales_all['jp_sales']
fig = make_subplots(2, 2, specs=[[{'type':'domain'}, {'type':'domain'}],
[{'type':'domain'}, {'type':'domain'}]],
subplot_titles=['NA', 'EU', 'JP'])
fig.add_trace(go.Pie(labels=labels, values=value1, scalegroup='one',
name="NA"), 1, 1)
fig.add_trace(go.Pie(labels=labels, values=value2, scalegroup='one',
name="EU"), 1, 2)
fig.add_trace(go.Pie(labels=labels, values=value3, scalegroup='one',
name="JP"), 2, 1)
fig.update_layout(title_text='Market Shares of ESRB Ratings')
fig.show()
#staked bar chart
fig = px.bar(rating_sales_all, x="rating", y=["na_sales", "eu_sales", "jp_sales"],\
title="Stacked bar chart of ESRB Ratings in 3 regions")
fig.show()
As far as ratings are concerned, European and American gamers have similar preferences. About 40% of them prefer to stay on the safe side and choose games with the E rating. Almost a quarter in each region chooses games rated for Teens and another quarter picks Mature-rated games. In EU the percentage of M-rated games is slightly higher.
In Japan the share of games with the Teen rating is considerably higher: 33% vs 24% in the two other regions. At the same time the share of Mature-rated games is 14%, about 10 percentage points lower than in the NA and EU regions.
The share of E10 games for children is similar across all 3 regions; in Japan it is insignificantly lower.
All in all, the E rating leads in all regions both in market share and in absolute figures.
Based on the performed data analysis we can create the following user profiles.
An average NA gamer prefers X360, PS2 or Wii with almost the same likelihood. Most likely the games will be of the Action, Sports or Shooter genre with the rating E ("Everyone").
A European gamer has a broader range of preferences: all top 5 platforms have an even chance of being chosen: PS2, PS3, X360, Wii or PS. There is a great probability that, like his or her American counterpart, he or she will pick Action, Sports or Shooter with the rating E.
An average Japanese gamer prefers DS, PS or PS2 most of all. And when it comes to the choice of games, most likely it will be role-playing or action with the rating E.
Stating the hypotheses (H0 has to state the unchanged result):
H0: average user ratings of the Xbox One and PC platforms are the same.
H1: average user ratings of the Xbox One platform differ from average user ratings of the PC platform.
#creating 2 dataframes for Xbox One and PC
df_XOne = df.query('platform == "XOne"')
df_XOne.shape
df_PC = df.query('platform == "PC"')
df_PC.shape
#Getting rid of outliers
#defining outliers with 3-sigma method for both platforms
#as we are calculating the std for a sample, ddof is set to 1
std_score_XOne = np.std(df_XOne['user_score'], ddof=1)
three_sigma_score_XOne_lower = round((df_XOne['user_score'].mean() - std_score_XOne*3),2)
#we do not use the lower fence as it would give us a negative value for user scores.
three_sigma_score_XOne_upper = round((df_XOne['user_score'].mean() + std_score_XOne*3),2)
std_score_PC = np.std(df_PC['user_score'], ddof=1)
three_sigma_score_PC_lower = round((df_PC['user_score'].mean() - std_score_PC*3),2)
#we do not use the lower fence as it would give us a negative value for user scores.
three_sigma_score_PC_upper = round((df_PC['user_score'].mean() + std_score_PC*3),2)
print('99.7% of games on the Xbox One platform have a rating from 0 to ', three_sigma_score_XOne_upper, '. \n99.7% of games on the PC platform have a rating from 0 to ', three_sigma_score_PC_upper, '.')
#setting dataframes without outliers
df_XOne_no_outliers = df_XOne.query('user_score<=@three_sigma_score_XOne_upper')
df_PC_no_outliers = df_PC.query('user_score<=@three_sigma_score_PC_upper')
#calculating the variance of the two samples to decide whether we can treat them as equal for the t-test.
variance_XOne = np.var(df_XOne_no_outliers['user_score'])
print('Variance for the Xbox One ratings sample is ', variance_XOne)
variance_PC = np.var(df_PC_no_outliers['user_score'])
print('Variance for the PC ratings sample is ', variance_PC)
Explanation of the method choice for hypothesis testing
As we have two samples of continuous data, drawn from approximately normally distributed populations with different variances, we will conduct Welch's t-test (in Python, a two-tailed t-test with 'equal_var = False').
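For reference, the statistic computed by st.ttest_ind with equal_var = False (Welch's test) is t = (x̄1 − x̄2) / sqrt(s1²/n1 + s2²/n2), where x̄ and s² are the sample means and variances and n1, n2 are the sample sizes; the degrees of freedom are adjusted with the Welch-Satterthwaite approximation.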
The critical statistical significance level will be set at 0.05, as it is a commonly accepted level and, since we are not conducting medical testing, higher accuracy is not required.
alpha = .05
# "equal_var = False" as the previous calculations showed that the 2 samples have different variances.
results = st.ttest_ind(df_XOne_no_outliers['user_score'], df_PC_no_outliers['user_score'], equal_var = False)
print('p-value:', results.pvalue)
#we are running a two-tailed test as we are checking whether the average ratings of the two platforms differ,
#no matter which one is bigger or smaller; ttest_ind returns the two-sided p-value, so it is not halved
if results.pvalue < alpha:
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis")
The data provides sufficient evidence, given the significance level we selected (5%), to reject the null hypothesis. Therefore, we can conclude that the average user score of games on the Xbox One platform and the average score on the PC platform are not the same (μ1 != μ2).
Stating the hypotheses (H0 states the unchanged result):
H0: average user ratings of the Action and Sports genres are the same.
H1: average user ratings of the Action and Sports genres are different.
#creating 2 db for Action and Sports
df_Action = df.query('genre == "Action"')
df_Action.shape
df_Sports = df.query('genre == "Sports"')
df_Sports.shape
#Getting rid of outliers
#defining outliers with 3-sigma method for both platforms
#as we are calculating the std for a sample, ddof is set to 1
std_score_Action = np.std(df_Action['user_score'], ddof=1)
three_sigma_score_Action_lower = round((df_Action['user_score'].mean() - std_score_Action*3),2)
#we do not use the lower fence as it will give us negative value for user scores.
three_sigma_score_Action_upper = round((df_Action['user_score'].mean() + std_score_Action*3),2)
std_score_Sports = np.std(df_Sports['user_score'], ddof=1)
three_sigma_score_Sports_lower = round((df_Sports['user_score'].mean() - std_score_Sports*3),2)
#we do not use the lower fence as it will give us negative value for user scores.
three_sigma_score_Sports_upper = round((df_Sports['user_score'].mean() + std_score_Sports*3),2)
print('99.7% of games of the Action genre have a rating from 0 to ', three_sigma_score_Action_upper, '. \n99.7% of games of the Sports genre have a rating from 0 to ', three_sigma_score_Sports_upper, '.')
#setting df without outliers
df_Action_no_outliers = df_Action.query('user_score<=@three_sigma_score_Action_upper')
df_Sports_no_outliers = df_Sports.query('user_score<=@three_sigma_score_Sports_upper')
#calculating the variance of the two samples to decide whether we can treat them as equal for the t-test.
variance_Action = np.var(df_Action_no_outliers['user_score'])
print('Variance for Action ratings sample is ', variance_Action)
variance_Sports = np.var(df_Sports_no_outliers['user_score'])
print('Variance for Sports rating sample is ', variance_Sports)
As we have two samples of continuous data, drawn from approximately normally distributed populations with close variances, we will conduct an unpaired Student's t-test to test the stated hypotheses (in Python, an unpaired two-tailed t-test with 'equal_var = True').
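For reference, with equal_var = True the statistic uses a pooled variance: t = (x̄1 − x̄2) / (s_p · sqrt(1/n1 + 1/n2)), where s_p² = ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2).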
The critical statistical significance level will again be set at 0.05, as it is a commonly accepted one.
alpha = .05
# "equal_var = True" as the previous calculations showed that the 2 samples have very close variances.
results = st.ttest_ind(df_Action_no_outliers['user_score'], df_Sports_no_outliers['user_score'], equal_var = True)
print('p-value:', results.pvalue)
#we are running a two-tailed test as we are checking whether the average ratings of the two genres differ,
#no matter which one is bigger or smaller; ttest_ind returns the two-sided p-value, so it is not halved
if results.pvalue < alpha:
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis")
The data provides sufficient evidence, given the significance level we selected (5%), to reject the null hypothesis. Therefore, we can conclude that the average user ratings of the Action and Sports genres are different (μ1 != μ2).
We have performed an analysis of the game platform data for the period 1980-2016.
As the data contained a lot of missing values in some of the columns, we partially restored them. Some rows containing missing values were dropped (but not more than 3% of the data).<br/>
We explored the dynamics of game releases and found some patterns: game releases boomed from 1994,
the period of 2008-2009 was the most prolific, but it was followed by a sharp decline in releases in the following years.
After a slight recovery in 2014 the number of games released per year stayed stable.
Having analyzed total sales on the 3 major markets through those years, we found that the following platforms were the most successful ones:
PS2, X360, PS3, Wii, DS and PS.
They were launched in different years and reigned for a while. The peak period lasted for 2-4 years and was followed by the decline of the platform. The decline of one platform corresponded to the launch of its successor. The average life cycle of a platform is 10 years.
By 2016 all the above platforms had ended or were ending their life cycle.
The relevant period for the platform analysis was chosen as 2001-2016. This period corresponds to about 1.5 platform life cycles and allows us to make a forecast for 2017.
The sales analysis of the 2001-2016 market leaders shows that the platforms perform differently in terms of total sales. We have different medians, minimum and maximum values, and their sales distributions are skewed in different ways.
To conclude, the sales statistics of the leading platforms differ significantly.
We also found that critic scores of the games have a very strong positive linear correlation with total sales.
The correlation with user ratings is usually weak, negative and not stable across platforms.
As for the distribution of games by genre, the charts show that the most profitable genres are action, sports and shooter. Puzzle and strategy have brought in the least amount of money.
Despite the fact that these 2 genres look like outsiders, they were very popular back in their day.
Puzzle had a very bright but short peak in 2007-2008, and Strategy emerged before 1990 and had several peaks during its cycle. The distributions for these 2 genres have thicker tails to the left.
The distributions of the market leaders have thicker tails to the right.
Analyzing platform behaviour on the 3 main markets, we can see that every region has its own favourites. Genre popularity on these markets makes it even more obvious that every market has its specifics due to different economic and cultural backgrounds.