The aim of the project is to identify patterns that determine whether a game succeeds or not. This allows to spot potential big winners and plan advertising campaigns.
The data referrs to 2016.

We will test the following hypotheses:

if the difference in average user ratings of the Xbox One and PC is statistically significant;
if the difference in average user ratings for the Action and Sports genres is statistically significant. Finally we will draw basic conclusions and develop some recommendations.

Table of contents:¶

Part 1. Data preprocessing

Part 2. Exploratory data analysis (EDA)
Defining platforms leading in sales
Platforms with the greatest total sales and distribution based on data for each year
Platforms that used to be popular but now have zero sales
Defining significant period for the data
Defining potentially profitable platforms
Box plot for the global sales of all games, broken down by platform
Correlation between reviews and sales
General distribution of games by genre
Creation of a user profile for each region
Top 5 genres by regions
Part 3. Hypotheses testing

General conclusion

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
from functools import reduce
from scipy import stats as st
import numpy
from matplotlib import pyplot
import plotly 
import squarify

import seaborn as sns

df = pd.read_csv('./games.csv', sep=',')

Part 1. Data preprocessing ¶

df.head()

df.sample(6)

# Map the lowering function to all column names
df.columns = map(str.lower, df.columns)
df.head()

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   name             16713 non-null  object 
 1   platform         16715 non-null  object 
 2   year_of_release  16446 non-null  float64
 3   genre            16713 non-null  object 
 4   na_sales         16715 non-null  float64
 5   eu_sales         16715 non-null  float64
 6   jp_sales         16715 non-null  float64
 7   other_sales      16715 non-null  float64
 8   critic_score     8137 non-null   float64
 9   user_score       10014 non-null  object 
 10  rating           9949 non-null   object 
dtypes: float64(6), object(5)
memory usage: 1.1+ MB

#stasistical summary of the data
df.describe()

#Checking data for zeros:
for i in df.columns:
    print(i, len(df[df[i]==0]))

name 0
platform 0
year_of_release 0
genre 0
na_sales 4508
eu_sales 5870
jp_sales 10514
other_sales 6601
critic_score 0
user_score 0
rating 0

#statistical summary for categorical variables
df.describe(include=['object'])

df['platform'].value_counts()

PS2     2161
DS      2151
PS3     1331
Wii     1320
X360    1262
PSP     1209
PS      1197
PC       974
XB       824
GBA      822
GC       556
3DS      520
PSV      430
PS4      392
N64      319
XOne     247
SNES     239
SAT      173
WiiU     147
2600     133
GB        98
NES       98
DC        52
GEN       29
NG        12
SCD        6
WS         6
3DO        3
TG16       2
GG         1
PCFX       1
Name: platform, dtype: int64

Our initial data base has 16715 entries and 11 columns.

Data represents a period of 1980-2016 with 11559 unique names of the games in 12 genres and 31 platforms.

The most frequintly mentioned game is 'Need for Speed', platform - PS2, genre - Action. We do not have missing values in 4 columns of sales data. 'critic_score', 'user_score', 'rating' have the highest percentage of missing values, over 40%.

As far as the zeros in data is concerned, we can see that some of the games were sold only in NA, only in Europe or only in Japan.

Missing values ¶

# defining columns with missing values 
missing_list=[]
for x in df:
    if len(df[df[x].isnull()])>0:
        missing_list.append(x)
print('The following columns have missing values: ', missing_list)

The following columns have missing values:  ['name', 'year_of_release', 'genre', 'critic_score', 'user_score', 'rating']

# calculating percentage of missing values in every column
missing_percentage=[]
for x in missing_list:
    missing_percentage.append([x,(len(df[df[x].isnull()])/len(df))])
missing_percentage=pd.DataFrame(missing_percentage, columns=['column','missing_values_share'])
missing_percentage.sort_values(by=['missing_values_share'], ascending=False)

df.notnull().sum()

name               16713
platform           16715
year_of_release    16446
genre              16713
na_sales           16715
eu_sales           16715
jp_sales           16715
other_sales        16715
critic_score        8137
user_score         10014
rating              9949
dtype: int64

df.isnull().any()

name                True
platform           False
year_of_release     True
genre               True
na_sales           False
eu_sales           False
jp_sales           False
other_sales        False
critic_score        True
user_score          True
rating              True
dtype: bool

#at least one missing value in any column
df[df.isnull().any(axis=1)]

#the subset of our data that has columns without at least 1 missing value
df[df.notnull().any(axis=1)]

#the subset of our data with columns that do not have missing values at all
df[df.notnull().all(axis=1)]

As we have only few missing values in the columns 'year_of_release', 'name' and 'genre' (all these lines together are <2% of the data base), we can drop them to go further with our calculations and this will not influence our study results. As 'name' and 'genre' have the same number of missing vaues, it might be enter data error.

df = df.dropna(axis=0, subset=['name', 'year_of_release', 'genre'])
df.head()

#changing value type to integer to get rid of decimals in the column 'year_of_release'
df['year_of_release'] = df['year_of_release'].astype(int)
df['year_of_release'].head()

0    2006
1    1985
2    2008
3    2009
4    1996
Name: year_of_release, dtype: int32

df.groupby(['year_of_release']).size().plot(
    kind='bar',
    grid=True,
    title='Number of data without NaNs',
    figsize=(10,5)
)
plt.show()

As we can see most of the missing data fall into period 1980-1994, and also there is some fluctuation across the entire sample period. To mitigate the effect of missing data, we will use grouping and transform method where it is applicable.

We have over 40% of missing values in 'critic_score' and 'user_score'.

We will fill Nans for multiplatform games based on the scores of these games on another platforms.

df['critic_score'] = df['critic_score'].fillna(df.groupby(['name'])['critic_score'].transform('first'))

crit_score000 = df['critic_score'].isna().sum()/len(df['critic_score']) 
print('Percentage of missing values is {: .2%}'.format(crit_score000))

Percentage of missing values is  45.74%

The percentage of missing values has slightly dropped.

'user_score' column
As the data type here is object we cannot apply the same method for filling nans with median.

df['user_score'].value_counts()

tbd    2376
7.8     322
8       285
8.2     276
8.3     252
       ... 
0.3       2
1         2
0.5       2
0         1
9.7       1
Name: user_score, Length: 96, dtype: int64

#checking the data types in this column
df.user_score.apply(type).unique()

array([<class 'str'>, <class 'float'>], dtype=object)

Supposably 'tbd' has string type and the rest of the data here is float. We will check the percentage of TBDs and the rest values.

#defining share of every value
df['user_score'].value_counts()/len(df)

tbd    0.144490
7.8    0.019582
8      0.017332
8.2    0.016784
8.3    0.015325
         ...   
0.3    0.000122
1      0.000122
0.5    0.000122
0      0.000061
9.7    0.000061
Name: user_score, Length: 96, dtype: float64

We can see that there is a 15% of data named TBD. We will have a look at what info contain lines with TBD value.

df[df.user_score=='tbd']

df[df.user_score=='tbd']['year_of_release'].value_counts()

2009    405
2010    331
2008    326
2011    217
2007    192
2002    192
2006    124
2005    121
2004    107
2003     95
2001     82
2000     43
2015     38
2016     34
2012     24
2014     21
2013     15
1999      8
1997      1
Name: year_of_release, dtype: int64

There is no clear pattern in this data, it refers to different years, platform, genres, etc. Obviously that the data cannot be restored. We will replace TBD with 'nan' to restore it further for the muliplatform games. Afterwords we will be able to change the data type.

#replasing tbd with nan
df['user_score'] = df['user_score'].replace('tbd', np.nan)

user_score_nan = df['user_score'].isna().sum()/len(df['user_score'])
print('Percentage of missing values is {: .2%}'.format(user_score_nan))

Percentage of missing values is  54.62%

The persantage of nans has risen to 54%.

Now we are filling Nans for multiplatform games based on the user's scores of these games on another platforms.

df['user_score'] = df['user_score'].fillna(df.groupby(['name'])['user_score'].transform('first'))
user_score000 = df['user_score'].isna().sum()/len(df['user_score']) 
print('Percentage of missing values is {: .2%}'.format(user_score000))

Percentage of missing values is  48.16%

#filling nans with 'o' value to change the data type later on. 
df['user_score'] = df['user_score'].fillna(0)

#changing the data type to round up the score and reduce the number of categories. 
df['user_score'] = df['user_score'].astype(float)
df['user_score'].value_counts()

0.0    7921
7.8     374
8.0     340
8.2     304
8.3     281
       ... 
0.6       2
0.9       2
2.3       2
9.7       2
9.6       2
Name: user_score, Length: 95, dtype: int64

'rating' column

df['rating'].value_counts()

E       3921
T       2905
M       1536
E10+    1393
EC         8
K-A        3
AO         1
RP         1
Name: rating, dtype: int64

df['rating'].unique()

array(['E', nan, 'M', 'T', 'E10+', 'K-A', 'AO', 'EC', 'RP'], dtype=object)

We have over 40% of missing values here. We will leave the data as is.

#The data after treating the missing values
df.groupby(['year_of_release']).size().plot(
    kind='bar',
    title='Number of data without NaNs',
    figsize=(10,5)
)
plt.show()

Total sales¶

#calculating the sales across the different regions
df['total_sales']= df[['na_sales', 'eu_sales', 'jp_sales', 'other_sales']].sum(axis=1)
grouped_sales = df.groupby('year_of_release')['total_sales'].max().sort_values(ascending=False).reset_index()
grouped_sales

Sales of games saw a steady growth though the years from USD 3.5 mln in 1985 to USD 82.54 mnl in 2006. Obviously the year 2006 was the most lucrative one.

Part 2. Data Analysis ¶

Games release by years¶

#number of games released yearly
release_early = df.groupby('year_of_release')['name'].count().sort_values().reset_index()
release_early.head()

fig,ax=plt.subplots(figsize=(13,10))
ax.vlines(x=release_early.year_of_release, ymin=0, ymax=release_early.name, color='purple', alpha=0.7, linewidth=2)
ax.scatter(x=release_early.year_of_release, y=release_early.name, s=75, color='lightgreen', alpha=0.7)


ax.set_title('Lolipop Chart for Released Games', fontdict={'size':16})
ax.set_ylabel('Nuber of game releases')
ax.set_xticks(release_early.year_of_release)
ax.set_xticklabels(release_early.year_of_release,rotation=90, fontdict={'horizontalalignment':'right','size':14})

for  row in release_early.itertuples():
    ax.text(row.year_of_release, row.name+30,s=round(row.name,2), rotation=90, fontdict={'horizontalalignment':'center','size':10})

Game releases boomed from 1994 and reached its peak at slightly over 1400 games per year in 2008 - 2009, during the following years number of releases decreased. The strongest fall was in 2012 (from over 1100 to slightly less than 700) and since then number of releases stayed stable between 500 to 600 releases per year. Presumedly data from the first years of game production (1985-1994) is fractional and till year 2001 it is not significat due to the minor amount of games released.

Defining platforms leading in sales ¶

Sorting out the platforms with biggest amount of total sales.

platform_sales = df[['platform', 'total_sales']].groupby('platform').sum().sort_values(by='total_sales', ascending=False).reset_index()
platform_sales

#visualizing total sales per platform
(df.groupby('platform')
 .agg({'total_sales': sum})
 .plot(y='total_sales', kind='bar', grid=True, figsize=(10,5), cmap='PiYG')
)
plt.show()

We will define how successful is the platform in relation to the mean value of sales. To to that we will calculate z-score for the distribution as a numerical measurement that describes a value's relationship to the mean of a group of values. If a Z-score is 0, it indicates that the score of total saleas of a platform is identical to the mean score.

#counting z-score for sales to define distribution 
platform_sales['sales_z']=(platform_sales['total_sales']-platform_sales['total_sales'].mean())/platform_sales['total_sales'].std()
platform_sales.head()

#distinguishing z-score by colors: red for negative values, blue for positive ones 
platform_sales['colors'] = ['red' if x<0 else 'purple' for x in platform_sales['sales_z']]
platform_sales['colors']
 
plt.figure(figsize=(14,10))
plt.hlines(y=platform_sales.platform, xmin=0, xmax=platform_sales.sales_z, colors=platform_sales.colors, alpha=0.4, linewidth=10)

<matplotlib.collections.LineCollection at 0xb23ce20>

The graph clearly shows that the most successful platforms with sales above average are to the right (in purple) and less successful are to the left (in red) of the chart.

Platforms with the greatest total sales and distribution based on data for each year ¶

As we can see the last graph and table the top 6 leading platforms based on the total sales are PS2, X360, PS3, Wii, DS and PS. The rest are far behind the leaders. All the above with total sales over USD 700 mln. Distribution will be built for these platforms.

#nlargest method for defining top 6 successful games
df.nlargest(6,['total_sales'])

Analysis of the most successful games prove our choise of the most successful platforms.

#nsmallest method for defining 5 least popular games
df.nsmallest(5,['total_sales'])

# Drawing Plot of distributoion of sales for leading platforms
plt.figure(figsize=(16,10), dpi= 80)
sns.kdeplot(df.loc[df['platform'] == "PS2", 'year_of_release'], shade=True, color="g", label="PS2", alpha=.7)
sns.kdeplot(df.loc[df['platform'] == "X360", 'year_of_release'], shade=True, color="deeppink", label="X360", alpha=.7)
sns.kdeplot(df.loc[df['platform'] == "PS3", 'year_of_release'], shade=True, color="dodgerblue", label="PS3", alpha=.7)
sns.kdeplot(df.loc[df['platform'] == "Wii", 'year_of_release'], shade=True, color="red", label="Wii", alpha=.7)
sns.kdeplot(df.loc[df['platform'] == "DS", 'year_of_release'], shade=True, color="orange", label="DS", alpha=.7)
sns.kdeplot(df.loc[df['platform'] == "PS", 'year_of_release'], shade=True, color="yellow", label="PS", alpha=.7)

# Decoration
plt.title('Distribution of total sales per year for 6 top platforms', fontsize=22)
plt.legend()
plt.show()

Below we will illustrate this 3-diamentional data with a heatmap chart.

df_select_platforms = df.query('platform == "PS2" or platform == "X360" or platform == "PS3" or platform == "Wii" or platform == "DS" or platform == "PS"')
df_heatmap_platforms=pd.pivot_table(df_select_platforms, index='year_of_release', columns='platform', values='total_sales', aggfunc=sum, fill_value=0)
df_heatmap_platforms

plt.figure(figsize=(13,9))
sns.heatmap(df_heatmap_platforms.T, cmap="RdBu_r")

<matplotlib.axes._subplots.AxesSubplot at 0xb27ae50>

The heatmap allows us to have another angle of view on the data and get deeper comprehension of the processes.

Buding distribution chart for every platform one by one separately

#1st platform
top_6_platforms_PS2 = df.query('platform=="PS2"').groupby('year_of_release').sum().sort_values(by='total_sales').reset_index()
top_6_platforms_PS2.head(3)

#chart
sns.distplot(top_6_platforms_PS2['year_of_release'], label = 'PS2', color = 'g', bins=12)

C:\Users\HP\anaconda3\lib\site-packages\seaborn\distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)

<matplotlib.axes._subplots.AxesSubplot at 0xb2f3118>

#2nd platform
top_6_platforms_X360 = df.query('platform=="X360"').groupby('year_of_release').sum().sort_values(by='total_sales').reset_index()
sns.distplot(top_6_platforms_X360['year_of_release'], label = 'X360', color = 'deeppink', bins=10)

C:\Users\HP\anaconda3\lib\site-packages\seaborn\distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)

<matplotlib.axes._subplots.AxesSubplot at 0x91bf448>

#3rd platform
top_6_platforms_PS3 = df.query('platform=="PS3"').groupby('year_of_release').sum().sort_values(by='total_sales').reset_index()
sns.distplot(top_6_platforms_PS3['year_of_release'], label = 'PS3', color = 'b')

C:\Users\HP\anaconda3\lib\site-packages\seaborn\distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)

<matplotlib.axes._subplots.AxesSubplot at 0xb2a3f58>

#4th platform
top_6_platforms_Wii = df.query('platform=="Wii"').groupby('year_of_release').sum().sort_values(by='total_sales').reset_index()
sns.distplot(top_6_platforms_Wii['year_of_release'], label = 'Wii', color = 'r')

C:\Users\HP\anaconda3\lib\site-packages\seaborn\distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)

<matplotlib.axes._subplots.AxesSubplot at 0xab6ff10>

#5th platform
top_6_platforms_DS = df.query('platform=="DS"').groupby('year_of_release').sum().sort_values(by='total_sales').reset_index()
sns.distplot(top_6_platforms_DS['year_of_release'], label = 'DS', color = 'orange', bins=10)

C:\Users\HP\anaconda3\lib\site-packages\seaborn\distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)

<matplotlib.axes._subplots.AxesSubplot at 0x9dbc7a8>

#surching for outlier
DS_outlier = df.query('platform=="DS" and year_of_release <1990')
DS_outlier

A quick search gives us information, that development on the Nintendo DS began around mid-2002. Thus we can drop the line with game name "Shogi DS" as it is obviously not connected to the popular game platform.

df=df.drop(index=15957)

top_6_platforms_DS_no_outlier = df.query('platform=="DS"').groupby('year_of_release').sum().sort_values(by='total_sales').reset_index()
sns.distplot(top_6_platforms_DS_no_outlier['year_of_release'], label = 'DS', color = 'orange', bins=8)

C:\Users\HP\anaconda3\lib\site-packages\seaborn\distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)

<matplotlib.axes._subplots.AxesSubplot at 0x9e0e9b8>

plt.figure(figsize=(6,4), dpi= 80)

sns.kdeplot(df.loc[df['platform'] == "DS", 'year_of_release'], shade=True, color="orange", label="DS", alpha=.7)

# Decoration
plt.title('Distribution of total sales per year for SD platform', fontsize=22)
plt.legend()
plt.show()

#6th platform
top_6_platforms_PS = df.query('platform=="PS"').groupby('year_of_release').sum().sort_values(by='total_sales').reset_index()
sns.distplot(top_6_platforms_PS['year_of_release'], label = 'PS', color = 'yellow', bins=4)

C:\Users\HP\anaconda3\lib\site-packages\seaborn\distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)

<matplotlib.axes._subplots.AxesSubplot at 0xac91688>

The charts illustrate that every platform had their life cycle that lasted aproximately 10 years. Some patforms have several peaks in the middle corresponding to the launch of new games.

We are going to build a treemap to get a general picture of distribution of the market.

df_tree = df[['platform', 'total_sales']].groupby('platform').sum().sort_values(by='total_sales', ascending=False).reset_index()
df_tree=df_tree[df_tree['total_sales']>=15]

sizes=df_tree.total_sales.values.tolist()
labels = df_tree.apply(lambda x: str(x[0]) + "\n" + "mln $" + str(round(x[1])) , axis=1)


colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]

# Draw Plot
plt.figure(figsize=(15,9), dpi= 80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)

# Decorate
plt.title('Treemap of Platforms Total Sales')
#plt.axis('off')
plt.show()

Our 6 markets leaders are all in shades of purple to the left and bottom part of the chart.

Find platforms that used to be popular but now have zero sales. How long does it generally take for new platforms to appear and old ones to fade? ¶

All the most popular platforms with the bigest amount of sales by year 2016 demonstrate drop of sales and according to the graphs their life cycle or has ended or heading to an end. Judging by the leaders of the market the period of time from launching the platform to its end is about 9-10 years.

PS platform, the first of the leaders, emerged in 1993 and completely stopped selling by 2004. The top sells were reached by 1998. Decline of sales started right after 1998, that corresponds with the launch of a new future leader (PS2) and boom of its sells . All in all it took 10 years for a platform to get from the first to the last one.

PS2, the top selling platform, did not have such sharp peaks as its predecessor, but it took them only about 3-4 years to get to the top of their sales and then for 4 years their sales were steady and high without any declines. PS2 had never reached the highest peaks of PS sales but in lenght of its stable peak it had no rivals. Decline in sales started in 2005 as 4 other future market leaders were gaining popularity. As to Wikipedia ["On November 29, 2005, the PS2 became the fastest game console to reach 100 million units shipped, accomplishing the feat within 5 years and 9 months from its launch. This achievement occurred faster than its predecessor, the PlayStation, which took 9 years and 6 months since launch" to reach the same figure."](https://en.wikipedia.org/wiki/PlayStation#PlayStation). PS production ended in 2012 thus the platform was active for 12 years. By year 2015 it still kept the rank of the best selling console [of all the time](https://www.theguardian.com/technology/2013/jan/04/playstation-2-manufacture-ends-years?INTCMP=SRCH)

DS released globally across 2004 - 2005. In 2014 production ended, it was added to the Wii

PS3 was released 2006,competed with X360 and Wii launched a year before. It had never reached the amount of sales its main rivals. Shipments of new units ended in 2016. The platform lived for 10 years.

Wii was launched in 2006. From 2013, Internet services were gradually discontinued; and Wii could no longer be purchased after 2018. The platform lived for 12 years.

X360 was launched across 2005–2006 and in 2016 its production ended. 10 years of sales.

The leading platforms had completed their launch in the key markets within one year. After growth of sales they reached their peak and after emrging new sucessful rivals the old ones headed to decline that lasted about 4-5 years.

We are going to build a treemap to investigate life cycle of the platforms.

Defining significant period for the data. ¶

Taking into account quality of data and amount of sales in the first years of game production we should not take into account data before year 2001. As life cycle of a platform is 10 years on average, data starting from 2001 also will sute the purposes of analysis because this period includes 1.5 of a lifecycle time. This will also be sufficient for building prognosis for 2017.

Which platforms are growing or shrinking. Several potentially profitable platforms. ¶

Building a shiftmap for platforms. Taking into account only the ones released after 2000.

df_after_2000 = df[df.year_of_release>2000]
df_after_2000.head()

df1=pd.pivot_table(df_after_2000, index='year_of_release', columns='platform', values='total_sales', aggfunc=sum, fill_value=0)
df1

df1.shift(+1).tail()

#calculating dynamic of sales
sales_dynamics = df1-df1.shift(+1)
sales_dynamics.tail()

#illustrating the dynamics
plt.figure(figsize=(13,9))
sns.heatmap(sales_dynamics.T, cmap="RdBu_r")
plt.show()

AS to the above calculations all the platforms have drop in sales in 2016 in comparison to 2015. Life cycle of the most platforms has ended by 2016. Either the companies have already ended production of new hardware or their sales are dropping. Based on the current data we cannot choose any potentially profitable platforms.

Box plot for the global sales of all games, broken down by platform ¶

grouped = df_after_2000.groupby(['platform', 'year_of_release'])['total_sales'].sum().reset_index()
grouped

ordered=grouped.groupby(['platform'])['total_sales'].sum().sort_values().reset_index()['platform']
ordered

0       WS
1       DC
2      N64
3       GB
4       PS
5      PSV
6     WiiU
7     XOne
8       GC
9       PC
10      XB
11     3DS
12     PSP
13     GBA
14     PS4
15      DS
16     Wii
17     PS3
18    X360
19     PS2
Name: platform, dtype: object

plt.figure(figsize=(13,10))
sns.boxplot(x='platform', y='total_sales', data=grouped, order=ordered)

<matplotlib.axes._subplots.AxesSubplot at 0xafac598>

The boxplot chart clearly illustrates that platforms perform dramatically different as to their total sales. To the right we have almost the same group of 6 market leaders. As we have chosen only the platformes releases after 2000 year, PS platform has not made it to the group of leaders (most of its games were released well before), but PS4 has.
We will look into leaders' performance in a more detailed way.

#picking leaders
df_after_2000_leaders = df_after_2000[df_after_2000.platform.isin(("PS2", "X360", "PS3", "Wii", "DS"))]
df_after_2000_leaders.head()

grouped_leaders = df_after_2000_leaders.groupby(['platform', 'year_of_release'])['total_sales'].sum().reset_index()
grouped_leaders.head()

import plotly.express as px

fig = px.box(grouped_leaders, x="platform", y="total_sales", color="platform")
fig.update_traces(quartilemethod="exclusive") 
fig.show()

Median values for leaders also vary greatly: from USD 59.65 mln for Wii to 103.42 mln for PS2. </br>
PS2 is a leader with the highest value of median, biggest maximum value; the middle 50% of sales is between 26.4 mln and 184.31 mln.

Wii platform has the lowest value of median among the leaders, distribution of its sales has a long upper wisker and the distribution is scewed to the right. Middle 50% of its sales were between 3.75 and 152.77 mln.

The least succesful among these leaders is DS platform with median value of 102,28 mln, maximum value of 146.94 mln and middle 50% between 17.27 mln and 130.14 mln.
In general the difference in sales is significant even between the leaders.

How user and professional reviews affect sales for a popular platform. Correlation between reviews and sales. ¶

We will look at data for PS2 as the most popular platform as for 2015.

review_correlation_PS2 = df.query('platform=="PS"').groupby('user_score').sum().sort_values(by='total_sales').reset_index()
review_correlation_PS2.head()

grouped_correlation_PS2 = review_correlation_PS2.groupby(['user_score', 'critic_score'])['total_sales'].sum().reset_index()
grouped_correlation_PS2.head()

#building scatter plots in pairs for the chosen parameters
pd.plotting.scatter_matrix(grouped_correlation_PS2, figsize=(11, 11)) 
plt.show()

corrMatrix = grouped_correlation_PS2.corr()
sns.heatmap(corrMatrix, annot=True)
plt.show()

Both the scatterplot and the matrix show that there is a strong linear connection between critic scors and total sales of the platform games. User score and total sales have weak negative relationship.

Comparison of the sales for other platforms.¶

#we will calculate correlation for the least successful of the market leaders, DS
review_correlation_DS = df.query('platform=="DS"').groupby('user_score').sum().sort_values(by='total_sales').reset_index()
grouped_correlation_DS = review_correlation_DS.groupby(['user_score', 'critic_score'])['total_sales'].sum().reset_index()
corrMatrix = grouped_correlation_DS.corr()
sns.heatmap(corrMatrix, annot=True)
plt.show()

The second matrix shows the same pattern: here is a strong positive correlation between critic scors and total sales of the platform games and week negative relationship between sales and users scores. And there is almost no correlation between users and critic scores.

#corr for a middle outsider, WiiU
review_correlation_WiiU = df.query('platform=="WiiU"').groupby('user_score').sum().sort_values(by='total_sales').reset_index()
grouped_correlation_WiiU = review_correlation_WiiU.groupby(['user_score', 'critic_score'])['total_sales'].sum().reset_index()
corrMatrix = grouped_correlation_WiiU.corr()
sns.heatmap(corrMatrix, annot=True)
plt.show()

Here we can see that correlation between critic scors and total sales of the platform games is much weaker, almost like user score. Users and critic scores have strong degree of relationship here.

#corr for one of the least successful platform, DC
review_correlation_DC = df.query('platform=="DC"').groupby('user_score').sum().sort_values(by='total_sales').reset_index()
grouped_correlation_DC = review_correlation_DC.groupby(['user_score', 'critic_score'])['total_sales'].sum().reset_index()
corrMatrix = grouped_correlation_DC.corr()
sns.heatmap(corrMatrix, annot=True)
plt.show()

For an outsider correlation between critic scors and total sales is weak, user scores have strong negative correlation with the sales. Users and critic scores also have strong positive correlation.

All in all for all groups of platforms we can see that linear correlation between the 3 factors is not stable through the platforms.

General distribution of games by genre ¶

df['genre'].value_counts()

Action          3306
Sports          2306
Misc            1721
Role-Playing    1481
Shooter         1296
Adventure       1293
Racing          1226
Platform         878
Simulation       857
Fighting         837
Strategy         673
Puzzle           569
Name: genre, dtype: int64

# As genre is a  categorical parameter, distribution will be presented by a bar chart.
(df.groupby('genre')
 .agg({'genre': ['count']})
 .plot(y='genre', kind='bar', grid=True, figsize=(10,5), cmap='PiYG')
)
plt.show()

# Distribution of total sales by genre 
(df.groupby('genre')
 .agg({'total_sales': sum})
 .plot(y='total_sales', kind='bar', grid=True, figsize=(10,5), cmap='PiYG')
)
plt.show()

grouped_genres = df.groupby(['genre', 'year_of_release'])['total_sales'].sum().reset_index()
grouped_genres

# General market view from the total sales point of view 
df_genres_tree = df[['genre', 'total_sales']].groupby('genre').sum().sort_values(by='total_sales', ascending=False).reset_index()
df_genres_tree=df_genres_tree[df_genres_tree['total_sales']!=0]

sizes=df_genres_tree.total_sales.values.tolist()
labels = df_genres_tree.apply(lambda x: str(x[0]) + "\n" + "mln $" + str(round(x[1])) , axis=1)


colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]

# Draw Plot
plt.figure(figsize=(15,9), dpi= 80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)

# Decorate
plt.title('Treemap of Genres Total Sales')
plt.show()

Judging by total sales though all the years of analysis the most lucrative genres are: action, sports and and shooter. Puzzle and strategy have brought the least amount of money.

#Distribution of genres through the years
# Drawing Plot of distributoion for leading platforms
plt.figure(figsize=(16,10), dpi= 80)
sns.set_palette("Set2")

sns.kdeplot(df.loc[df['genre'] == "Action", 'year_of_release'], shade=True, label="Action", alpha=.7)
sns.kdeplot(df.loc[df['genre'] == "Sports", 'year_of_release'], shade=True, label="Sports", alpha=.7)
sns.kdeplot(df.loc[df['genre'] == "Shooter", 'year_of_release'], shade=True, label="Shooter", alpha=.7)
sns.kdeplot(df.loc[df['genre'] == "Racing", 'year_of_release'], shade=True, label="Racing", alpha=.7)
sns.kdeplot(df.loc[df['genre'] == "Simulation", 'year_of_release'], shade=True, label="Simulation", alpha=.7)
sns.kdeplot(df.loc[df['genre'] == "Fighting", 'year_of_release'], shade=True, label="Fighting", alpha=.7)
sns.kdeplot(df.loc[df['genre'] == "Platform", 'year_of_release'], shade=True, label="Platform", alpha=.7)
sns.kdeplot(df.loc[df['genre'] == "Strategy", 'year_of_release'], shade=True, label="Strategy", alpha=.7)
sns.kdeplot(df.loc[df['genre'] == "Puzzle", 'year_of_release'], shade=True, label="Puzzle", alpha=.5)
# Decoration
plt.title('Distribution of Games by Genre through 1985-2016', fontsize=22)
plt.legend()
plt.show()

#Having a look at the 3 most profitable and 2 the least profitable genres

plt.figure(figsize=(16,10), dpi= 80)
sns.set_palette("Set2")

sns.kdeplot(df.loc[df['genre'] == "Action", 'year_of_release'], shade=True, label="Action", alpha=.7)
sns.kdeplot(df.loc[df['genre'] == "Sports", 'year_of_release'], shade=True, label="Sports", alpha=.7)
sns.kdeplot(df.loc[df['genre'] == "Shooter", 'year_of_release'], shade=True, label="Shooter", alpha=.7)

sns.kdeplot(df.loc[df['genre'] == "Strategy", 'year_of_release'], shade=True, label="Strategy", alpha=.7)
sns.kdeplot(df.loc[df['genre'] == "Puzzle", 'year_of_release'], shade=True, label="Puzzle", alpha=.7)
# Decoration
plt.title('Distribution of Games by Genre: sales leaders and losers', fontsize=22)
plt.legend()
plt.show()

Below is presentation of the data on the heatmap.

df_select_genre = df.query('genre == "Action" or genre == "Sports" or genre == "Shooter" or genre == "Strategy" or genre == "Puzzle"')
df_heatmap_genres=pd.pivot_table(df_select_genre, index='year_of_release', columns='genre', values='total_sales', aggfunc=sum, fill_value=0)
df_heatmap_genres

plt.figure(figsize=(13,9))
sns.heatmap(df_heatmap_genres.T, cmap="RdBu_r")
plt.show()

All of the genres are not new, almost all of them emerged in the early 1990th.</br>
If we look at the distribution of the genres throught the years we will see that strategy games emerged at the very beginning of 1990th together with fighting. Puzzles were on the peak of their popularity in 2007-2008 but then quickly lost their leading position. Misc and aventure stepped in by 2010.

Adventure slowly but steadily gain popularity starting from 1995. This genre reached its peak by 2008 as well. Although it was almost twise less popular, the number of its fans stayed stable till the year of analysis, 2016.

As to the sales leader, Action, it boomed by 2008 and keept this level of popularity by 2012. That corresponds to the general market trend with games boom in 2008-2009 and market fall in 2012. After a short decrease it seems to be gaining popularity again and in 2015 it is an undisputable leader among other genres.

Shooter genre has a longer history of success: first peak in 1996 with a slight decrease in 1998, second impessive peak in 2004 that lasted till 2009. And now, in 2015 this genre is the 2nd most popular one despite its steady fall since 2009-2010.

Sports genre had two peaks of poopularity: 2001-2003 and 2008-2009, by 2015 the interest in this genre seems to get a fresh impulse.
All of the market leaders' distriibutions have thicker tails to the right.

Creation of a user profile for each region ¶

Top 5 platforms by regions¶

NA sales¶

platform_sales_na = df[['platform', 'na_sales']].groupby('platform').sum().sort_values(by='na_sales', ascending=False).reset_index()
platform_sales_na

#calculating market share for every platform
platform_sales_na['percentage_na'] = (platform_sales_na['na_sales']/platform_sales_na['na_sales'].sum())*100

platform_sales_na.head()

# NA market view for platforms 
platform_sales_na=platform_sales_na[platform_sales_na['na_sales']>=6]

sizes=platform_sales_na.na_sales.values.tolist()
labels = platform_sales_na.apply(lambda x: str(x[0]) + "\n" + "mln $" + str(round(x[1])) , axis=1)


colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]

# Draw Plot
plt.figure(figsize=(15,9), dpi= 80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)

# Decorate
plt.title('Platforms Market Shares in NA')
#plt.axis('off')
plt.show()

Five top selling platforms for North America are: X360, PS2, Wii, PS3, DS.

EU sales¶

#creating a table
platform_sales_eu = df[['platform', 'eu_sales']].groupby('platform').sum().sort_values(by='eu_sales', ascending=False).reset_index()
#calculating market share for every platform
platform_sales_eu['percentage_eu'] = (platform_sales_eu['eu_sales']/platform_sales_eu['eu_sales'].sum())*100

platform_sales_eu.head()

# EU market view for platforms 
platform_sales_eu=platform_sales_eu[platform_sales_eu['eu_sales']>=6]

sizes=platform_sales_eu.eu_sales.values.tolist()
labels = platform_sales_eu.apply(lambda x: str(x[0]) + "\n" + "mln $" + str(round(x[1])) , axis=1)


colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]

# Draw Plot
plt.figure(figsize=(15,9), dpi= 80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)

# Decorate
plt.title('Platforms Market Shares in EU')
#plt.axis('off')
plt.show()

Five top selling platforms for Europe are: PS2, PS3, X360, Wii, PS.

Japan sales¶

#creating a table
platform_sales_jp = df[['platform', 'jp_sales']].groupby('platform').sum().sort_values(by='jp_sales', ascending=False).reset_index()
#calculating market share for every platform
platform_sales_jp['percentage_jp'] = (platform_sales_jp['jp_sales']/platform_sales_jp['jp_sales'].sum())*100

platform_sales_jp.head()

# Japan market view for platforms 
platform_sales_jp=platform_sales_jp[platform_sales_jp['jp_sales']>=3]

sizes=platform_sales_jp.jp_sales.values.tolist()
labels = platform_sales_jp.apply(lambda x: str(x[0]) + "\n" + "mln $" + str(round(x[1])) , axis=1)


colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]

# Draw Plot
plt.figure(figsize=(15,9), dpi= 80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)

# Decorate
plt.title('Platforms Market Shares in Japan')
#plt.axis('off')
plt.show()

Five top selling platforms in Japan are: DS, PS, PS2, 3DS, PS3.

All in all we have the following leaders for the 3 regions: X360, PS2, Wii, PS3, DS, PS, 3DS. We will build a stacked bar chart to illustrate the difference in sales more clearly.

#creating a separate column with the names of leading platforms and "others" 
top_platforms = ['X360', 'PS2', 'Wii', 'PS3', 'DS', 'PS', '3DS'] #list of leaders 
df['platform_name'] = [x if x in top_platforms else 'other' for x in df['platform']]
df['platform_name'].value_counts()

other    6670
PS2      2127
DS       2120
PS3      1306
Wii      1286
X360     1232
PS       1190
3DS       512
Name: platform_name, dtype: int64

#Creating a table with platform sales split through the markets for the stacked bar chart
platform_sales_all = df[['platform_name', 'na_sales', 'eu_sales', 'jp_sales']].groupby('platform_name').sum().reset_index()
platform_sales_all.head()

#pie chart for market shares
from plotly.subplots import make_subplots
import plotly.graph_objects as go

labels = ['X360', 'PS2', 'Wii', 'PS3', 'DS', 'PS', '3DS', 'other']
value1 = platform_sales_all['na_sales']
value2 = platform_sales_all['eu_sales']
value3 = platform_sales_all['jp_sales']
fig = make_subplots(2, 2, specs=[[{'type':'domain'},    {'type':'domain'}],
                           [{'type':'domain'}, {'type':'domain'}]], 
                    subplot_titles=['NA', 'EU', 'JP'])
fig.add_trace(go.Pie(labels=labels, values=value1, scalegroup='one',
                     name="NA"), 1, 1)
fig.add_trace(go.Pie(labels=labels, values=value2, scalegroup='one',
                     name="EU"), 1, 2)
fig.add_trace(go.Pie(labels=labels, values=value3, scalegroup='one',
                     name="JP"), 2, 1)

fig.update_layout(title_text='Markets Shares of Platforms')
fig.show()

import plotly.express as px

fig = px.bar(platform_sales_all, x="platform_name", y=["na_sales", "eu_sales", "jp_sales"], title="Stacked bar chart of platform sales in 3 regions")
fig.show()

The ranking shows leading gaming platforms on different world markets by gaming revenue in the measured period. In every region the platforms perform differently but EU and NA have much more in common in comparison to Japan market. For example X360 has significant market share there wherease in EU and NA it is far from being a leader. 3PS that is an udisputable leader in NA and no.3 in EU has hardly got 1% of market in Japan.

Shares of market leaders vary from 15% to 8% in all 3 regions.

In absolute values of sales North America is the largest market and Japan is the smallest among these 3 regions. Sizes of the pie charts represent these correlation.

Thus we can conclude that there is no absolute leader that would be no. 1 platform for every region. Only PS3 managed to stay among top 3 patforms for all 3 markets.

Top 5 genres by regions ¶

NA sales¶

genres_sales_na = df[['genre', 'na_sales']].groupby('genre').sum().sort_values(by='na_sales', ascending=False).reset_index()
genres_sales_na

#calculating market share for every genre
genres_sales_na['percentage_na'] = (genres_sales_na['na_sales']/genres_sales_na['na_sales'].sum())*100

genres_sales_na.head()

# NA market view for genres 
genres_sales_na=genres_sales_na[genres_sales_na['na_sales']!=0]

sizes=genres_sales_na.na_sales.values.tolist()
labels = genres_sales_na.apply(lambda x: str(x[0]) + "\n" + "mln $" + str(round(x[1])) , axis=1)


colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]

# Draw Plot
plt.figure(figsize=(15,9), dpi= 80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)

# Decorate
plt.title('Genres Market Shares in NA')
#plt.axis('off')
plt.show()

Five top selling genres for North America are: action, sports, shooter, misc and racing.

EU sales¶

genres_sales_eu = df[['genre', 'eu_sales']].groupby('genre').sum().sort_values(by='eu_sales', ascending=False).reset_index()
genres_sales_eu

#calculating market share for every genre
genres_sales_eu['percentage_eu'] = (genres_sales_eu['eu_sales']/genres_sales_eu['eu_sales'].sum())*100

genres_sales_eu.head()

# eu market view for genres 
genres_sales_eu=genres_sales_eu[genres_sales_eu['eu_sales']!=0]

sizes=genres_sales_eu.eu_sales.values.tolist()
labels = genres_sales_eu.apply(lambda x: str(x[0]) + "\n" + "mln $" + str(round(x[1])) , axis=1)


colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]

# Draw Plot
plt.figure(figsize=(15,9), dpi= 80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)

# Decorate
plt.title('Genres Market Shares in Europe')
#plt.axis('off')
plt.show()

Japan sales¶

genres_sales_jp = df[['genre', 'jp_sales']].groupby('genre').sum().sort_values(by='jp_sales', ascending=False).reset_index()
genres_sales_jp

#calculating market share for every genre
genres_sales_jp['percentage_jp'] = (genres_sales_jp['jp_sales']/genres_sales_jp['jp_sales'].sum())*100

genres_sales_jp.head()

# Japan market view for genres 
genres_sales_jp=genres_sales_jp[genres_sales_jp['jp_sales']!=0]

sizes=genres_sales_jp.jp_sales.values.tolist()
labels = genres_sales_jp.apply(lambda x: str(x[0]) + "\n" + "mln $" + str(round(x[1])) , axis=1)


colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]

# Draw Plot
plt.figure(figsize=(15,9), dpi= 80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)

# Decorate
plt.title('Genres Market Shares in Japan')
#plt.axis('off')
plt.show()

Five top selling genres for Japan are: role-playing, action, sports, misc and platform.

All in all we have the following leaders for the 3 regions: Action, Sports, Shooter, Misc, Racing, Platform and Role-playing. We will build a stacked bar chart to see the difference in sales more clearly.

#creating a separate column with the names of leading genres and "others" 
top_genres = ['Action', 'Sports', 'Shooter', 'Misc','Racing', 'Platform', 'Role-Playing'] #list of genre leaders 
df['genre_name'] = [x if x in top_genres else 'other' for x in df['genre']]
df['genre_name'].value_counts()

other           4229
Action          3306
Sports          2306
Misc            1721
Role-Playing    1481
Shooter         1296
Racing          1226
Platform         878
Name: genre_name, dtype: int64

#Creating a table with platform sales split through the markets for the stacked bar chart
genres_sales_all = df[['genre_name', 'na_sales', 'eu_sales', 'jp_sales']].groupby('genre_name').sum().reset_index()
genres_sales_all

#pie chart for market shares
from plotly.subplots import make_subplots
import plotly.graph_objects as go

#labels = ['Action', 'Sports', 'Shooter', 'Misc', 'Racing', 'Platform', 'Role-Playing', 'other']
labels = genres_sales_all['genre_name']
value1 = genres_sales_all['na_sales']
value2 = genres_sales_all['eu_sales']
value3 = genres_sales_all['jp_sales']
fig = make_subplots(2, 2, specs=[[{'type':'domain'},    {'type':'domain'}],
                           [{'type':'domain'}, {'type':'domain'}]], 
                    subplot_titles=['NA', 'EU', 'JP'])
fig.add_trace(go.Pie(labels=labels, values=value1, scalegroup='one',
                     name="NA"), 1, 1)
fig.add_trace(go.Pie(labels=labels, values=value2, scalegroup='one',
                     name="EU"), 1, 2)
fig.add_trace(go.Pie(labels=labels, values=value3, scalegroup='one',
                     name="JP"), 2, 1)

fig.update_layout(title_text='Markets Shares of genres')
fig.show()

import plotly.express as px

fig = px.bar(genres_sales_all, x="genre_name", y=["na_sales", "eu_sales", "jp_sales"], title="Stacked bar chart of genres sales in 3 regions")
fig.show()

The charts show leading genres across 3 different world markets. Leading genres for NA and US are almost the same, just a slight rotation of "Racing" and "Misc" on 4th and 5th places. "Action" is an undisputebale leader for both markets with market share over 20%.

In Japan almost 1/3 of the selling games are "Role-Playing"; in Europe and North America it is on the 6th place with market share of 8%. Although in absolute figures sales of this genre in Japan and America are almost the same.

"Misc" and "Platform" have the same market shares on the all 3 markets (7-9%). Although "Platform" even got to the top 5 genres in Japan.

Dramatic difference in genre preferences are illustrated also by "Shooter" and "Racing". They are absolutely not popular in Japan unlike NA and EU.

ESRB ratings and sales in individual regions¶

NA sales¶

# As ESRB ratings were istablished only in 1994, we will select only data after this period. 
df_after_1994 = df[df.year_of_release>=1994]

rating_sales_na = df_after_1994[['rating', 'na_sales']].groupby('rating').sum().sort_values(by='na_sales', ascending=False).reset_index()
rating_sales_na

#calculating share for every rating type
rating_sales_na['percentage_na'] = (rating_sales_na['na_sales']/rating_sales_na['na_sales'].sum())*100

rating_sales_na.head()

# rating shares in NA
rating_sales_na=rating_sales_na[rating_sales_na['na_sales']!=0]

sizes=rating_sales_na.na_sales.values.tolist()
labels = rating_sales_na.apply(lambda x: str(x[0]) + "\n" + "mln $" + str(round(x[1])) , axis=1)


colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]

# Draw Plot
plt.figure(figsize=(15,9), dpi= 80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)

# Decorate
plt.title('Rating Shares in North America')
#plt.axis('off')
plt.show()

The most selling games have rating E, T, M and E10 (which is a lot as this category was established only 10 years later than the rest).

EU sales¶

rating_sales_eu = df_after_1994[['rating', 'eu_sales']].groupby('rating').sum().sort_values(by='eu_sales', ascending=False).reset_index()
rating_sales_eu

#calculating market share for every rating
rating_sales_eu['percentage_eu'] = (rating_sales_eu['eu_sales']/rating_sales_eu['eu_sales'].sum())*100

rating_sales_eu.head()

# ratings for eu market
rating_sales_eu=rating_sales_eu[rating_sales_eu['eu_sales']>1]

sizes=rating_sales_eu.eu_sales.values.tolist()
labels = rating_sales_eu.apply(lambda x: str(x[0]) + "\n" + "mln $" + str(round(x[1])) , axis=1)


colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]

# Draw Plot
plt.figure(figsize=(15,9), dpi= 80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)

# Decorate
plt.title('Rating Market Shares in Europe')
#plt.axis('off')
plt.show()

Japan sales¶

rating_sales_jp = df_after_1994[['rating', 'jp_sales']].groupby('rating').sum().sort_values(by='jp_sales', ascending=False).reset_index()
rating_sales_jp

#calculating market share for every genre
rating_sales_jp['percentage_jp'] = (rating_sales_jp['jp_sales']/rating_sales_jp['jp_sales'].sum())*100

rating_sales_jp.head()

# Japan market view for ratings 
rating_sales_jp=rating_sales_jp[rating_sales_jp['jp_sales']!=0]

sizes=rating_sales_jp.jp_sales.values.tolist()
labels = rating_sales_jp.apply(lambda x: str(x[0]) + "\n" + "mln $" + str(round(x[1])) , axis=1)


colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]

# Draw Plot
plt.figure(figsize=(15,9), dpi= 80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)

# Decorate
plt.title('Ratings and their Market Shares in Japan')
#plt.axis('off')
plt.show()

The most popular games in Japan have rating E, games with ratings T, M, E10 are also among the market leaders.

#Creating a table with rating sales split through the markets for the stacked bar chart
rating_sales_all = df_after_1994[['rating', 'na_sales', 'eu_sales', 'jp_sales']].groupby('rating').sum().reset_index()
rating_sales_all

#pie chart for game ratings


labels = rating_sales_all['rating']
value1 = rating_sales_all['na_sales']
value2 = rating_sales_all['eu_sales']
value3 = rating_sales_all['jp_sales']
fig = make_subplots(2, 2, specs=[[{'type':'domain'},    {'type':'domain'}],
                           [{'type':'domain'}, {'type':'domain'}]], 
                    subplot_titles=['NA', 'EU', 'JP'])
fig.add_trace(go.Pie(labels=labels, values=value1, scalegroup='one',
                     name="NA"), 1, 1)
fig.add_trace(go.Pie(labels=labels, values=value2, scalegroup='one',
                     name="EU"), 1, 2)
fig.add_trace(go.Pie(labels=labels, values=value3, scalegroup='one',
                     name="JP"), 2, 1)

fig.update_layout(title_text='Market Shares of ESRB Ratings')
fig.show()

#staked bar chart

fig = px.bar(rating_sales_all, x="rating", y=["na_sales", "eu_sales", "jp_sales"],\
             title="Stacked bar chart of ESRB Ratings in 3 regions")
fig.show()

As to the ratings is concerned, European and American gamers have similar preferences. About 40% of them prefer to be on the safe side and choose games with E rating. Almost a quater in every region chooses games rated for Teens and a quater picks up Mature rated games. In EU percentage of M rated games is slightly higer.

In Japan share of games with Teens rating is considerably higher: 33% vs 24% in two other regions. Simultaneously percentage Mature rated games is 14 which is 10% lower than in NA and EU regions.

The share of games for children E10 is similar across all 3 regions, in Japan it is insignificantly lower.

All in all E rating is leading in all regions both in market share and in absolute figures.

Based on the performed data analysis we can create the following User Profiles

An average NA gamer prefers X360, PS2, Wii almost with the same likelihood. Most likely the games will be in genre Action, Sports or Shooter with rating E ("Everyone").

A European gamer has a broder range of preferences: all top 5 platforms have an even chance to be chosen: PS2, PS3, X360, Wii or PS. There is a great probability that likewise his American colleague he or she will pick up Action, Sports or Shooter with rating E.

An average Japanese gamer prefers DS, PS, PS2 most of all. And when it comes to games' choise, most likely it will be role-play or action with rating E.

Part 3. Hypotheses testing ¶

1 Average user ratings of the Xbox One and PC¶

Stating hypothesis (H0 has to state unchanged result):

H0: average user ratings of the Xbox One and PC platforms are the same

H1: average user ratings of the Xbox One differ from average user ratings for PC platform

#creating 2 db for XBox and PC
df_XB = df.query('platform == "XB"')
df_XB.shape

(803, 14)

df_PS = df.query('platform == "PS"')
df_PS.shape

(1190, 14)

#Getting rid of outliers

#defining outliers with 3-sigma method for both platforms
#as we are calulating std for sample ddof is set to 1

std_score_XB = np.std(df_XB['user_score'], ddof=1)
three_sigma_score_XB_lower = round((df_XB['user_score'].mean() - std_score_XB*3),2)
#we do not use the lower fence as it will give us negative value for user scores. 

three_sigma_score_XB_upper = round((df_XB['user_score'].mean() + std_score_XB*3),2)



std_score_PS = np.std(df_PS['user_score'], ddof=1)
three_sigma_score_PS_lower = round((df_PS['user_score'].mean() - std_score_PS*3),2)
#we do not use the lower fence as it will give us negative value for user scores. 

three_sigma_score_PS_upper = round((df_PS['user_score'].mean() + std_score_PS*3),2)

print('99.7% of games on XB platform have rating from  0 to ', three_sigma_score_XB_upper,'. \n99.7% of of games on PS platform have rating from 0 to ', three_sigma_score_PS_upper,'.')

99.7% of games on XB platform have rating from  0 to  15.52 . 
99.7% of of games on PS platform have rating from 0 to  9.99 .

#setting df without outliers

df_XB_no_outliers = df_XB.query('user_score<=@three_sigma_score_XB_upper')

df_PS_no_outliers = df_PS.query('user_score<=@three_sigma_score_PS_upper')

#defining Variance for the two samples to define whether or not we can consider them as equal for t-test.
variance_XB = np.var(df_XB_no_outliers['user_score'])
print('Variance for XB ratings sample is ', variance_XB)

Variance for XB ratings sample is  9.493162099164262

variance_PS = np.var(df_PS_no_outliers['user_score'])
print('Variance for PS rating sample is ', variance_PS)

Variance for PS rating sample is  8.494360009886227

Explanation of the method choise for hypothesis testing

As we have two samples of continuous data, the samples are drawn from a normally distributed data with different variances we will conduct a Welch test to test hypothesis (in Python it will be a two-tailed T-test with correction 'equal_var = False').

A critical statistical significance level will be set at 0.05 as it is a commonly-accepted level and as we do not conduct medical testing, higher acuracy is not required

alpha = .05 
# "equal_var = False" as previous calculations proved that the 2 samples have different variance value. 
results =  st.ttest_ind(df_XB_no_outliers['user_score'], df_PS_no_outliers['user_score'], equal_var = False) 

print('p-value:', results.pvalue/2) 
#we are making two-tailed test as we are checking whether the av. rating of the two platforms differ, 
#no matter if one is bigger or smaller than the other

 
if (results.pvalue/2 < alpha):
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis")

p-value: 2.571064927205229e-214
We reject the null hypothesis

The data provides sufficient evidence, given the significance level we selected (5%), to reject the null hypothesis. Therefore, we can conclude that average users score of games on Xbox One platform and average score on PC platform are not the same (μ1 != μ2).

2 Average user ratings for the Action and Sports genres¶

Stating hypothesis (H0 states unchanged result):

H0: average user ratings of Action and Sports genres are the same

H1: average user ratings of Action and Sports genres are different

#creating 2 db for Action and Sports
df_Action = df.query('genre == "Action"')
df_Action.shape

(3306, 14)

df_Sports = df.query('genre == "Sports"')
df_Sports.shape

(2306, 14)

#Getting rid of outliers

#defining outliers with 3-sigma method for both platforms
#as we are calulating std for sample ddof is set to 1

std_score_Action = np.std(df_Action['user_score'], ddof=1)
three_sigma_score_Action_lower = round((df_Action['user_score'].mean() - std_score_Action*3),2)
#we do not use the lower fence as it will give us negative value for user scores. 

three_sigma_score_Action_upper = round((df_Action['user_score'].mean() + std_score_Action*3),2)



std_score_Sports = np.std(df_Sports['user_score'], ddof=1)
three_sigma_score_Sports_lower = round((df_Sports['user_score'].mean() - std_score_Sports*3),2)
#we do not use the lower fence as it will give us negative value for user scores. 

three_sigma_score_Sports_upper = round((df_Sports['user_score'].mean() + std_score_Sports*3),2)

print('99.7% of games of Action genre have rating from  0 to ', three_sigma_score_Action_upper,'. \n99.7% of of games of Sports genre have rating from 0 to ', three_sigma_score_Sports_upper,'.')

99.7% of games of Action genre have rating from  0 to  15.16 . 
99.7% of of games of Sports genre have rating from 0 to  14.86 .

#setting df without outliers

df_Action_no_outliers = df_Action.query('user_score<=@three_sigma_score_Action_upper')

df_Sports_no_outliers = df_Sports.query('user_score<=@three_sigma_score_Sports_upper')

#defining Variance for the two samples to define whether or not we can consider them as equal for t-test.
variance_Action = np.var(df_Action_no_outliers['user_score'])
print('Variance for Action ratings sample is ', variance_Action)

Variance for Action ratings sample is  12.85548225027838

variance_Sports = np.var(df_Sports_no_outliers['user_score'])
print('Variance for Sports rating sample is ', variance_Sports)

Variance for Sports rating sample is  13.339070060455569

As we have two samples of continuous data, the samples are drawn from a normally distributed data with close values of variances we will conduct an unpaired t-test to test the stated hypothesis (in Python it will be unpaired two-tailed T-test with 'equal_var = True').

A critical statistical significance level will be set at 0.05 as it is a commonly-accepted one.

alpha = .05 
# "equal_var = True" as previous calculations proved that the 2 samples have very close variance values. 
results =  st.ttest_ind(df_Action_no_outliers['user_score'], df_Sports_no_outliers['user_score'], equal_var = True) 

print('p-value:', results.pvalue/2) 
#we are making two-tailed test as we are checking whether the av. rating of the two genres differ, 
#no matter if one is bigger or smaller than the other

 
if (results.pvalue/2 < alpha):
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis")

p-value: 1.8333824403745673e-07
We reject the null hypothesis

The data provides sufficient evidence, given the significance level we selected (5%), to reject the null hypothesis. Therefore, we can conclude that average users ratings of Action and Sports genres are different (μ1 != μ2).

General conclusion ¶

We have performed analysis of the game platforms data for the period 1980-1916.
As data contained a lot of missing values in some of the sections we had partly restored them. Some data containing missing values was dropped (but not more than 3%).</br>
We explored dynamics of game releases, and found some patterns: game releases boomed from 1994, period of 2008 - 2009 was the most prolific, but was followed by a sharp decline in releases the following years. After a slight recovery in 2014 the number of released games per year stayed stable.

Having analized total sales on 3 major markets though the above years we have figured out that the following platforms were the most successful ones: PS2, X360, PS3, Wii, DS and PS.

They were launched in different years and reined for a while. Peak period lasted for 2-4 years followed by decline of a platform. Decline of one platform coresponded to a launch of another successor. An average life cycle of a platform is 10 years.

By 2016 all the above platforms ended or were ending their life cycle.

Relevant years for platforms analysis were picked up as 2000-2015. This period correspond to 1.5 lifecycle of a platform and could allow to make prognosis.

Sales analysis of the market leaders 2000-2016 shows that the platforms perform differently in terms of total sales. We have different median values, minimum and maximum values, the their sales distribution is scewed in different ways. To conclude, the statistics of sales of leading platforms have significant differences.

Also we have figured out that critis scores of the games have very strong positive linear correlation with the total sales. Correlation with user rating is usually weak, negative and not stable through the platforms.

As to the game distribution by genres, the charts show that the most profitable genres are: action, sports and and shooter. Puzzle and strategy have brought the least amount of money. Despite the fact that these 2 genres look like outsiders, they were very popular back in their days. Puzzle had a very bright but short peak in 2007-2008 and Strategy emerged before 1990 and had several peaks during its cycle. Distributions for these 2 genres have thicker tails to the left.

Distributions of the market leaders have hicker tails to the right.

Analizing platforms behaviour on 3 main markets we can see that every region have their own favourits. Genres popularity on these markets makes it even more obvious that every market has its specifics due to different economical and cultural background.

	Name	Platform	Year_of_Release	Genre	NA_sales	EU_sales	JP_sales	Other_sales	Critic_Score	User_Score	Rating
0	Wii Sports	Wii	2006.0	Sports	41.36	28.96	3.77	8.45	76.0	8	E
1	Super Mario Bros.	NES	1985.0	Platform	29.08	3.58	6.81	0.77	NaN	NaN	NaN
2	Mario Kart Wii	Wii	2008.0	Racing	15.68	12.76	3.79	3.29	82.0	8.3	E
3	Wii Sports Resort	Wii	2009.0	Sports	15.61	10.93	3.28	2.95	80.0	8	E
4	Pokemon Red/Pokemon Blue	GB	1996.0	Role-Playing	11.27	8.89	10.22	1.00	NaN	NaN	NaN

	Name	Platform	Year_of_Release	Genre	NA_sales	EU_sales	JP_sales	Other_sales	Critic_Score	User_Score	Rating
13088	Downforce	PS2	2002.0	Racing	0.03	0.02	0.00	0.01	69.0	tbd	E
10578	The Whispered World	PC	2009.0	Adventure	0.00	0.08	0.00	0.02	70.0	7.6	E
7546	Need for Speed Carbon: Own the City	DS	2006.0	Racing	0.17	0.01	0.00	0.02	70.0	7.5	E
9092	Pirates: The Legend of Black Kat	PS2	2002.0	Adventure	0.07	0.05	0.00	0.02	72.0	7.9	T
3399	Beijing 2008	X360	2008.0	Sports	0.15	0.38	0.00	0.07	NaN	NaN	NaN
14024	Galaxy Fight	SAT	1994.0	Fighting	0.00	0.00	0.04	0.00	NaN	NaN	NaN

	name	platform	year_of_release	genre	na_sales	eu_sales	jp_sales	other_sales	critic_score	user_score	rating
0	Wii Sports	Wii	2006.0	Sports	41.36	28.96	3.77	8.45	76.0	8	E
1	Super Mario Bros.	NES	1985.0	Platform	29.08	3.58	6.81	0.77	NaN	NaN	NaN
2	Mario Kart Wii	Wii	2008.0	Racing	15.68	12.76	3.79	3.29	82.0	8.3	E
3	Wii Sports Resort	Wii	2009.0	Sports	15.61	10.93	3.28	2.95	80.0	8	E
4	Pokemon Red/Pokemon Blue	GB	1996.0	Role-Playing	11.27	8.89	10.22	1.00	NaN	NaN	NaN

	year_of_release	na_sales	eu_sales	jp_sales	other_sales	critic_score
count	16446.000000	16715.000000	16715.000000	16715.000000	16715.000000	8137.000000
mean	2006.484616	0.263377	0.145060	0.077617	0.047342	68.967679
std	5.877050	0.813604	0.503339	0.308853	0.186731	13.938165
min	1980.000000	0.000000	0.000000	0.000000	0.000000	13.000000
25%	2003.000000	0.000000	0.000000	0.000000	0.000000	60.000000
50%	2007.000000	0.080000	0.020000	0.000000	0.010000	71.000000
75%	2010.000000	0.240000	0.110000	0.040000	0.030000	79.000000
max	2016.000000	41.360000	28.960000	10.220000	10.570000	98.000000

	name	platform	year_of_release	genre	na_sales	eu_sales	jp_sales	other_sales	critic_score	user_score	rating
0	Wii Sports	Wii	2006.0	Sports	41.36	28.96	3.77	8.45	76.0	8	E
1	Super Mario Bros.	NES	1985.0	Platform	29.08	3.58	6.81	0.77	NaN	NaN	NaN
2	Mario Kart Wii	Wii	2008.0	Racing	15.68	12.76	3.79	3.29	82.0	8.3	E
3	Wii Sports Resort	Wii	2009.0	Sports	15.61	10.93	3.28	2.95	80.0	8	E
4	Pokemon Red/Pokemon Blue	GB	1996.0	Role-Playing	11.27	8.89	10.22	1.00	NaN	NaN	NaN

	name	platform	genre	user_score	rating
count	16713	16715	16713	10014	9949
unique	11559	31	12	96	8
top	Need for Speed: Most Wanted	PS2	Action	tbd	E
freq	12	2161	3369	2424	3990

	column	missing_values_share
3	critic_score	0.513192
5	rating	0.404786
4	user_score	0.400897
1	year_of_release	0.016093
0	name	0.000120
2	genre	0.000120

	name	platform	year_of_release	genre	na_sales	eu_sales	jp_sales	other_sales	critic_score	user_score	rating
119	Zumba Fitness	Wii	2010	Sports	3.45	2.59	0.0	0.66	42.0	tbd	E
301	Namco Museum: 50th Anniversary	PS2	2005	Misc	2.08	1.35	0.0	0.54	61.0	tbd	E10+
520	Zumba Fitness 2	Wii	2011	Sports	1.51	1.03	0.0	0.27	NaN	tbd	T
645	uDraw Studio	Wii	2010	Misc	1.65	0.57	0.0	0.20	71.0	tbd	E
718	Just Dance Kids	Wii	2010	Misc	1.52	0.54	0.0	0.18	NaN	tbd	E
...	...	...	...	...	...	...	...	...	...	...	...
16695	Planet Monsters	GBA	2001	Action	0.01	0.00	0.0	0.00	67.0	tbd	E
16697	Bust-A-Move 3000	GC	2003	Puzzle	0.01	0.00	0.0	0.00	53.0	tbd	E
16698	Mega Brain Boost	DS	2008	Puzzle	0.01	0.00	0.0	0.00	48.0	tbd	E
16704	Plushees	DS	2008	Simulation	0.01	0.00	0.0	0.00	NaN	tbd	E
16706	Men in Black II: Alien Escape	GC	2003	Shooter	0.01	0.00	0.0	0.00	50.0	tbd	T

	year_of_release	total_sales
0	2006	82.54
1	1985	40.24
2	2008	35.52
3	2009	32.77
4	1996	31.38
5	1989	30.26
6	1984	28.31
7	2005	24.67
8	1999	23.09
9	2007	22.70
10	2010	21.82
11	2013	21.05
12	2004	20.81
13	1990	20.62
14	1988	17.28
15	2002	16.15
16	2001	14.97
17	2011	14.73
18	1998	14.64
19	2015	14.63
20	2012	13.79
21	2014	12.62
22	1992	11.18
23	1997	10.95
24	1993	10.55
25	1994	9.30
26	1982	7.81
27	2016	7.60
28	2003	7.20
29	1986	6.50
30	2000	6.39
31	1995	5.15
32	1991	4.61
33	1981	4.50
34	1987	4.38
35	1980	4.31
36	1983	3.20

	platform	total_sales
0	PS2	1233.56
1	X360	961.24
2	PS3	931.34
3	Wii	891.18
4	DS	802.78
5	PS	727.58
6	PS4	314.14
7	GBA	312.88
8	PSP	289.53
9	3DS	257.81
10	PC	255.76
11	GB	254.43
12	XB	251.57
13	NES	251.05
14	N64	218.01
15	SNES	200.04
16	GC	196.73
17	XOne	159.32
18	2600	86.48
19	WiiU	82.19
20	PSV	53.81
21	SAT	33.59
22	GEN	28.35
23	DC	15.95
24	SCD	1.86
25	NG	1.44
26	WS	1.42
27	TG16	0.16
28	3DO	0.10
29	GG	0.04
30	PCFX	0.03

	platform	total_sales	sales_z
0	PS2	1233.56	2.759744
1	X360	961.24	1.968011
2	PS3	931.34	1.881081
3	Wii	891.18	1.764321
4	DS	802.78	1.507310

	name	platform	year_of_release	genre	na_sales	eu_sales	critic_score	user_score	rating	total_sales
16676	G1 Jockey 4 2008	PS3	2008	Sports	0.00	0.00	NaN	0.0	NaN	0.00
16709	SCORE International Baja 1000: The Official Game	PS2	2008	Racing	0.00	0.00	NaN	0.0	NaN	0.00
15384	Order of War	PC	2009	Strategy	0.00	0.01	69.0	7.3	T	0.01
15486	Arslan: The Warriors of Legend	XOne	2016	Action	0.01	0.00	69.0	5.8	T	0.01
15499	Motocross Mania 3	XB	2005	Racing	0.01	0.00	38.0	3.6	T	0.01

	year_of_release	na_sales	eu_sales	jp_sales	other_sales	critic_score	user_score	total_sales
0	2011	0.13	0.13	0.09	0.10	314.0	24.6	0.45
1	2010	2.19	1.54	0.71	1.20	988.0	107.8	5.64
2	2009	9.36	4.94	1.94	10.16	3142.0	293.5	26.40

platform	3DS	DC	DS	GB	GBA	GC	N64	PC	PS	PS2	PS3	PS4	PSP	PSV	WS	Wii	WiiU	X360	XB	XOne
year_of_release
2012	63.20	0.0	26.18	0.0	0.0	0.0	0.0	35.03	0.0	0.45	156.78	0.00	17.82	4.63	0.0	59.65	0.00	143.84	0.0	0.00
2013	51.36	0.0	11.01	0.0	0.0	0.0	0.0	23.22	0.0	0.00	107.36	0.00	7.69	16.19	0.0	21.71	17.56	99.74	0.0	0.00
2014	56.57	0.0	1.54	0.0	0.0	0.0	0.0	12.38	0.0	0.00	113.25	25.99	3.14	10.59	0.0	8.59	21.65	88.58	0.0	18.96
2015	43.76	0.0	0.00	0.0	0.0	0.0	0.0	13.28	0.0	0.00	47.76	100.00	0.24	11.90	0.0	3.75	22.03	34.74	0.0	54.07
2016	27.78	0.0	0.00	0.0	0.0	0.0	0.0	8.52	0.0	0.00	16.82	118.90	0.12	6.25	0.0	1.14	16.35	11.96	0.0	60.14

platform	3DS	DC	DS	GB	GBA	GC	N64	PC	PS	PS2	PS3	PS4	PSP	PSV	WS	Wii	WiiU	X360	XB	XOne
year_of_release
2012	-11.84	0.0	-15.17	0.0	0.0	0.0	0.0	-11.81	0.0	-0.45	-49.42	0.00	-10.13	11.56	0.0	-37.94	17.56	-44.10	0.0	0.00
2013	5.21	0.0	-9.47	0.0	0.0	0.0	0.0	-10.84	0.0	0.00	5.89	25.99	-4.55	-5.60	0.0	-13.12	4.09	-11.16	0.0	18.96
2014	-12.81	0.0	-1.54	0.0	0.0	0.0	0.0	0.90	0.0	0.00	-65.49	74.01	-2.90	1.31	0.0	-4.84	0.38	-53.84	0.0	35.11
2015	-15.98	0.0	0.00	0.0	0.0	0.0	0.0	-4.76	0.0	0.00	-30.94	18.90	-0.12	-5.65	0.0	-2.61	-5.68	-22.78	0.0	6.07
2016	-12.64	0.0	0.00	0.0	0.0	0.0	0.0	-3.27	0.0	0.00	-13.22	-49.65	-0.12	-2.00	0.0	-0.96	-11.75	-10.44	0.0	-33.99

	user_score	critic_score	total_sales
0	0.0	2879.0	480.68
1	1.2	35.0	0.16
2	2.4	28.0	0.21
3	3.5	43.0	0.07
4	3.8	65.0	0.37

genre	Action	Puzzle	Shooter	Sports	Strategy
year_of_release
1980	0.34	0.00	7.07	0.50	0.00
1981	14.79	2.25	10.02	0.78	0.00
1982	6.50	10.04	3.81	1.06	0.00
1983	2.86	0.78	0.49	3.20	0.00
1984	1.85	3.14	31.10	6.17	0.00
1985	3.50	3.21	1.00	1.96	0.00
1986	13.74	0.00	3.89	5.58	0.00
1987	1.11	0.00	0.69	3.71	0.00
1988	1.74	5.58	0.51	3.60	0.00
1989	4.65	37.75	1.19	5.72	0.00
1990	6.39	5.99	0.00	2.11	0.00
1991	6.76	3.24	1.99	2.41	0.94
1992	3.84	4.84	0.28	2.95	0.37
1993	1.81	3.17	3.08	3.18	0.81
1994	1.55	1.53	8.30	8.41	3.57
1995	3.58	2.69	4.15	7.99	6.54
1996	20.60	3.92	6.90	17.42	5.61
1997	27.59	5.87	22.16	30.04	7.72
1998	39.41	6.35	9.80	41.75	13.48
1999	27.73	1.34	12.23	30.21	18.46
2000	33.98	3.83	6.83	41.24	8.54
2001	59.38	7.94	24.81	51.46	7.52
2002	86.65	5.33	48.47	65.31	5.52
2003	67.98	2.42	27.09	55.96	7.93
2004	76.13	8.40	46.94	63.61	7.14
2005	85.39	20.36	41.52	59.49	5.28
2006	66.06	10.74	38.24	135.78	4.15
2007	104.85	23.56	70.61	97.73	9.26
2008	134.96	15.23	59.16	94.58	11.60
2009	137.69	19.79	69.29	137.37	12.29
2010	115.29	10.75	76.83	91.24	13.68
2011	116.68	4.96	98.18	56.08	8.82
2012	119.25	1.72	71.75	30.42	3.26
2013	122.51	0.96	62.04	41.17	6.12
2014	97.23	1.49	65.21	45.15	0.98
2015	72.02	0.71	67.51	40.84	1.85
2016	30.11	0.01	38.22	23.49	1.13

	platform	na_sales	percentage_na
0	X360	595.74	13.722238
1	PS2	572.92	13.196604
2	Wii	486.87	11.214533
3	PS3	390.13	8.986230
4	DS	380.31	8.760037

	platform	eu_sales	percentage_eu
0	PS2	332.63	13.861432
1	PS3	327.21	13.635568
2	X360	268.32	11.181491
3	Wii	258.32	10.764769
4	PS	212.39	8.850763

	platform	year_of_release	total_sales
0	DS	2004	17.27
1	DS	2005	130.14
2	DS	2006	119.81
3	DS	2007	146.94
4	DS	2008	145.31

	platform	jp_sales	percentage_jp
0	DS	175.00	13.559375
1	PS	139.78	10.830454
2	PS2	137.54	10.656894
3	SNES	116.55	9.030543
4	3DS	100.62	7.796253

	platform_name	na_sales	eu_sales	jp_sales
0	3DS	82.65	61.27	100.62
1	DS	380.31	188.41	175.00
2	PS	334.72	212.39	139.78
3	PS2	572.92	332.63	137.54
4	PS3	390.13	327.21	79.41

	genre	na_sales
0	Action	863.17
1	Sports	671.20
2	Shooter	584.83
3	Platform	444.44
4	Misc	399.57
5	Racing	356.86
6	Role-Playing	330.04
7	Fighting	220.51
8	Simulation	180.40
9	Puzzle	121.13
10	Adventure	101.52
11	Strategy	67.75

	genre	na_sales	percentage_na
0	Action	863.17	19.882204
1	Sports	671.20	15.460379
2	Shooter	584.83	13.470938
3	Platform	444.44	10.237203
4	Misc	399.57	9.203671

	genre	eu_sales
0	Action	510.99
1	Sports	371.33
2	Shooter	314.52
3	Racing	234.49
4	Misc	210.60
5	Platform	199.39
6	Role-Playing	188.24
7	Simulation	113.35
8	Fighting	99.00
9	Adventure	63.20
10	Puzzle	49.78
11	Strategy	44.79

	genre	eu_sales	percentage_eu
0	Action	510.99	21.294089
1	Sports	371.33	15.474147
2	Shooter	314.52	13.106748
3	Racing	234.49	9.771720
4	Misc	210.60	8.776170

	genre	jp_sales
0	Role-Playing	353.39
1	Action	160.12
2	Sports	134.93
3	Platform	130.71
4	Misc	107.02
5	Fighting	87.28
6	Simulation	63.64
7	Puzzle	56.68
8	Racing	56.63
9	Adventure	52.24
10	Strategy	49.30
11	Shooter	38.68

	genre	jp_sales	percentage_jp
0	Role-Playing	353.39	27.381414
1	Action	160.12	12.406440
2	Sports	134.93	10.454665
3	Platform	130.71	10.127691
4	Misc	107.02	8.292139

	rating	na_sales
0	E	1274.24
1	T	747.58
2	M	742.87
3	E10+	345.50
4	K-A	2.56
5	EC	1.53
6	AO	1.26
7	RP	0.00

	rating	na_sales	percentage_na
0	E	1274.24	40.899491
1	T	747.58	23.995198
2	M	742.87	23.844021
3	E10+	345.50	11.089570
4	K-A	2.56	0.082169

	rating	eu_sales
0	E	703.87
1	M	480.00
2	T	420.99
3	E10+	183.33
4	AO	0.61
5	K-A	0.27
6	EC	0.11
7	RP	0.03

	rating	eu_sales	percentage_eu
0	E	703.87	39.339709
1	M	480.00	26.827483
2	T	420.99	23.529379
3	E10+	183.33	10.246422
4	AO	0.61	0.034093

	rating	jp_sales
0	E	197.96
1	T	150.70
2	M	63.58
3	E10+	40.20
4	K-A	1.46
5	AO	0.00
6	EC	0.00
7	RP	0.00

	rating	jp_sales	percentage_jp
0	E	197.96	43.613131
1	T	150.70	33.201146
2	M	63.58	14.007491
3	E10+	40.20	8.856576
4	K-A	1.46	0.321657