Technical Requirements

Business goal: analyze trending-video history on YouTube
How often the dashboard will be used: at least once a day
Target dashboard user: video ads planning managers

Dashboard data content:

  • Trending videos from the past, broken down by day and category
  • Likes, broken down by video category
  • Dislikes, broken down by video category

Parameters according to which the data is to be grouped:

  • Trending date and time (don't forget that likes and dislikes must be calculated as of the last trending day of the video)
  • Video category
  • Region
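
The "last trending day" rule above can be sketched in pandas. This is a hedged illustration on a made-up frame (`video_id`, `likes`, `dislikes` are assumed column names, not taken from the actual source tables): for each video, only the row from its final trending day is kept, so like/dislike counts reflect the last observed values.

```python
import pandas as pd

# Hypothetical raw table: one row per video per trending day.
raw = pd.DataFrame({
    'video_id': ['a', 'a', 'b'],
    'trending_date': pd.to_datetime(['2017-11-14', '2017-11-15', '2017-11-14']),
    'likes': [100, 120, 50],
    'dislikes': [5, 7, 2],
})

# Keep each video's row for its LAST trending day, so likes/dislikes
# are counted as of the final day the video trended.
last_day = raw.sort_values('trending_date').groupby('video_id').tail(1)
```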

The nature of the data:

  • Trending history — absolute values with a breakdown by day
  • Likes broken down by category — relative values (% of total likes)
  • Dislikes broken down by category — relative values (% of total dislikes)
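
Turning the absolute like counts into the relative values the dashboard needs is a one-line normalization; a minimal sketch with made-up category totals (the numbers are illustrative, not from the data):

```python
import pandas as pd

# Hypothetical final-day like totals per category.
likes = pd.Series({'Music': 300, 'Comedy': 100, 'Gaming': 100})

# Convert absolute counts to % of total likes; dislikes are treated
# the same way against total dislikes.
likes_pct = likes / likes.sum() * 100
```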

Importance: the trending-video history graph is the most important, so it should take up at least half of the dashboard
Data sources for the dashboard: raw data on trending (the trending_vids table)
Database storing aggregate data: additional aggregate tables in the youtube database
Data update interval: once every 24 hours, at midnight UTC
Graphs to be displayed, their order, and the dashboard controls to be included

In [14]:
import pandas as pd
from IPython.display import display
from datetime import datetime
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

Data extraction

Reading the database

In [2]:
# Connecting to the database

db_config = {'user': 'practicum_student',                   # user name
             'pwd': 's65BlTKV3faNIGhmvJVzOqhs',             # password
             'host': 'rc1b-wcoijxj3yxfsf3fs.mdb.yandexcloud.net',
             'port': 6432,                                  # connection port
             'db': 'data-analyst-youtube-data'}             # database name

connection_string = 'postgresql://{}:{}@{}:{}/{}'.format(
    db_config['user'],
    db_config['pwd'],
    db_config['host'],
    db_config['port'],
    db_config['db'],
)

engine = create_engine(connection_string)
In [3]:
query = '''
           SELECT *
           FROM trending_by_time
        '''
In [4]:
# Read the query result, using record_id as the index,
# and save a copy to CSV for the dashboard.
trending_by_time = pd.io.sql.read_sql(query, con=engine, index_col='record_id')

trending_by_time.to_csv('trending_by_time.csv', index=False)

trending_by_time.head(5)
Out[4]:
region trending_date category_title videos_count
record_id
1 France 2017-11-14 Autos & Vehicles 8
2 France 2017-11-15 Autos & Vehicles 2
3 France 2017-11-16 Autos & Vehicles 6
4 France 2017-11-17 Autos & Vehicles 8
5 France 2017-11-18 Autos & Vehicles 4
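
Since the dashboard groups by trending date, it is worth making sure `trending_date` is an actual datetime rather than plain strings. A hedged sketch on a stand-in frame (if `read_sql` already returns datetimes, this step is a no-op):

```python
import pandas as pd

# Stand-in for the trending_date column as it might arrive: plain strings.
sample = pd.DataFrame({'trending_date': ['2017-11-14', '2017-11-15']})

# Convert to datetime so date-based grouping and plotting work correctly.
sample['trending_date'] = pd.to_datetime(sample['trending_date'])
```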

Sanity-checking the table

In [ ]:
trending_by_time.info(memory_usage='deep')
In [17]:
pivot_countries_category = trending_by_time.pivot_table(
    index='category_title',
    columns='region',
    values='videos_count',
    aggfunc='sum',
).reset_index().fillna(0)
pivot_countries_category
Out[17]:
region category_title France India Japan Russia United States
0 Autos & Vehicles 1220.0 138.0 538.0 3116.0 758.0
1 Comedy 8446.0 6814.0 1372.0 5968.0 6870.0
2 Education 1480.0 2360.0 212.0 1326.0 3284.0
3 Entertainment 19020.0 32924.0 11734.0 11692.0 19638.0
4 Film & Animation 3768.0 3298.0 2140.0 5676.0 4680.0
5 Gaming 2786.0 132.0 1834.0 2050.0 1606.0
6 Howto & Style 4668.0 1674.0 1574.0 3928.0 8280.0
7 Movies 22.0 32.0 0.0 2.0 0.0
8 Music 7658.0 7714.0 2480.0 3664.0 12874.0
9 News & Politics 6526.0 10346.0 2654.0 9858.0 4818.0
10 Nonprofits & Activism 0.0 0.0 0.0 0.0 106.0
11 People & Blogs 9346.0 4988.0 5792.0 18452.0 6122.0
12 Pets & Animals 468.0 6.0 2250.0 1154.0 1832.0
13 Science & Technology 1588.0 1096.0 300.0 2226.0 4722.0
14 Shows 198.0 410.0 0.0 388.0 114.0
15 Sports 8002.0 1424.0 3606.0 3684.0 4250.0
16 Trailers 4.0 0.0 0.0 0.0 0.0
17 Travel & Events 204.0 16.0 276.0 510.0 804.0
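
The pivot above holds absolute counts; the requirements call for relative values as well. A sketch of normalizing each region's column to percentages, on a small stand-in frame (the two rows are copied from the output above, the idea applies to the full pivot):

```python
import pandas as pd

# Stand-in for the pivot above: categories as rows, regions as columns.
pivot = pd.DataFrame(
    {'France': [1220.0, 8446.0], 'India': [138.0, 6814.0]},
    index=['Autos & Vehicles', 'Comedy'],
)

# Divide each region's column by its own total: the result is the %
# of that region's trending slots taken by each category.
pivot_pct = pivot / pivot.sum() * 100
```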

Tableau visualization link: