The aim of this project is to find patterns in the information available to a new ride-sharing company launching in Chicago. We will study the following:
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats as st
import seaborn as sns
import squarify
df_company = pd.read_csv('./project_sql_result_01.csv', sep=',')
df_trips = pd.read_csv('./project_sql_result_04.csv', sep=',')
df_company.head()
df_trips.head()
df_company.sample(6)
df_trips.sample(6)
df_company.info()
df_trips.info()
#statistical summary of the df_company data
df_company.describe()
fig = px.box(df_company, y="trips_amount")
fig.show()
#statistical summary of the df_trips data
df_trips.describe()
fig = px.box(df_trips, y="average_trips")
fig.show()
#Checking data for zeros:
for i in df_company.columns:
    print(i, len(df_company[df_company[i]==0]))
#Checking data for zeros:
for i in df_trips.columns:
    print(i, len(df_trips[df_trips[i]==0]))
df_company['trips_amount'].hist(bins=30)
plt.show()
df_trips['average_trips'].hist(bins=30)
plt.show()
#statistical summary for categorical variables for df_company
df_company.describe(include=['object'])
#statistical summary for categorical variables for df_trips
df_trips.describe(include=['object'])
Our initial data consists of two tables with information on taxi companies and trip destinations.
The table with company info has 64 unique company names. The distribution of the number of rides per taxi company over the study period (trips_amount) is positively skewed, with a long tail to the right. The middle 50% of companies had from 14.23 to 299 rides during these days. There are quite a lot of outliers in the table. The standard deviation is 3812.
The second table has 94 unique location names. The distribution of the average number of rides that ended in each neighborhood is also positively skewed, with many outliers. The middle 50% of neighborhoods had from 21 to 2143 rides. The standard deviation is 1714.
Neither table has missing values.
There is no need to change data types in either table.
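Both points can be double-checked programmatically. A minimal sketch on a hypothetical mini-frame mimicking df_company's schema (the real frames come from the CSVs loaded above):

```python
import pandas as pd

# Hypothetical two-row frame with df_company's schema, for illustration only
sample = pd.DataFrame({
    'company_name': ['Flash Cab', 'Taxi Affiliation Services'],
    'trips_amount': [19558, 11422],
})

# No missing values in any column
assert sample.isna().sum().sum() == 0

# Dtypes are already usable: object for names, integer for ride counts
print(sample.dtypes.astype(str).to_dict())
```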
#nlargest method for defining top 10 neighborhoods
top_10_neighborhoods = df_trips.nlargest(10,['average_trips'])
top_10_neighborhoods
#visualizing number of trips to locations
(top_10_neighborhoods.groupby('dropoff_location_name')
     .agg({'average_trips': 'sum'})
     .plot(y='average_trips', kind='bar', grid=True, figsize=(10,5), cmap='PiYG')
)
plt.show()
Loop and River North are the most popular taxi destinations; the remaining leaders lag well behind.
#Ranking the companies by the number of trips
ranged_companies = df_company.groupby('company_name')['trips_amount'].max().sort_values(ascending=False).reset_index()
ranged_companies
#computing the z-score for taxi trips to examine the distribution and rank the companies
df_company['trips_z']=(df_company['trips_amount']-df_company['trips_amount'].mean())/df_company['trips_amount'].std()
df_company.head()
#distinguishing z-scores by color: red for negative values, purple for positive ones
df_company['colors'] = ['red' if x<0 else 'purple' for x in df_company['trips_z']]
df_company['colors']
plt.figure(figsize=(14,20))
plt.hlines(y=df_company.company_name, xmin=0, xmax=df_company.trips_z, colors=df_company.colors, alpha=0.4, linewidth=10)
plt.show()
The graph clearly shows that the most successful and popular taxi companies in Chicago as of November 2017, those with an above-average number of trips, are on the right (in purple), while the less successful ones are on the left (in red).
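As a sanity check on the standardization above: z-scores computed as (x - mean) / std should themselves have mean ≈ 0 and standard deviation ≈ 1. A minimal sketch on synthetic data (the distribution parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=100, scale=15, size=1000)  # arbitrary synthetic sample

# same standardization as trips_z above (sample std, ddof=1)
z = (x - x.mean()) / x.std(ddof=1)

print(abs(z.mean()) < 1e-9, abs(z.std(ddof=1) - 1) < 1e-9)  # → True True
```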
#Getting the top 16 popular taxi companies marked in purple in the previous chart
top_16_companies = df_company.nlargest(16,['trips_amount'])
top_16_companies
top_taxi = top_16_companies.company_name.tolist()
df_company['company_name_tree'] = [x if x in top_taxi else 'others' for x in df_company['company_name']]
df_company['company_name_tree'].value_counts()
# General market view by total number of trips
df_taxi_tree = df_company[['company_name_tree', 'trips_amount']].groupby('company_name_tree').sum().sort_values(by='trips_amount', ascending=False).reset_index()
df_taxi_tree=df_taxi_tree[df_taxi_tree['trips_amount']!=0]
sizes=df_taxi_tree.trips_amount.values.tolist()
labels = df_taxi_tree.apply(lambda x: str(x['company_name_tree']) + "\n" + str(round(x['trips_amount'])), axis=1)
colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]
# Draw Plot
plt.figure(figsize=(15,9), dpi= 80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)
# Decorate
plt.title('Market Structure of Taxi Companies in Chicago, November 2017')
#plt.axis('off')
plt.show()
The graphs show that Flash Cab is the undisputed leader of the Chicago market, with a market share twice that of its closest competitor, Taxi Affiliation Services. The rest of the top 8 companies have very similar market shares, with no distinct leaders among them. Then comes a slight gap, followed by another group of 6 companies closing out the top.
fig,ax=plt.subplots(figsize=(13,10))
ax.vlines(x=top_10_neighborhoods.dropoff_location_name, ymin=0, ymax=top_10_neighborhoods.average_trips, color='purple', alpha=0.7, linewidth=2)
ax.scatter(x=top_10_neighborhoods.dropoff_location_name, y=top_10_neighborhoods.average_trips, s=75, color='lightgreen', alpha=0.7)
ax.set_title('Lollipop Chart for Top 10 Neighborhoods', fontdict={'size':16})
ax.set_ylabel('average number of rides')
ax.set_xticks(top_10_neighborhoods.dropoff_location_name)
ax.set_xticklabels(top_10_neighborhoods.dropoff_location_name,rotation=90, fontdict={'horizontalalignment':'right','size':14})
for row in top_10_neighborhoods.itertuples():
    ax.text(row.dropoff_location_name, row.average_trips+30, s=round(row.average_trips,2), rotation=90, fontdict={'horizontalalignment':'center','size':10})
#top 10 locations in comparison to the other 25
top_35_neighborhoods = df_trips.nlargest(35,['average_trips'])
top_35_neighborhoods['colors'] = ['blue' if x<1259 else 'green' for x in top_35_neighborhoods['average_trips']]  # 1259 is the 10th-largest average_trips value
top_35_neighborhoods['colors']
plt.figure(figsize=(10,11))
plt.hlines(y=top_35_neighborhoods.dropoff_location_name, xmin=0, xmax=top_35_neighborhoods.average_trips, colors=top_35_neighborhoods.colors, alpha=0.4, linewidth=10)
plt.show()
The graph shows the top 10 drop-off destinations (in green); the remaining, less popular destinations are marked in blue. Clearly, people take taxis to parks less often than to pubs and restaurants.
The top 10 neighborhoods vary greatly in the number of drop-offs. The most popular destination was Loop with 10727 drop-offs, whereas Sheffield & DePaul received almost 9 times fewer, 1259 drop-offs. The top 4 locations (Loop, River North, Streeterville, and West Loop) leave all the other localities far behind. Still, all the top spots are very popular destinations with plenty of eateries, fancy shops, theaters, and dining and nightlife hotspots. O'Hare airport is only fifth in popularity. Sheffield & DePaul, closing out the top 10, is a popular place to live among young professionals, many of whom rent housing there. Based on the most popular taxi destinations, we can assume that most clients of the taxi companies are tourists and young people.
#loading the dataset
df_airport = pd.read_csv('./project_sql_result_07.csv', sep=',')
df_airport.sample(7)
df_airport.info()
The dataset does not contain any missing values, but we need to change the 'start_ts' type to datetime.
df_airport['start_ts'] = pd.to_datetime(df_airport['start_ts'])  #changing data type from object to datetime
df_airport['start_ts'].head()
df_airport['weather_conditions'].value_counts()
#defining days of week
df_airport['dow'] = df_airport['start_ts'].dt.weekday
df_airport['dow'].value_counts()
The dataset contains only Saturdays.
#statistical summary
df_airport.describe()
df_airport.hist('duration_seconds', bins=30)
plt.show()
The distance between the Loop and O'Hare Airport is about 30 km, so we can drop outliers. A trip to ORD could not take less than 800 seconds (that would imply an average speed of 135 km/h, while the speed limit in the area is 112 km/h). On the other hand, the trip could not take more than 5000 seconds (which implies an average speed of about 21 km/h).
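The cutoffs can be double-checked with simple arithmetic, assuming the ~30 km route mentioned above:

```python
distance_km = 30  # approximate Loop-to-O'Hare route length from the text

def avg_speed_kmh(duration_s):
    # speed in km/h = distance / (duration converted to hours)
    return distance_km * 3600 / duration_s

print(avg_speed_kmh(800))   # → 135.0 (implausibly fast)
print(avg_speed_kmh(5000))  # → 21.6 (implausibly slow)
```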
df_airport_no_outliers = df_airport.query('duration_seconds>=800 and duration_seconds<=5000')
df_airport_no_outliers.hist('duration_seconds', bins=30)
plt.show()
There are two strong peaks: at 1300 sec. and 2400 sec. At this point this can be explained by different road conditions: some trips were almost twice as fast as the others.
(H0 must state that the result is unchanged)
H0: The average duration of rides from the Loop to O'Hare International Airport stays the same on rainy Saturdays.
H1: The average duration of rides from the Loop to O'Hare International Airport changes on rainy Saturdays.
#creating 2 dataframes for rainy and sunny Saturdays
df_rainy = df_airport_no_outliers.query('weather_conditions == "Bad"')
df_rainy.shape
df_sunny = df_airport_no_outliers.query('weather_conditions == "Good"')
df_sunny.shape
#Getting rid of outliers by defining an upper fence for each trip group
#defining outliers with the 3-sigma method for both dataframes
#as we are calculating the std of a sample, ddof is set to 1
std_score_rainy = np.std(df_rainy['duration_seconds'], ddof=1)
three_sigma_score_rainy_upper = round((df_rainy['duration_seconds'].mean() + std_score_rainy*3),2)
std_score_sunny = np.std(df_sunny['duration_seconds'], ddof=1)
three_sigma_score_sunny_upper = round((df_sunny['duration_seconds'].mean() + std_score_sunny*3),2)
print('99.7% of trips on rainy Saturdays last no more than', three_sigma_score_rainy_upper, 'seconds. \n99.7% of trips on sunny Saturdays last up to', three_sigma_score_sunny_upper, 'seconds.')
#setting df without outliers
df_rainy_no_outliers = df_rainy.query('duration_seconds<=@three_sigma_score_rainy_upper')
df_sunny_no_outliers = df_sunny.query('duration_seconds<=@three_sigma_score_sunny_upper')
df_rainy_no_outliers.shape
df_sunny_no_outliers.shape
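The upper fence above relies on the 3-sigma rule: for roughly normal data, about 99.7% of observations fall within mean ± 3·std, and even more fall below the one-sided upper fence. A minimal sketch on synthetic data (the parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=2000, scale=600, size=100_000)  # arbitrary "durations"

# one-sided upper fence, same formula as the trip groups above
upper = sample.mean() + 3 * sample.std(ddof=1)
share_kept = (sample <= upper).mean()
print(share_kept)  # almost all observations are kept
```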
#computing the variance of the two samples to decide whether we can treat them as equal for the t-test
#ddof=1 for sample variance, consistent with the std calculation above
variance_rainy = np.var(df_rainy_no_outliers['duration_seconds'], ddof=1)
print('Variance for trips to the airport on rainy Saturdays is', variance_rainy)
variance_sunny = np.var(df_sunny_no_outliers['duration_seconds'], ddof=1)
print('Variance for trips on sunny Saturdays is', variance_sunny)
Explanation of the method choice for hypothesis testing
We have two samples of continuous data drawn from approximately normally distributed populations with different variances, so we will conduct Welch's t-test (in Python, a two-tailed t-test with 'equal_var = False').
The critical statistical significance level is set at 0.05, as this is a commonly accepted level; since we are not conducting medical testing, higher accuracy is not required.
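A toy illustration of the Welch setup, using two synthetic samples with clearly unequal variances (all numbers here are made up for the example): with equal_var=False, a real difference in means is still detected.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(1)
# hypothetical "sunny" and "rainy" durations: different means AND variances
sunny = rng.normal(loc=2000, scale=400, size=500)
rainy = rng.normal(loc=2300, scale=700, size=500)

res = st.ttest_ind(rainy, sunny, equal_var=False)  # Welch's t-test
print(res.pvalue < 0.05)  # → True: the difference in means is detected
```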
alpha = .05
# "equal_var = False" as previous calculations proved that the 2 samples have different variance value.
results = st.ttest_ind(df_rainy_no_outliers['duration_seconds'], df_sunny_no_outliers['duration_seconds'], equal_var = False)
print('p-value:', results.pvalue/2)
#we are making two-tailed test as we are checking whether the average travel time to the airport depends on weather conditions,
#no matter if it gets longer or shorter
if (results.pvalue/2 < alpha):
print("We reject the null hypothesis")
else:
print("We can't reject the null hypothesis")
The data provides sufficient evidence, at the significance level we selected (5%), to reject the null hypothesis. Therefore, we conclude that the average duration of rides from the Loop to O'Hare International Airport changes on rainy Saturdays (μ1 ≠ μ2).
We looked into 3 datasets, identified the market leaders among taxi companies as of November 2017 and the popular taxi destinations, and tested a hypothesis.
As for the taxi companies, we found that there are over 60 companies on the market and that Flash Cab is the undisputed leader in Chicago.
We also examined the 10 most popular destinations for taxi rides in the area. It came as no surprise that night clubs, restaurants, and the airport are much more popular taxi destinations than parks and museums.
Finally, we tested the hypothesis that weather conditions influence the average ride duration to the airport. We concluded that the average ride time in good weather differs from that in bad weather. Since we checked only Saturday trips, the influence of weekday traffic jams is close to zero.