The aim of the project is to find patterns in the available data for a new ride-sharing company launching in Chicago. We will study the following:

  • passenger preferences;
  • the impact of external factors on rides;
  • the impact of weather on ride frequency (a hypothesis test).

Table of contents:

  • Data study
  • Top 10 neighborhoods
  • Graph: taxi companies and number of rides
  • Graph: top 10 neighborhoods by number of drop-offs
  • Testing hypotheses
  • Conclusion

Data study

In [1]:
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats as st
import seaborn as sns
import squarify

df_company = pd.read_csv('./project_sql_result_01.csv', sep=',')
df_trips = pd.read_csv('./project_sql_result_04.csv', sep=',')
In [2]:
df_company.head()
Out[2]:
company_name trips_amount
0 Flash Cab 19558
1 Taxi Affiliation Services 11422
2 Medallion Leasin 10367
3 Yellow Cab 9888
4 Taxi Affiliation Service Yellow 9299
In [3]:
df_trips.head()
Out[3]:
dropoff_location_name average_trips
0 Loop 10727.466667
1 River North 9523.666667
2 Streeterville 6664.666667
3 West Loop 5163.666667
4 O'Hare 2546.900000
In [4]:
df_company.sample(6)
Out[4]:
company_name trips_amount
50 2823 - 73307 Lee Express Inc 15
44 2092 - 61288 Sbeih company 27
63 3556 - 36214 RC Andrews Cab 2
10 Choice Taxi Association 5015
37 1469 - 64126 Omar Jada 36
23 KOAM Taxi Association 1259
In [5]:
df_trips.sample(6)
Out[5]:
dropoff_location_name average_trips
55 Dunning 30.166667
77 Archer Heights 8.366667
66 Washington Park 16.033333
10 Lincoln Park 1246.533333
16 Logan Square 620.333333
51 Woodlawn 36.166667
In [6]:
df_company.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   company_name  64 non-null     object
 1   trips_amount  64 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 832.0+ bytes
In [7]:
df_trips.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94 entries, 0 to 93
Data columns (total 2 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   dropoff_location_name  94 non-null     object 
 1   average_trips          94 non-null     float64
dtypes: float64(1), object(1)
memory usage: 1.2+ KB
In [8]:
#statistical summary of the df_company data
df_company.describe()
Out[8]:
trips_amount
count 64.000000
mean 2145.484375
std 3812.310186
min 2.000000
25% 20.750000
50% 178.500000
75% 2106.500000
max 19558.000000
In [9]:
fig = px.box(df_company, y="trips_amount") 
fig.show()
In [10]:
#statistical summary of the df_trips data
df_trips.describe()
Out[10]:
average_trips
count 94.000000
mean 599.953728
std 1714.591098
min 1.800000
25% 14.266667
50% 52.016667
75% 298.858333
max 10727.466667
In [11]:
fig = px.box(df_trips, y="average_trips")
fig.show()
In [12]:
#Checking data for zeros:
for i in df_company.columns:
    print(i, len(df_company[df_company[i]==0]))
company_name 0
trips_amount 0
In [13]:
#Checking data for zeros:
for i in df_trips.columns:
    print(i, len(df_trips[df_trips[i]==0]))
dropoff_location_name 0
average_trips 0
In [14]:
df_company['trips_amount'].hist(bins=30)
plt.show()
In [15]:
df_trips['average_trips'].hist(bins=30)
plt.show()
In [16]:
#statistical summary for categorical variables for df_company 
df_company.describe(include=['object'])
Out[16]:
company_name
count 64
unique 64
top 3620 - 52292 David K. Cab Corp.
freq 1
In [17]:
#statistical summary for categorical variables for df_trips 
df_trips.describe(include=['object'])
Out[17]:
dropoff_location_name
count 94
unique 94
top Douglas
freq 1


Our initial data consists of two tables with information on taxi companies and trip destinations.

The company table has 64 unique company names. The distribution of the number of rides per taxi company over the study dates (trips_amount) is positively skewed, with a long tail to the right. The middle 50% of companies had between roughly 21 and 2,107 rides over these days. There are quite a lot of outliers in the table. The standard deviation is 3,812.


The second table has 94 unique location names. The distribution of the average number of rides ending in each neighborhood is also positively skewed, with many outliers. The middle 50% of neighborhoods averaged between roughly 14 and 299 rides. The standard deviation is about 1,715. Neither table has missing values, and no data type changes are needed in either.
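To back the outlier claims with numbers, here is a minimal sketch using the 1.5 × IQR rule (the same rule the box plots above rely on); count_upper_outliers is a helper name introduced here for illustration:

#counting points above the upper Tukey fence (Q3 + 1.5*IQR) in each table
def count_upper_outliers(series):
    q1, q3 = series.quantile([0.25, 0.75])
    upper_fence = q3 + 1.5 * (q3 - q1)
    return (series > upper_fence).sum()

print('company outliers:', count_upper_outliers(df_company['trips_amount']))
print('neighborhood outliers:', count_upper_outliers(df_trips['average_trips']))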

Top 10 neighborhoods

In [18]:
#nlargest method for defining top 10 neighborhoods
top_10_neighborhoods = df_trips.nlargest(10,['average_trips'])
top_10_neighborhoods
Out[18]:
dropoff_location_name average_trips
0 Loop 10727.466667
1 River North 9523.666667
2 Streeterville 6664.666667
3 West Loop 5163.666667
4 O'Hare 2546.900000
5 Lake View 2420.966667
6 Grant Park 2068.533333
7 Museum Campus 1510.000000
8 Gold Coast 1364.233333
9 Sheffield & DePaul 1259.766667
In [19]:
#visualizing number of trips to locations
(top_10_neighborhoods.groupby('dropoff_location_name')
 .agg({'average_trips': 'sum'})
 .plot(y='average_trips', kind='bar', grid=True, figsize=(10,5), cmap='PiYG')
)
plt.show()

Loop and River North are the most popular taxi destinations; the remaining leaders lag well behind them.

Graph: taxi companies and number of rides

In [20]:
#Ranging the companies by the number of trips
ranged_companies = df_company.groupby('company_name')['trips_amount'].max().sort_values(ascending=False).reset_index()
ranged_companies 
Out[20]:
company_name trips_amount
0 Flash Cab 19558
1 Taxi Affiliation Services 11422
2 Medallion Leasin 10367
3 Yellow Cab 9888
4 Taxi Affiliation Service Yellow 9299
... ... ...
59 2733 - 74600 Benny Jona 7
60 4053 - 40193 Adwar H. Nikola 7
61 5874 - 73628 Sergey Cab Corp. 5
62 2241 - 44667 - Felman Corp, Manuel Alonso 3
63 3556 - 36214 RC Andrews Cab 2

64 rows × 2 columns

In [21]:
#computing z-scores for trip counts to characterize the distribution and rank the companies
df_company['trips_z']=(df_company['trips_amount']-df_company['trips_amount'].mean())/df_company['trips_amount'].std()
df_company.head()
Out[21]:
company_name trips_amount trips_z
0 Flash Cab 19558 4.567445
1 Taxi Affiliation Services 11422 2.433306
2 Medallion Leasin 10367 2.156570
3 Yellow Cab 9888 2.030925
4 Taxi Affiliation Service Yellow 9299 1.876425
In [22]:
#distinguishing z-scores by color: red for negative values, purple for positive ones
df_company['colors'] = ['red' if x<0 else 'purple' for x in df_company['trips_z']]
df_company['colors']
 
plt.figure(figsize=(14,20))
plt.hlines(y=df_company.company_name, xmin=0, xmax=df_company.trips_z, colors=df_company.colors, alpha=0.4, linewidth=10)
plt.show()

The graph clearly separates the companies: the most successful taxi companies in Chicago in November 2017, with an above-average number of trips, point to the right (in purple), while the less successful ones point to the left (in red).

In [23]:
#Getting the top 16 taxi companies, marked in purple in the previous chart
top_16_companies = df_company.nlargest(16,['trips_amount'])
top_16_companies
Out[23]:
company_name trips_amount trips_z colors
0 Flash Cab 19558 4.567445 purple
1 Taxi Affiliation Services 11422 2.433306 purple
2 Medallion Leasin 10367 2.156570 purple
3 Yellow Cab 9888 2.030925 purple
4 Taxi Affiliation Service Yellow 9299 1.876425 purple
5 Chicago Carriage Cab Corp 9181 1.845473 purple
6 City Service 8448 1.653201 purple
7 Sun Taxi 7701 1.457257 purple
8 Star North Management LLC 7455 1.392729 purple
9 Blue Ribbon Taxi Association Inc. 5953 0.998742 purple
10 Choice Taxi Association 5015 0.752697 purple
11 Globe Taxi 4383 0.586919 purple
12 Dispatch Taxi Affiliation 3355 0.317266 purple
13 Nova Taxi Affiliation Llc 3175 0.270050 purple
14 Patriot Taxi Dba Peace Taxi Associat 2235 0.023481 purple
15 Checker Taxi Affiliation 2216 0.018497 purple
In [24]:
top_taxi = top_16_companies.company_name.tolist()
df_company['company_name_tree'] = [x if x in top_taxi else 'others' for x in df_company['company_name']]
df_company['company_name_tree'].value_counts()
Out[24]:
others                                  48
Yellow Cab                               1
Taxi Affiliation Services                1
Nova Taxi Affiliation Llc                1
Taxi Affiliation Service Yellow          1
Choice Taxi Association                  1
Star North Management LLC                1
City Service                             1
Dispatch Taxi Affiliation                1
Globe Taxi                               1
Chicago Carriage Cab Corp                1
Checker Taxi Affiliation                 1
Flash Cab                                1
Medallion Leasin                         1
Blue Ribbon Taxi Association Inc.        1
Sun Taxi                                 1
Patriot Taxi Dba Peace Taxi Associat     1
Name: company_name_tree, dtype: int64
In [25]:
# General market view by total number of trips
df_taxi_tree = df_company[['company_name_tree', 'trips_amount']].groupby('company_name_tree').sum().sort_values(by='trips_amount', ascending=False).reset_index()
df_taxi_tree=df_taxi_tree[df_taxi_tree['trips_amount']!=0]

sizes=df_taxi_tree.trips_amount.values.tolist()
labels = df_taxi_tree.apply(lambda x: str(x['company_name_tree']) + "\n" + str(round(x['trips_amount'])), axis=1)


colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]

# Draw Plot
plt.figure(figsize=(15,9), dpi= 80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)

# Decorate
plt.title('Market Structure of Taxi Companies in Chicago, November 2017')
#plt.axis('off')
plt.show()

The graphs show that Flash Cab is the undisputed leader of the Chicago market, with a market share almost twice that of its closest competitor, Taxi Affiliation Services. The rest of the top 8 companies have very similar market shares, with no distinct leaders among them. Then there is a slight gap, followed by another group of six companies closing out the top.
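To attach concrete numbers to these shares, a short follow-up sketch using the df_taxi_tree frame built above (share_pct is a column name introduced here for illustration):

#market share of each group as a percentage of all trips
df_taxi_tree['share_pct'] = (df_taxi_tree['trips_amount']
                             / df_taxi_tree['trips_amount'].sum() * 100).round(1)
print(df_taxi_tree[['company_name_tree', 'share_pct']].head())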

Graph: top 10 neighborhoods by number of drop-offs

In [26]:
fig,ax=plt.subplots(figsize=(13,10))
ax.vlines(x=top_10_neighborhoods.dropoff_location_name, ymin=0, ymax=top_10_neighborhoods.average_trips, color='purple', alpha=0.7, linewidth=2)
ax.scatter(x=top_10_neighborhoods.dropoff_location_name, y=top_10_neighborhoods.average_trips, s=75, color='lightgreen', alpha=0.7)


ax.set_title('Lollipop Chart for Top 10 Neighborhoods', fontdict={'size':16})
ax.set_ylabel('average number of rides')
ax.set_xticks(top_10_neighborhoods.dropoff_location_name)
ax.set_xticklabels(top_10_neighborhoods.dropoff_location_name,rotation=90, fontdict={'horizontalalignment':'right','size':14})

for row in top_10_neighborhoods.itertuples():
    ax.text(row.dropoff_location_name, row.average_trips+30,s=round(row.average_trips,2), rotation=90, fontdict={'horizontalalignment':'center','size':10})
In [27]:
#top 10 locations in comparison to the other 25
top_35_neighborhoods = df_trips.nlargest(35,['average_trips'])
top_35_neighborhoods['colors'] = ['blue' if x<1259 else 'green' for x in top_35_neighborhoods['average_trips']]
top_35_neighborhoods['colors']
 
plt.figure(figsize=(10,11))
plt.hlines(y=top_35_neighborhoods.dropoff_location_name, xmin=0, xmax=top_35_neighborhoods.average_trips, colors=top_35_neighborhoods.colors, alpha=0.4, linewidth=10)
plt.show()

The graph shows the top 10 drop-off destinations (in green); the remaining, less popular destinations are marked in blue. Clearly, people take taxis to parks less often than to pubs and restaurants.

The top 10 neighborhoods vary greatly in the number of drop-offs. The most popular destination was the Loop, with 10,727 average drop-offs, whereas Sheffield & DePaul received almost 9 times fewer, at about 1,260. The top 4 locations (Loop, River North, Streeterville, and West Loop) leave all the other localities far behind. Still, all the top spots are very popular destinations with plenty of eateries, fancy shops, theaters, and nightlife hotspots; O'Hare airport comes only fifth. Sheffield & DePaul, which closes the top 10, is a popular residential area among young professionals, many of whom rent there. Based on the most popular taxi destinations, we can assume that most taxi clients are tourists and young people.
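The "almost 9-fold" gap can be reproduced directly from top_10_neighborhoods; a quick sketch:

#ratio of the most to the least popular destination within the top 10
top = top_10_neighborhoods['average_trips']
print(round(top.max() / top.min(), 1))  #Loop vs Sheffield & DePaul, ~8.5x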

Testing hypotheses

In [28]:
#loading the dataset
df_airport = pd.read_csv('./project_sql_result_07.csv', sep=',')
df_airport.sample(7)
Out[28]:
start_ts weather_conditions duration_seconds
976 2017-11-25 10:00:00 Good 1584.0
218 2017-11-18 17:00:00 Bad 2100.0
784 2017-11-11 17:00:00 Good 2089.0
684 2017-11-04 17:00:00 Bad 2151.0
234 2017-11-04 05:00:00 Good 1620.0
117 2017-11-11 07:00:00 Good 1380.0
1011 2017-11-18 14:00:00 Good 3660.0
In [29]:
df_airport.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1068 entries, 0 to 1067
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   start_ts            1068 non-null   object 
 1   weather_conditions  1068 non-null   object 
 2   duration_seconds    1068 non-null   float64
dtypes: float64(1), object(2)
memory usage: 16.8+ KB

The dataset contains no missing values, but we need to convert 'start_ts' to the datetime type.

In [30]:
df_airport['start_ts'] = pd.to_datetime(df_airport['start_ts'], format='%Y-%m-%d %H:%M:%S') #changing data type from object to datetime
df_airport['start_ts'].head()
Out[30]:
0   2017-11-25 16:00:00
1   2017-11-25 14:00:00
2   2017-11-25 12:00:00
3   2017-11-04 10:00:00
4   2017-11-11 07:00:00
Name: start_ts, dtype: datetime64[ns]
In [31]:
df_airport['weather_conditions'].value_counts()
Out[31]:
Good    888
Bad     180
Name: weather_conditions, dtype: int64
In [32]:
#defining days of week
df_airport['dow'] = df_airport['start_ts'].dt.weekday
df_airport['dow'].value_counts()
Out[32]:
5    1068
Name: dow, dtype: int64

The dataset contains only Saturdays (weekday 5).

In [33]:
#statistical summary
df_airport.describe()
Out[33]:
duration_seconds dow
count 1068.000000 1068.0
mean 2071.731273 5.0
std 769.461125 0.0
min 0.000000 5.0
25% 1438.250000 5.0
50% 1980.000000 5.0
75% 2580.000000 5.0
max 7440.000000 5.0


In [34]:
df_airport.hist('duration_seconds', bins=30)

plt.show()

The distance between the Loop and O'Hare Airport is about 30 km, so we can drop outliers. A trip to ORD could not take less than 800 seconds (that would imply an average speed of about 135 km/h, while the speed limit in the area is 112 km/h). On the other hand, a trip could not take more than 5000 seconds (an average speed of roughly 22 km/h).
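The arithmetic behind these cutoffs, as a small sketch (assuming the ~30 km distance stated above):

#implied average speeds for the chosen duration cutoffs
distance_km = 30
for duration_s in (800, 5000):
    speed_kmh = distance_km / duration_s * 3600
    print(duration_s, 's ->', round(speed_kmh), 'km/h')
#800 s -> 135 km/h (implausibly fast), 5000 s -> 22 km/h (implausibly slow)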

In [35]:
df_airport_no_outliers = df_airport.query('duration_seconds>=800 and duration_seconds<=5000')
df_airport_no_outliers.hist('duration_seconds', bins=30)

plt.show()

There are two strong peaks: around 1300 s and 2400 s. At this point, this can be explained by different road conditions: some trips were almost twice as fast as others.
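One way to probe this explanation is to overlay the duration histograms by weather condition: if bad weather drives the slower peak, the 'Bad' distribution should sit to the right. An exploratory sketch using df_airport_no_outliers:

#overlaying normalized duration histograms for good and bad weather
for condition, color in (('Good', 'green'), ('Bad', 'red')):
    subset = df_airport_no_outliers.query('weather_conditions == @condition')
    plt.hist(subset['duration_seconds'], bins=30, alpha=0.5,
             color=color, label=condition, density=True)
plt.legend()
plt.xlabel('duration, seconds')
plt.show()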

Stating the hypotheses (H0 must state an unchanged result):

H0: The average duration of rides from the Loop to O'Hare International Airport stays the same on rainy Saturdays.

H1: The average duration of rides from the Loop to O'Hare International Airport changes on rainy Saturdays.

In [36]:
#creating two dataframes: rainy and sunny Saturdays
df_rainy = df_airport_no_outliers.query('weather_conditions == "Bad"')
df_rainy.shape
Out[36]:
(178, 4)
In [37]:
df_sunny = df_airport_no_outliers.query('weather_conditions == "Good"')
df_sunny.shape
Out[37]:
(878, 4)
In [38]:
#Getting rid of outliers by defining an upper fence for each trip group

#defining outliers with the 3-sigma method for both dataframes
#as we are calculating std for a sample, ddof is set to 1

std_score_rainy = np.std(df_rainy['duration_seconds'], ddof=1)

three_sigma_score_rainy_upper = round((df_rainy['duration_seconds'].mean() + std_score_rainy*3),2)



std_score_sunny = np.std(df_sunny['duration_seconds'], ddof=1)

three_sigma_score_sunny_upper = round((df_sunny['duration_seconds'].mean() + std_score_sunny*3),2)

print('99.7% of trips on rainy Saturdays last not more than ', three_sigma_score_rainy_upper,'seconds. \n99.7% of trips on sunny Saturdays last up to ', three_sigma_score_sunny_upper,'seconds.')
99.7% of trips on rainy Saturdays last not more than  4533.7 seconds. 
99.7% of trips on sunny Saturdays last up to  4111.24 seconds.
In [39]:
#setting df without outliers

df_rainy_no_outliers = df_rainy.query('duration_seconds<=@three_sigma_score_rainy_upper')

df_sunny_no_outliers = df_sunny.query('duration_seconds<=@three_sigma_score_sunny_upper')
In [40]:
df_rainy_no_outliers.shape
Out[40]:
(177, 4)
In [41]:
df_sunny_no_outliers.shape
Out[41]:
(874, 4)
In [42]:
#estimating the variance of the two samples to decide whether they can be treated as equal for the t-test
#note: np.var defaults to ddof=0 (population variance), which is fine for this rough comparison
variance_rainy = np.var(df_rainy_no_outliers['duration_seconds'])
print('Variance for trips to airport on rainy Saturdays is ', variance_rainy)
Variance for trips to airport on rainy Saturdays is  446454.76746784133
In [43]:
variance_sunny = np.var(df_sunny_no_outliers['duration_seconds'])
print('Variance for trips on sunny Saturdays is ', variance_sunny)
Variance for trips on sunny Saturdays is  463236.5557760674


Explanation of the method choice for hypothesis testing

We have two independent samples of continuous data drawn from roughly normally distributed populations with different variances, so we will conduct Welch's test (in Python, a two-tailed t-test with the correction equal_var=False).

The critical statistical significance level is set at 0.05: it is a commonly accepted level, and since this is not medical testing, higher accuracy is not required.
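For transparency, the Welch statistic can also be computed by hand from its textbook formula, t = (x̄1 − x̄2) / sqrt(s1²/n1 + s2²/n2); this sketch should match scipy's result in the next cell:

#Welch's t computed manually from sample means and variances (ddof=1)
x1 = df_rainy_no_outliers['duration_seconds']
x2 = df_sunny_no_outliers['duration_seconds']
se = np.sqrt(x1.var(ddof=1)/len(x1) + x2.var(ddof=1)/len(x2))
t_manual = (x1.mean() - x2.mean()) / se
print('t =', round(t_manual, 3))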

In [44]:
alpha = .05 
# "equal_var = False" as the previous calculations showed that the two samples have different variances. 
results = st.ttest_ind(df_rainy_no_outliers['duration_seconds'], df_sunny_no_outliers['duration_seconds'], equal_var = False) 

print('p-value:', results.pvalue) 
#this is a two-tailed test: we check whether the average travel time to the airport depends on weather conditions, 
#no matter whether it gets longer or shorter, so the full (unhalved) p-value is used
 
if (results.pvalue < alpha):
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis")
p-value: 5.955905171636982e-14
We reject the null hypothesis

The data provides sufficient evidence, at the significance level we selected (5%), to reject the null hypothesis. Therefore, we can conclude that the average duration of rides from the Loop to O'Hare International Airport changes on rainy Saturdays (μ1 ≠ μ2).
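The test tells us the means differ, but not by how much or in which direction; a quick follow-up sketch on the same frames quantifies that:

#direction and size of the difference in mean trip duration
mean_rainy = df_rainy_no_outliers['duration_seconds'].mean()
mean_sunny = df_sunny_no_outliers['duration_seconds'].mean()
print('rainy Saturdays mean, s:', round(mean_rainy))
print('sunny Saturdays mean, s:', round(mean_sunny))
print('difference, s:', round(mean_rainy - mean_sunny))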

Conclusion

We looked into three data sets, identified the market leaders among taxi companies as of November 2017 and the popular taxi destinations, and tested a hypothesis.
As to the taxi companies, we found that there are over 60 companies on the market and that Flash Cab is the undisputed leader in Chicago.
We also examined the 10 most popular destinations for taxi rides in the area. It came as no surprise that night clubs, restaurants, and the airport are far more popular taxi destinations than parks and museums.
Finally, we tested the hypothesis that weather conditions influence the average ride duration to the airport, and concluded that the average ride time in good weather differs from that in bad weather. Since we examined only Saturday trips, the influence of weekday traffic congestion is close to zero.