The aim of the project is to find patterns in the available data for a new ride-sharing company launching in Chicago. We will study the following:

  • passenger preferences;
  • the impact of external factors on rides;
  • the impact of weather on ride frequency (a hypothesis test).

Table of contents:

  • Data study
  • Top 10 neighborhoods
  • Graph: taxi companies and number of rides
  • Graph: top 10 neighborhoods by number of drop-offs
  • Testing hypotheses
  • Conclusion

Data study

In [1]:
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats as st
import seaborn as sns
import squarify

df_company = pd.read_csv('./project_sql_result_01.csv', sep=',')
df_trips = pd.read_csv('./project_sql_result_04.csv', sep=',')
In [2]:
df_company.head()
Out[2]:
company_name trips_amount
0 Flash Cab 19558
1 Taxi Affiliation Services 11422
2 Medallion Leasin 10367
3 Yellow Cab 9888
4 Taxi Affiliation Service Yellow 9299
In [3]:
df_trips.head()
Out[3]:
dropoff_location_name average_trips
0 Loop 10727.466667
1 River North 9523.666667
2 Streeterville 6664.666667
3 West Loop 5163.666667
4 O'Hare 2546.900000
In [4]:
df_company.sample(6)
Out[4]:
company_name trips_amount
50 2823 - 73307 Lee Express Inc 15
44 2092 - 61288 Sbeih company 27
63 3556 - 36214 RC Andrews Cab 2
10 Choice Taxi Association 5015
37 1469 - 64126 Omar Jada 36
23 KOAM Taxi Association 1259
In [5]:
df_trips.sample(6)
Out[5]:
dropoff_location_name average_trips
55 Dunning 30.166667
77 Archer Heights 8.366667
66 Washington Park 16.033333
10 Lincoln Park 1246.533333
16 Logan Square 620.333333
51 Woodlawn 36.166667
In [6]:
df_company.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   company_name  64 non-null     object
 1   trips_amount  64 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 832.0+ bytes
In [7]:
df_trips.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94 entries, 0 to 93
Data columns (total 2 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   dropoff_location_name  94 non-null     object 
 1   average_trips          94 non-null     float64
dtypes: float64(1), object(1)
memory usage: 1.2+ KB
In [8]:
#statistical summary of the df_company data
df_company.describe()
Out[8]:
trips_amount
count 64.000000
mean 2145.484375
std 3812.310186
min 2.000000
25% 20.750000
50% 178.500000
75% 2106.500000
max 19558.000000
In [9]:
fig = px.box(df_company, y="trips_amount") 
fig.show()
In [10]:
#statistical summary of the df_trips data
df_trips.describe()
Out[10]:
average_trips
count 94.000000
mean 599.953728
std 1714.591098
min 1.800000
25% 14.266667
50% 52.016667
75% 298.858333
max 10727.466667
In [11]:
fig = px.box(df_trips, y="average_trips")
fig.show()
In [12]:
#Checking data for zeros:
for i in df_company.columns:
    print(i, len(df_company[df_company[i]==0]))
company_name 0
trips_amount 0
In [13]:
#Checking data for zeros:
for i in df_trips.columns:
    print(i, len(df_trips[df_trips[i]==0]))
dropoff_location_name 0
average_trips 0
In [14]:
df_company['trips_amount'].hist(bins=30)
plt.show()
In [15]:
df_trips['average_trips'].hist(bins=30)
plt.show()
In [16]:
#statistical summary for categorical variables for df_company 
df_company.describe(include=['object'])
Out[16]:
company_name
count 64
unique 64
top 3620 - 52292 David K. Cab Corp.
freq 1
In [17]:
#statistical summary for categorical variables for df_trips 
df_trips.describe(include=['object'])
Out[17]:
dropoff_location_name
count 94
unique 94
top Douglas
freq 1


Our initial data consists of two tables with information on taxi companies and trip destinations.

The company table has 64 unique company names. The distribution of the number of rides per taxi company over the study dates (trips_amount) is positively skewed, with a long tail to the right. The middle 50% of companies had between roughly 21 and 2,107 rides over these days. There are quite a lot of outliers in the table. The standard deviation is 3,812.


The second table has 94 unique location names. The distribution of the average number of rides ending in each neighborhood is also positively skewed, with many outliers. The middle 50% of neighborhoods averaged between roughly 14 and 299 rides. The standard deviation is about 1,715. Neither table has missing values, and no data type changes are needed in either.
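To back the outlier claims with numbers, here is a minimal sketch using the 1.5 × IQR rule (the same rule the box plots above rely on); count_upper_outliers is a helper name introduced here for illustration:

#counting points above the upper Tukey fence (Q3 + 1.5*IQR) in each table
def count_upper_outliers(series):
    q1, q3 = series.quantile([0.25, 0.75])
    upper_fence = q3 + 1.5 * (q3 - q1)
    return (series > upper_fence).sum()

print('company outliers:', count_upper_outliers(df_company['trips_amount']))
print('neighborhood outliers:', count_upper_outliers(df_trips['average_trips']))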

Top 10 neighborhoods

In [18]:
#nlargest method for defining top 10 neighborhoods
top_10_neighborhoods = df_trips.nlargest(10,['average_trips'])
top_10_neighborhoods
Out[18]:
dropoff_location_name average_trips
0 Loop 10727.466667
1 River North 9523.666667
2 Streeterville 6664.666667
3 West Loop 5163.666667
4 O'Hare 2546.900000
5 Lake View 2420.966667
6 Grant Park 2068.533333
7 Museum Campus 1510.000000
8 Gold Coast 1364.233333
9 Sheffield & DePaul 1259.766667
In [19]:
#visualizing number of trips to locations
(top_10_neighborhoods.groupby('dropoff_location_name')
 .agg({'average_trips': 'sum'})
 .plot(y='average_trips', kind='bar', grid=True, figsize=(10,5), cmap='PiYG')
)
plt.show()

Loop and River North are the most popular taxi destinations; the remaining leaders lag well behind them.

Graph: taxi companies and number of rides

In [20]:
#Ranging the companies by the number of trips
ranged_companies = df_company.groupby('company_name')['trips_amount'].max().sort_values(ascending=False).reset_index()
ranged_companies 
Out[20]:
company_name trips_amount
0 Flash Cab 19558
1 Taxi Affiliation Services 11422
2 Medallion Leasin 10367
3 Yellow Cab 9888
4 Taxi Affiliation Service Yellow 9299
... ... ...
59 2733 - 74600 Benny Jona 7
60 4053 - 40193 Adwar H. Nikola 7
61 5874 - 73628 Sergey Cab Corp. 5
62 2241 - 44667 - Felman Corp, Manuel Alonso 3
63 3556 - 36214 RC Andrews Cab 2

64 rows × 2 columns

In [21]:
#computing z-scores for trip counts to characterize the distribution and rank the companies
df_company['trips_z']=(df_company['trips_amount']-df_company['trips_amount'].mean())/df_company['trips_amount'].std()
df_company.head()
Out[21]:
company_name trips_amount trips_z
0 Flash Cab 19558 4.567445
1 Taxi Affiliation Services 11422 2.433306
2 Medallion Leasin 10367 2.156570
3 Yellow Cab 9888 2.030925
4 Taxi Affiliation Service Yellow 9299 1.876425
In [22]:
#distinguishing z-scores by color: red for negative values, purple for positive ones
df_company['colors'] = ['red' if x<0 else 'purple' for x in df_company['trips_z']]
df_company['colors']
 
plt.figure(figsize=(14,20))
plt.hlines(y=df_company.company_name, xmin=0, xmax=df_company.trips_z, colors=df_company.colors, alpha=0.4, linewidth=10)
plt.show()

The graph clearly separates the companies: the most successful taxi companies in Chicago in November 2017, with an above-average number of trips, point to the right (in purple), while the less successful ones point to the left (in red).

In [23]:
#Getting the top 16 taxi companies, marked in purple in the previous chart
top_16_companies = df_company.nlargest(16,['trips_amount'])
top_16_companies
Out[23]:
company_name trips_amount trips_z colors
0 Flash Cab 19558 4.567445 purple
1 Taxi Affiliation Services 11422 2.433306 purple
2 Medallion Leasin 10367 2.156570 purple
3 Yellow Cab 9888 2.030925 purple
4 Taxi Affiliation Service Yellow 9299 1.876425 purple
5 Chicago Carriage Cab Corp 9181 1.845473 purple
6 City Service 8448 1.653201 purple
7 Sun Taxi 7701 1.457257 purple
8 Star North Management LLC 7455 1.392729 purple
9 Blue Ribbon Taxi Association Inc. 5953 0.998742 purple
10 Choice Taxi Association 5015 0.752697 purple
11 Globe Taxi 4383 0.586919 purple
12 Dispatch Taxi Affiliation 3355 0.317266 purple
13 Nova Taxi Affiliation Llc 3175 0.270050 purple
14 Patriot Taxi Dba Peace Taxi Associat 2235 0.023481 purple
15 Checker Taxi Affiliation 2216 0.018497 purple
In [24]:
top_taxi = top_16_companies.company_name.tolist()
df_company['company_name_tree'] = [x if x in top_taxi else 'others' for x in df_company['company_name']]
df_company['company_name_tree'].value_counts()
Out[24]:
others                                  48
Yellow Cab                               1
Taxi Affiliation Services                1
Nova Taxi Affiliation Llc                1
Taxi Affiliation Service Yellow          1
Choice Taxi Association                  1
Star North Management LLC                1
City Service                             1
Dispatch Taxi Affiliation                1
Globe Taxi                               1
Chicago Carriage Cab Corp                1
Checker Taxi Affiliation                 1
Flash Cab                                1
Medallion Leasin                         1
Blue Ribbon Taxi Association Inc.        1
Sun Taxi                                 1
Patriot Taxi Dba Peace Taxi Associat     1
Name: company_name_tree, dtype: int64
In [25]:
# General market view by total number of trips
df_taxi_tree = df_company[['company_name_tree', 'trips_amount']].groupby('company_name_tree').sum().sort_values(by='trips_amount', ascending=False).reset_index()
df_taxi_tree=df_taxi_tree[df_taxi_tree['trips_amount']!=0]

sizes=df_taxi_tree.trips_amount.values.tolist()
labels = df_taxi_tree.apply(lambda x: str(x['company_name_tree']) + "\n" + str(round(x['trips_amount'])), axis=1)


colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]

# Draw Plot
plt.figure(figsize=(15,9), dpi= 80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)

# Decorate
plt.title('Market Structure of Taxi Companies in Chicago, November 2017')
#plt.axis('off')
plt.show()

The graphs show that Flash Cab is the undisputed leader of the Chicago market, with a market share almost twice that of its closest competitor, Taxi Affiliation Services. The rest of the top 8 companies have very similar market shares, with no distinct leaders among them. Then there is a slight gap, followed by another group of six companies closing out the top.
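To attach concrete numbers to these shares, a short follow-up sketch using the df_taxi_tree frame built above (share_pct is a column name introduced here for illustration):

#market share of each group as a percentage of all trips
df_taxi_tree['share_pct'] = (df_taxi_tree['trips_amount']
                             / df_taxi_tree['trips_amount'].sum() * 100).round(1)
print(df_taxi_tree[['company_name_tree', 'share_pct']].head())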

Graph: top 10 neighborhoods by number of drop-offs

In [26]:
fig,ax=plt.subplots(figsize=(13,10))
ax.vlines(x=top_10_neighborhoods.dropoff_location_name, ymin=0, ymax=top_10_neighborhoods.average_trips, color='purple', alpha=0.7, linewidth=2)
ax.scatter(x=top_10_neighborhoods.dropoff_location_name, y=top_10_neighborhoods.average_trips, s=75, color='lightgreen', alpha=0.7)


ax.set_title('Lollipop Chart for Top 10 Neighborhoods', fontdict={'size':16})
ax.set_ylabel('average number of rides')
ax.set_xticks(top_10_neighborhoods.dropoff_location_name)
ax.set_xticklabels(top_10_neighborhoods.dropoff_location_name,rotation=90, fontdict={'horizontalalignment':'right','size':14})

for row in top_10_neighborhoods.itertuples():
    ax.text(row.dropoff_location_name, row.average_trips+30,s=round(row.average_trips,2), rotation=90, fontdict={'horizontalalignment':'center','size':10})
In [27]:
#top 10 locations in comparison to the other 25
top_35_neighborhoods = df_trips.nlargest(35,['average_trips'])
top_35_neighborhoods['colors'] = ['blue' if x<1259 else 'green' for x in top_35_neighborhoods['average_trips']]
top_35_neighborhoods['colors']
 
plt.figure(figsize=(10,11))
plt.hlines(y=top_35_neighborhoods.dropoff_location_name, xmin=0, xmax=top_35_neighborhoods.average_trips, colors=top_35_neighborhoods.colors, alpha=0.4, linewidth=10)
plt.show()

The graph shows the top 10 drop-off destinations (in green); the remaining, less popular destinations are marked in blue. Clearly, people take taxis to parks less often than to pubs and restaurants.

The top 10 neighborhoods vary greatly in the number of drop-offs. The most popular destination was the Loop, with 10,727 average drop-offs, whereas Sheffield & DePaul received almost 9 times fewer, at about 1,260. The top 4 locations (Loop, River North, Streeterville, and West Loop) leave all the other localities far behind. Still, all the top spots are very popular destinations with plenty of eateries, fancy shops, theaters, and nightlife hotspots; O'Hare airport comes only fifth. Sheffield & DePaul, which closes the top 10, is a popular residential area among young professionals, many of whom rent there. Based on the most popular taxi destinations, we can assume that most taxi clients are tourists and young people.
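The "almost 9-fold" gap can be reproduced directly from top_10_neighborhoods; a quick sketch:

#ratio of the most to the least popular destination within the top 10
top = top_10_neighborhoods['average_trips']
print(round(top.max() / top.min(), 1))  #Loop vs Sheffield & DePaul, ~8.5x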

Testing hypotheses

In [28]:
#loading the dataset
df_airport = pd.read_csv('./project_sql_result_07.csv', sep=',')
df_airport.sample(7)
Out[28]:
start_ts weather_conditions duration_seconds
976 2017-11-25 10:00:00 Good 1584.0
218 2017-11-18 17:00:00 Bad 2100.0
784 2017-11-11 17:00:00 Good 2089.0
684 2017-11-04 17:00:00 Bad 2151.0
234 2017-11-04 05:00:00 Good 1620.0
117 2017-11-11 07:00:00 Good 1380.0
1011 2017-11-18 14:00:00 Good 3660.0
In [29]:
df_airport.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1068 entries, 0 to 1067
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   start_ts            1068 non-null   object 
 1   weather_conditions  1068 non-null   object 
 2   duration_seconds    1068 non-null   float64
dtypes: float64(1), object(2)
memory usage: 16.8+ KB

The dataset contains no missing values, but we need to convert 'start_ts' to the datetime type.

In [30]:
df_airport['start_ts'] = pd.to_datetime(df_airport['start_ts'], format='%Y-%m-%d %H:%M:%S') #changing data type from object to datetime
df_airport['start_ts'].head()
Out[30]:
0   2017-11-25 16:00:00
1   2017-11-25 14:00:00
2   2017-11-25 12:00:00
3   2017-11-04 10:00:00
4   2017-11-11 07:00:00
Name: start_ts, dtype: datetime64[ns]
In [31]:
df_airport['weather_conditions'].value_counts()
Out[31]:
Good    888
Bad     180
Name: weather_conditions, dtype: int64
In [32]:
#defining days of week
df_airport['dow'] = df_airport['start_ts'].dt.weekday
df_airport['dow'].value_counts()
Out[32]:
5    1068
Name: dow, dtype: int64

The dataset contains only Saturdays (weekday 5).

In [33]:
#statistical summary
df_airport.describe()
Out[33]:
duration_seconds dow
count 1068.000000 1068.0
mean 2071.731273 5.0
std 769.461125 0.0
min 0.000000 5.0
25% 1438.250000 5.0
50% 1980.000000 5.0
75% 2580.000000 5.0
max 7440.000000 5.0


In [34]:
df_airport.hist('duration_seconds', bins=30)

plt.show()

The distance between the Loop and O'Hare Airport is about 30 km, so we can drop outliers. A trip to ORD could not take less than 800 seconds (that would imply an average speed of about 135 km/h, while the speed limit in the area is 112 km/h). On the other hand, a trip could not take more than 5000 seconds (an average speed of roughly 22 km/h).
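The arithmetic behind these cutoffs, as a small sketch (assuming the ~30 km distance stated above):

#implied average speeds for the chosen duration cutoffs
distance_km = 30
for duration_s in (800, 5000):
    speed_kmh = distance_km / duration_s * 3600
    print(duration_s, 's ->', round(speed_kmh), 'km/h')
#800 s -> 135 km/h (implausibly fast), 5000 s -> 22 km/h (implausibly slow)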

In [35]:
df_airport_no_outliers = df_airport.query('duration_seconds>=800 and duration_seconds<=5000')
df_airport_no_outliers.hist('duration_seconds', bins=30)

plt.show()

There are two strong peaks: around 1300 s and 2400 s. At this point, this can be explained by different road conditions: some trips were almost twice as fast as others.
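One way to probe this explanation is to overlay the duration histograms by weather condition: if bad weather drives the slower peak, the 'Bad' distribution should sit to the right. An exploratory sketch using df_airport_no_outliers:

#overlaying normalized duration histograms for good and bad weather
for condition, color in (('Good', 'green'), ('Bad', 'red')):
    subset = df_airport_no_outliers.query('weather_conditions == @condition')
    plt.hist(subset['duration_seconds'], bins=30, alpha=0.5,
             color=color, label=condition, density=True)
plt.legend()
plt.xlabel('duration, seconds')
plt.show()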

Stating the hypotheses (H0 must state an unchanged result):

H0: The average duration of rides from the Loop to O'Hare International Airport stays the same on rainy Saturdays.

H1: The average duration of rides from the Loop to O'Hare International Airport changes on rainy Saturdays.

In [36]:
#creating two dataframes: rainy and sunny Saturdays
df_rainy = df_airport_no_outliers.query('weather_conditions == "Bad"')
df_rainy.shape
Out[36]:
(178, 4)
In [37]:
df_sunny = df_airport_no_outliers.query('weather_conditions == "Good"')
df_sunny.shape
Out[37]:
(878, 4)
In [38]:
#Getting rid of outliers by defining an upper fence for each trip group

#defining outliers with the 3-sigma method for both dataframes
#as we are calculating std for a sample, ddof is set to 1

std_score_rainy = np.std(df_rainy['duration_seconds'], ddof=1)

three_sigma_score_rainy_upper = round((df_rainy['duration_seconds'].mean() + std_score_rainy*3),2)



std_score_sunny = np.std(df_sunny['duration_seconds'], ddof=1)

three_sigma_score_sunny_upper = round((df_sunny['duration_seconds'].mean() + std_score_sunny*3),2)

print('99.7% of trips on rainy Saturdays last not more than ', three_sigma_score_rainy_upper,'seconds. \n99.7% of trips on sunny Saturdays last up to ', three_sigma_score_sunny_upper,'seconds.')
99.7% of trips on rainy Saturdays last not more than  4533.7 seconds. 
99.7% of trips on sunny Saturdays last up to  4111.24 seconds.
In [39]:
#setting df without outliers

df_rainy_no_outliers = df_rainy.query('duration_seconds<=@three_sigma_score_rainy_upper')

df_sunny_no_outliers = df_sunny.query('duration_seconds<=@three_sigma_score_sunny_upper')
In [40]:
df_rainy_no_outliers.shape
Out[40]:
(177, 4)
In [41]:
df_sunny_no_outliers.shape
Out[41]:
(874, 4)
In [42]:
#estimating the variance of the two samples to decide whether they can be treated as equal for the t-test
#note: np.var defaults to ddof=0 (population variance), which is fine for this rough comparison
variance_rainy = np.var(df_rainy_no_outliers['duration_seconds'])
print('Variance for trips to airport on rainy Saturdays is ', variance_rainy)
Variance for trips to airport on rainy Saturdays is  446454.76746784133
In [43]:
variance_sunny = np.var(df_sunny_no_outliers['duration_seconds'])
print('Variance for trips on sunny Saturdays is ', variance_sunny)
Variance for trips on sunny Saturdays is  463236.5557760674


Explanation of the method choice for hypothesis testing

We have two independent samples of continuous data drawn from roughly normally distributed populations with different variances, so we will conduct Welch's test (in Python, a two-tailed t-test with the correction equal_var=False).

The critical statistical significance level is set at 0.05: it is a commonly accepted level, and since this is not medical testing, higher accuracy is not required.
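For transparency, the Welch statistic can also be computed by hand from its textbook formula, t = (x̄1 − x̄2) / sqrt(s1²/n1 + s2²/n2); this sketch should match scipy's result in the next cell:

#Welch's t computed manually from sample means and variances (ddof=1)
x1 = df_rainy_no_outliers['duration_seconds']
x2 = df_sunny_no_outliers['duration_seconds']
se = np.sqrt(x1.var(ddof=1)/len(x1) + x2.var(ddof=1)/len(x2))
t_manual = (x1.mean() - x2.mean()) / se
print('t =', round(t_manual, 3))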

In [44]:
alpha = .05 
# "equal_var = False" as the previous calculations showed that the two samples have different variances. 
results = st.ttest_ind(df_rainy_no_outliers['duration_seconds'], df_sunny_no_outliers['duration_seconds'], equal_var = False) 

print('p-value:', results.pvalue) 
#this is a two-tailed test: we check whether the average travel time to the airport depends on weather conditions, 
#no matter whether it gets longer or shorter, so the full (unhalved) p-value is used
 
if (results.pvalue < alpha):
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis")
p-value: 5.955905171636982e-14
We reject the null hypothesis

The data provides sufficient evidence, at the significance level we selected (5%), to reject the null hypothesis. Therefore, we can conclude that the average duration of rides from the Loop to O'Hare International Airport changes on rainy Saturdays (μ1 ≠ μ2).
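The test tells us the means differ, but not by how much or in which direction; a quick follow-up sketch on the same frames quantifies that:

#direction and size of the difference in mean trip duration
mean_rainy = df_rainy_no_outliers['duration_seconds'].mean()
mean_sunny = df_sunny_no_outliers['duration_seconds'].mean()
print('rainy Saturdays mean, s:', round(mean_rainy))
print('sunny Saturdays mean, s:', round(mean_sunny))
print('difference, s:', round(mean_rainy - mean_sunny))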

Conclusion

We looked into three data sets, identified the market leaders among taxi companies as of November 2017 and the popular taxi destinations, and tested a hypothesis.
As to the taxi companies, we found that there are over 60 companies on the market and that Flash Cab is the undisputed leader in Chicago.
We also examined the 10 most popular destinations for taxi rides in the area. It came as no surprise that night clubs, restaurants, and the airport are far more popular taxi destinations than parks and museums.
Finally, we tested the hypothesis that weather conditions influence the average ride duration to the airport, and concluded that the average ride time in good weather differs from that in bad weather. Since we examined only Saturday trips, the influence of weekday traffic congestion is close to zero.