Below is market research based on open-source data on restaurants in LA, with the purpose of opening a small robot-run cafe in Los Angeles.
The main aim is to study the current market conditions.
First we will preprocess the data on restaurants and other eateries in LA.
Then we will study the data: identify the most and the least popular restaurant streets, study the current market structure (establishment types and their shares), and examine how the number of seats varies depending on the type and location of a restaurant.
Finally we will draw some conclusions.
Part 1. Data preprocessing
Part 2. Data analysis
Proportions of the various types of establishments
Proportions of chain and non-chain establishments
Which type of establishment is typically a chain
What characterizes chains: many establishments with a small number of seats or a few establishments with a lot of seats?
Average number of seats for each type of restaurant
Street names
Top ten streets by number of restaurants
Streets that only have one restaurant
Streets with a lot of restaurants: distribution of the number of seats. Trends.
Overall conclusion and recommendations on restaurant type and number of seats. Possibility of developing a chain.
import pandas as pd
from IPython.display import display
import plotly.express as px
from plotly import graph_objects as go
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
from scipy import stats as st
import seaborn as sns
import re
import sys
import warnings

if not sys.warnoptions:
    warnings.simplefilter("ignore")
df = pd.read_csv('./rest_data_us.csv', sep=',')
display(df.head())
display(df.sample(6))
df.info(memory_usage='deep')
df['object_name'].value_counts()
df['object_type'].value_counts()
df['chain'].value_counts()
#Checking for duplicated data
df.duplicated().sum()
#statistical summary of the number of seats
df['number'].describe()
df['number'].hist(bins=30, log=True)
plt.show()
#Calculating the 95th and 99th percentiles for the number of seats in an establishment.
np.percentile(df['number'], [95, 99])
#statistical summary for categorical variables
df.describe(include=['object'])
#looking into establishments at the most popular address:
pop_address = df.query('address == "3607 TROUSDALE PKWY"')
pop_address
df['object_name'].value_counts().count()
#Checking data for zeros:
for i in df.columns:
    print(i, len(df[df[i] == 0]))
# defining columns with missing values
missing_list = []
for x in df:
    if len(df[df[x].isnull()]) > 0:
        missing_list.append(x)
print('The following columns have missing values: ', missing_list)
# calculating the percentage of missing values in every column
missing_percentage = []
for x in missing_list:
    missing_percentage.append([x, (len(df[df[x].isnull()]) / len(df) * 100)])
missing_percentage = pd.DataFrame(missing_percentage, columns=['column', 'missing_values_percentage%'])
missing_percentage.sort_values(by=['missing_values_percentage%'], ascending=False)
#checking the rows without data in the 'chain' column:
df.loc[df['chain'].isnull()]
We have determined that missing values occur only in the 'chain' column. The share of this data is low (0.03%). As it is far below 2%, we can drop these rows without skewing further calculations.
# dropping the lines with missing values
df = df.dropna().reset_index(drop=True)
#df=df[df.notnull().all(axis=1)]
#checking that NaNs are dropped:
df[df.isnull().any(axis=1)] #displays at least one missing value in any column
Next we reduce memory usage by changing some data types from object to category; the check below shows the effect.
#Changing data types from object to category to lower memory usage.
#Note: we convert in place rather than re-reading the CSV, which would undo the dropna above.
df['object_type'] = df['object_type'].astype('category')
df['chain'] = df['chain'].astype('category')
df.info(memory_usage='deep')  #checking how much memory is used now
The data for analysis contains 6 columns with 9651 entries.
There are 8672 unique names (out of 9651) of restaurants/cafes and 8517 unique addresses (out of 9651). The biggest chain (the most frequently encountered name) is 'THE COFFEE BEAN & TEA LEAF' with 47 establishments, and the address with the largest number of establishments (11 of them) is 3607 TROUSDALE PKWY.
There are 6 types of establishments in the data: Restaurant, Fast Food, Cafe, Pizza, Bar, Bakery.
The number of seats varies from 1 to 229 with a mean value of 44 seats. The middle 50% of establishments have from 14 to 46 seats. No more than 5% of establishments have over 148 seats and only 1% have over 214 seats.
No duplicates have been found in the data set.
The data also indicates whether an establishment belongs to a chain. We had 3 establishments with missing values here and dropped these rows without harm to the rest of the data, as the share of missing values was only about 0.03%.
We have changed the data types of the 'object_type' and 'chain' columns from object to category to reduce memory usage (it dropped from 2.4 Mb to 1.8 Mb).
As we figured out in the previous part, there are 6 types of establishments in the region. Below is a graphical interpretation of the data.
#creating a table
df_types = df.groupby(['object_type']).agg({'id': 'count'}).reset_index()
#renaming columns
df_types.columns = ['object_type' ,'number']
df_types.head()
#building a bar chart
sns.set_palette("Set2")
plt.title('Number of different types of establishments')
ax = sns.barplot(x="object_type", y="number", data=df_types)
for i, v in df_types['number'].items():
    ax.text(i, v, "{:,}".format(v), color='k', va='bottom', rotation=45)
plt.tight_layout()
#pie chart for types of establishments
labels = df_types['object_type']
values = df_types['number']
fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.update_layout(title_text='Market Shares of different types of establishments')
fig.show()
We can see that the largest market share belongs to restaurants: they occupy 75% of the market. Fast food is the second-largest type but is about 7 times smaller than the leader, with a market share of 11%. The other types (cafe, pizza, bar and bakery) have minor shares, from 4.5% down to 2.9% of the establishments.
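As a quick cross-check of the pie chart, the exact shares can be computed directly from the df_types table built above (a one-line sketch):
#exact market shares (%) behind the pie chart
(df_types.set_index('object_type')['number'] / df_types['number'].sum() * 100).round(1)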
As we have already seen, there are just two categories here: whether the establishment belongs to a chain or not. The market shares for these two groups look as follows:
df_chains = df.groupby(['chain']).agg({'id': 'count'}).reset_index()
df_chains.columns = ['chain' ,'number']
#df_chains['chain'] = df_chains['chain'].astype(str) #changing data type
#df_chains['chain'].replace({"False": "Not Chain", "True": "Is Chain"}, inplace=True) #changing value names
df_chains.head()
#pie chart for types of establishments referring to a chain
labels = df_chains['chain']
values = df_chains['number']
fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.update_layout(title_text='Market Shares of Establishments')
fig.show()
The biggest market share belongs to establishments that do not belong to any chain and are independent, original and distinctive. Still, chains occupy a substantial share of the market (38%).
We will create two data sets, for chains and non-chains, and investigate the proportions.
#data set for non-chains
df_types_no_chains = df.query('chain == False').groupby(['object_type']).agg({'id': 'count'}).sort_values(by='id', ascending=False).reset_index()
df_types_no_chains
#data set for chains
df_types_chains = df.query('chain == True')
df_types_chains = df_types_chains.groupby(['object_type']).agg({'id': 'count'}).reset_index()
#renaming columns
df_types_chains.columns = ['object_type' ,'number']
df_types_chains.sort_values(by='number', ascending=False).reset_index(drop=True)
#visualizing the absolute values
sns.set_palette("Set2")
plt.title('Number of different types of establishments (chains)')
ax = sns.barplot(x="object_type", y="number", data=df_types_chains)
for i, v in df_types_chains['number'].items():
    ax.text(i, v, "{:,}".format(v), color='k', va='bottom', rotation=45)
plt.tight_layout()
#pie charts for market structure / types of establishments: chains vs non-chains (visualizing the relative values)
from plotly.subplots import make_subplots
labels1 = df_types_chains['object_type']
labels2 = df_types_no_chains['object_type']
value1 = df_types_chains['number']
value2 = df_types_no_chains['id']
fig = make_subplots(1, 2, specs=[[{'type': 'domain'}, {'type': 'domain'}]],
                    subplot_titles=['Chains', 'Non-chains'])
fig.add_trace(go.Pie(labels=labels1, values=value1, scalegroup='one',
                     name="Chains"), 1, 1)
fig.add_trace(go.Pie(labels=labels2, values=value2, scalegroup='one',
                     name="Non-chains"), 1, 2)
fig.update_layout(title_text='Market shares of different types of establishments: chains vs non-chains')
fig.show()
An alternative chart is below.
#comparing absolute values
plt.figure(figsize=(10, 8))
splot = sns.countplot(x="object_type", hue="chain", data=df)
plt.xlabel('Type of establishments', size=14)
plt.title("Shares and number of different types of establishments: chains vs non-chains", size=18)
plt.ylabel('Number of establishments', size=14)
for p in splot.patches:
    splot.annotate(format(p.get_height(), '.0f'),
                   (p.get_x() + p.get_width() / 2., p.get_height()),
                   ha='center', va='center',
                   size=12,
                   xytext=(0, -12),
                   textcoords='offset points')
plt.show()
#Calculating the percentage taking the whole market as 100%
plt.figure(figsize=(10, 8))
sns.set_palette("Set2")
plt.title('Shares and number of different types of establishments: chains vs non-chains', size=14)
ax = sns.countplot(x="object_type", hue="chain", data=df)
plt.ylabel('Number of establishments', size=12)

def with_hue(plot, feature, number_of_categories, hue_categories):
    a = [p.get_height() for p in plot.patches]
    patch = [p for p in plot.patches]
    for i in range(number_of_categories):
        total = float(len(df))  #the whole market is taken as 100%
        for j in range(hue_categories):
            percentage = '{:.1f}%'.format(100 * a[j * number_of_categories + i] / total)
            x = patch[j * number_of_categories + i].get_x() + patch[j * number_of_categories + i].get_width() / 2 - 0.15
            y = patch[j * number_of_categories + i].get_y() + patch[j * number_of_categories + i].get_height()
            plot.annotate(percentage, (x, y), size=12)

with_hue(ax, df.object_type, 6, 2)
plt.show()
The absolute number of chain establishments is lower than the absolute number of original ones. But if we look at relative values we can see that the market structure differs.
Restaurants hold the leading position both among chains and non-chains, although their market share among chains is significantly lower than among non-chains (62% vs 83%).
The share of chain fast food establishments is twice as big as among non-chains (16.5% vs 8%).
The shares of chain bakeries and chain cafes are approximately equal (7-8%), whereas non-chain bakeries are almost extinct in LA, and a non-chain cafe is also a rare find (only 3% of the market).
Pizza places have approximately the same share among chains and non-chains (3-4%).
The least popular type of chain establishment is the bar, with only 2% of the market share.
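The same structure can be read off as a table (a sketch; normalize='columns' in pd.crosstab gives the per-type shares within chains and within non-chains separately):
#per-type shares (%) within chains and within non-chains
(pd.crosstab(df['object_type'], df['chain'], normalize='columns') * 100).round(1)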
We will look separately at the distribution of the number of establishments per chain and at the distribution of the average number of seats per establishment. Then we will combine these distributions to draw some conclusions.
df_chain = df.query('chain==True')
#df_chain = df_chain.drop(['address', 'chain', 'object_type'], axis=1)
#df_chain = df_chain.groupby('object_name')['id'].count()['number'].median()
df_chain = df_chain.groupby('object_name').agg({'id': ['count'], 'number':['median']}).reset_index()
df_chain.columns = ['chain_name','number_of_spots', 'seats_average']
df_chain.sample(5)
#plotting the first distribution
df_chain['number_of_spots'].hist(bins=30, log=True)
plt.title('Number of establishments per chain - distribution')
plt.xlabel('number of establishments')
plt.ylabel('number of chains (log scale)')
plt.show()
The histogram shows that most of the distribution lies within the interval between 1 and 15 spots per chain. We can also see some outliers with over 30 establishments within a chain. We will examine them.
#Calculating the 95th and 99th percentiles for the number of establishments per chain (setting outlier thresholds).
np.percentile(df_chain['number_of_spots'], [95, 99])
Not more than 5% of chains contain over 3 spots and only 1% of chains include over 6 establishments. Consequently chains with over 6 spots can be viewed as outliers.
df_chain.describe()
#counting outliers
df_chain.query('number_of_spots>6').count()
#checking outliers
df_chain.query('number_of_spots>6').sort_values(by='number_of_spots', ascending=False)
We have 23 chains with more than 6 establishments. The names of these chains are well known, so no surprises here.
df_chain.query('number_of_spots==1')
There are 2240 chains with only one establishment.
#plotting the distribution of the average number of seats per establishment
sns.histplot(df_chain['seats_average'], bins=30, kde=True, stat='density')
plt.title('Distribution of average number of seats per establishment')
plt.xlabel('Average number of seats')
plt.ylabel('Density')
plt.show()
#plotting the distribution of the average number of seats per establishment, excluding outliers
df_chain_no_outliers = df_chain.query('1 < number_of_spots < 6')
sns.histplot(df_chain_no_outliers['seats_average'], bins=30, kde=True, stat='density')
plt.title('Distribution of average number of seats per establishment (without outliers)')
plt.xlabel('Average number of seats')
plt.ylabel('Density')
plt.show()
Distribution looks almost the same.
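To back this impression up quantitatively (a sketch; st is scipy.stats as imported at the top), a two-sample Kolmogorov-Smirnov test can compare the two seat distributions; a large p-value is consistent with them looking the same:
#two-sample KS test: seats_average with vs without outlier chains
ks_stat, p_value = st.ks_2samp(df_chain['seats_average'], df_chain_no_outliers['seats_average'])
print('KS statistic: {:.3f}, p-value: {:.3f}'.format(ks_stat, p_value))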
#Calculating the 95th and 99th percentiles for the average number of seats per establishment.
np.percentile(df_chain['seats_average'], [95, 99])
No more than 5% of chains have over 142 seats on average, and only 1% of chains have over 205 seats.
#combining the distributions
g = sns.jointplot(x="number_of_spots", y="seats_average", data=df_chain)
g.fig.suptitle('Paired distribution of seats per establishment and establishments per chain', y=1.02)
plt.show()
#distributions without outliers
g = sns.jointplot(x="number_of_spots", y="seats_average", data=df_chain_no_outliers)
g.fig.suptitle('Paired distribution of seats per establishment and establishments per chain (without outliers)', y=1.02)
plt.show()
There are lots of chains on the market (38% of all establishments), but the chains are mostly small: only 23 chains (1% of them) have over 6 establishments, and only 5% of chains have over 3 establishments. There are 2240 chains with only 1 establishment; this might be a labeling error in the data, or those chains may have only one branch in LA.
The distribution of the number of seats looks closer to normal. The middle 50% of establishments have from 14 to 45 seats, and the average number of seats is 26. No more than 5% of chains have over 142 seats on average.
The current market situation looks as follows: chains are small, and most of them have only one establishment in LA; there are lots of average-sized chain establishments (26 seats on average); only 2 chains have over 30 establishments. A typical profile for a chain in LA would be 1-2 branches with 26 seats on average.
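As a sanity check of this profile, the medians can be read directly from the df_chain table built above (a sketch; medians are robust to the heavy right tail seen earlier):
#deriving the "typical chain" profile with robust statistics
print('Median establishments per chain:', df_chain['number_of_spots'].median())
print('Median of average seats per establishment:', df_chain['seats_average'].median())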
Firstly we will have a look at the average number of seats in all establishments.
plt.title('Average (mean) number of seats for each type of establishment')
ax = sns.barplot(x="object_type", y="number", data=df)  #the default estimator is the mean
for p in ax.patches:
    ax.annotate(format(p.get_height(), '.1f'),
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center',
                xytext=(0, 9),
                textcoords='offset points')
Clearly restaurants are the largest establishments, with 48 seats on average. The smallest are bakeries (22 seats on average) and cafes (25 seats).
Now we will check the median number of seats for all establishments (to get rid of the influence of outliers).
from numpy import median
plt.title('Median number of seats for each type of establishment')
ax = sns.barplot(x="object_type", y="number", data=df, estimator=median)  #median value
for p in ax.patches:
    ax.annotate(format(p.get_height(), '.1f'),
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center',
                xytext=(0, 9),
                textcoords='offset points')
With the influence of outliers removed, the numbers for the leaders dropped: restaurants now have a median of 29 seats. The figure for bars is almost unchanged. The median number of seats across all types of establishments ranges from 18 to 29.
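The medians behind the chart can also be listed directly (a one-line sketch):
#median number of seats per establishment type
df.groupby('object_type')['number'].median().sort_values(ascending=False)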
Now we will check the difference in average number of seats in chains in comparison to non-chain establishments.
#plotting the bar chart for chains vs original establishments
plt.figure(figsize=(10, 8))
plt.title('Median number of seats for each type of establishment')
ax = sns.barplot(x="object_type", y="number", data=df, hue='chain', estimator=median)
for p in ax.patches:
    ax.annotate(format(p.get_height(), '.1f'),
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center',
                xytext=(0, 9),
                textcoords='offset points')
plt.xlabel("Types of establishments", size=14)
plt.ylabel("Number of seats", size=14)
plt.show()
Looking into the restaurant data separately:
df_rest = df.query('object_type=="Restaurant"')
sns.set(style="whitegrid")
sns.color_palette("deep")
plt.figure(figsize=(10, 3))
plt.title('Median number of seats for restaurants')
ax = sns.barplot(x="number", y="object_type", data=df_rest, hue='chain', estimator=median)
for p in ax.patches:
    width = p.get_width()
    plt.text(5 + p.get_width(), p.get_y() + 0.55 * p.get_height(),
             '{:1.2f}'.format(width),
             ha='center', va='center')
plt.xlabel("Number of seats", size=14)
plt.ylabel("Type of restaurant", size=14)
plt.show()
There is almost no difference in the median number of seats between chain and non-chain cafes and restaurants, whereas for the other types the difference is quite big. The most dramatic differences are in the capacity of bakeries and bars: chain bakeries are much bigger (by 18 seats on average), while chain bars are much smaller (by 12 seats on average).
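The same comparison can be read as a table (a sketch; unstack puts the chain and non-chain medians side by side):
#median seats per type: chains vs non-chains
df.groupby(['object_type', 'chain'])['number'].median().unstack('chain')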
In this section we add a column containing only the street name, getting rid of the additional information in the establishments' addresses.
#Filtering out the street name with usaddress library
#importing library
!pip install usaddress
import usaddress
def cleaning_check(raw):
    raw_address = usaddress.parse(raw)
    dict_address = {}
    for i in raw_address:
        dict_address.update({i[1]: i[0]})
    #checking for the normal case with a street name and a street type
    if 'StreetName' in dict_address.keys() and 'StreetNamePostType' in dict_address.keys():
        clean_address = str(dict_address['StreetName']) + " " + str(dict_address['StreetNamePostType'])
        return clean_address
    else:
        return 'no street or street type'

df['street_check'] = df.address.apply(cleaning_check).reset_index(drop=True)
df.head()
#checking for lines that got no street name through cleaning_check function
df[df['street_check']=='no street or street type']
#parsing some of the problem addresses (and a normal one for comparison) to investigate the root of the problem
usaddress.parse(df.iloc[1888]['address'])
usaddress.parse(df.iloc[9622]['address'])
usaddress.parse(df.iloc[1]['address'])
def cleaning_final(raw):
    #special cases found above are handled explicitly
    if raw.startswith('OLVERA'):
        clean_address = 'OLVERA,Los Angeles,USA'
    elif raw.startswith('1033 1/2 LOS ANGELES'):
        clean_address = '1033 1/2 LOS ANGELES ST,Los Angeles,USA'
    else:
        raw_address = usaddress.parse(raw)
        dict_address = {}
        for i in raw_address:
            dict_address.update({i[1]: i[0]})
        if 'StreetName' in dict_address.keys() and 'StreetNamePostType' in dict_address.keys():
            clean_address = str(dict_address['StreetName']) + " " + str(dict_address['StreetNamePostType'])
        else:
            clean_address = str(dict_address['StreetName'])
    return clean_address

df['street_name'] = df.address.apply(cleaning_final).reset_index(drop=True)
df.sample(10)
#check for corrections in the problem addresses
df.query('street_name == "OLVERA ST"')
#check for corrections in the problem addresses
df.query('street_name == "LOS ANGELES ST"')
#dropping extra column:
df = df.drop('street_check', axis=1).reset_index(drop=True)
df.head()
df['street_name'].value_counts()
We have got a list of 529 unique street names and formed a separate column 'street_name'.
df['street2'] = (df['address']
.replace('(^[0-9]+)','',regex=True)
.replace('[0-9 #]+$','',regex=True)
.replace('[-#.&,]','',regex=True)
.replace('["\\/]','',regex=True)
.replace('(^| ).(( ).)*( |$)','',regex=True)
.replace('( ){2}','',regex=True)
.replace('^ *','',regex=True)
.replace(['FLR', 'FL', 'STE', 'SC', 'UNIT', 'APT', 'CPCE', 'MEZZ', 'LOBBY', 'LBBY', 'ROOM', 'LOWR', 'BLDG', 'AU', '101C', '2B'],'',regex=True)
)
df['street2'].value_counts()
#Check up for the long street name
df.query('street2=="ECHO PARK AVE"')
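Before settling on one method, we can check how often the two street-extraction approaches agree (a rough sketch; this is an exact string comparison after stripping whitespace, so minor formatting differences count as disagreement):
#share of rows where the two street-extraction methods agree exactly
(df['street_name'].str.strip() == df['street2'].str.strip()).mean()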
#As I have chosen to proceed with the 1st method (column 'street_name'), I will drop the extra column:
df = df.drop('street2', axis=1).reset_index(drop=True)
df.head()
We will define the most popular streets for all establishments first, and then the most popular ones for restaurants.
streets_by_est = df.groupby('street_name').agg({'id': ['count'], 'number': ['median']}).reset_index()
#renaming columns
streets_by_est.columns = ['street_name','establishments', 'avg_seats' ]
streets_by_est['avg_seats'] = streets_by_est['avg_seats'].astype(int)
streets_by_est.sort_values(by='establishments', ascending=False).head(10)
#saving the top 10 streets by number of establishments to a separate data frame
top_10_streets = streets_by_est.sort_values(by='establishments', ascending=False).head(10)
top_10_streets
plt.figure(figsize=(9, 5))
#plotting the bar chart
ax = sns.barplot(x=top_10_streets.street_name, y=top_10_streets.establishments, palette='mako')
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.set(xlabel="Streets", ylabel='Number of establishments')
for item in ax.get_xticklabels():
    item.set_rotation(90)
for i, v in enumerate(top_10_streets["establishments"]):
    ax.text(i, v, "{:,}".format(v), color='k', va='bottom', rotation=45)
plt.title('Top 10 streets with the biggest number of establishments')
plt.show()
We have got the 10 most popular streets with the biggest number of establishments.
Now we will check the most popular streets for restaurants.
top_10_streets['street_name'].unique()
streets_by_restaurants = df.query('object_type=="Restaurant"').groupby('street_name').agg({'id': ['count'], 'number': ['median']}).reset_index()
streets_by_restaurants.columns = ['street_name','establishments', 'avg_seats' ] #renaming columns
streets_by_restaurants['avg_seats'] = streets_by_restaurants['avg_seats'].astype(int)
streets_by_restaurants.sort_values(by='establishments', ascending=False).head(10)
We have got the same list of streets, with slight differences in ranking.
We can assume that this is the list of the most popular and upscale streets in LA, where most establishments are restaurants.
These streets represent the most highly competitive areas in LA.
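The claim that the two lists match can be verified directly (a sketch comparing the two top-10 street sets):
#do the top 10 streets by all establishments and by restaurants coincide?
top10_rest_streets = (streets_by_restaurants
                      .sort_values(by='establishments', ascending=False)
                      .head(10)['street_name'])
set(top_10_streets['street_name']) == set(top10_rest_streets)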
#streets with only one establishment
one_establishment_str = streets_by_est.query('establishments==1').count()
one_establishment_str
#streets with one restaurant:
one_rest_str = streets_by_restaurants.query('establishments==1').count()
one_rest_str
According to our data, there are 238 streets with only one establishment and 227 streets with exactly one restaurant. In other words, if a street has only one establishment, it will most likely be a restaurant.
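A rough estimate of that share (a sketch; note it is approximate, since a street with exactly one restaurant may also host establishments of other types):
#approximate share of single-establishment streets whose venue is a restaurant
n_single = (streets_by_est['establishments'] == 1).sum()
n_single_rest = (streets_by_restaurants['establishments'] == 1).sum()
print('{:.0%}'.format(n_single_rest / n_single))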
First we will look at the distribution of the number of establishments per street to define what counts as a "street with a lot of restaurants".
streets_by_est[['establishments', 'avg_seats']].describe()
#Calculating the 85th, 95th and 99th percentiles for the number of establishments per street.
np.percentile(streets_by_est['establishments'], [85, 95, 99])
#plotting the distribution
plt.figure(figsize=(9, 5))
sns.histplot(streets_by_est['establishments'], kde=True, stat='density')
plt.title('Number of establishments per street')
plt.show()
We can see that the middle 50% of streets have 1-9 establishments. Only 15% of streets have over 23 establishments and only 5% have over 90. To define the point that means "a lot" we will perform additional calculations.
We will measure how popular a street is relative to the mean number of establishments registered there. To do that, we will calculate the z-score for the distribution: a numerical measurement of a value's relationship to the mean of a group of values. If the z-score is 0, the number of establishments on the street is identical to the mean value.
#calculating the z-score for the number of establishments per street
streets_by_est['establishments_z']=(streets_by_est['establishments']-streets_by_est['establishments'].mean())/streets_by_est['establishments'].std()
streets_by_est= streets_by_est.sort_values(by='establishments', ascending=False).reset_index(drop=True)
streets_by_est
Below is a chart around the breaking point. There are many more streets on both sides; this chart just illustrates the idea of z-score ranking.
#distinguishing z-score by colors: red for negative values, blue for positive ones
streets_by_est_z = streets_by_est.query('-0.1 <=establishments_z<=0.1').sort_values(by='establishments', ascending=False).reset_index(drop=True)
streets_by_est_z['colors'] = ['red' if x<0 else 'blue' for x in streets_by_est_z['establishments_z']]
plt.figure(figsize=(14,10))
plt.hlines(y=streets_by_est_z.street_name, xmin=0, xmax=streets_by_est_z.establishments_z, colors=streets_by_est_z.colors, alpha=0.4, linewidth=10)
plt.title('Defining streets with a lot of restaurants')
plt.xlabel("z-score", size=14)
plt.show()
Now that we have defined the breaking point (z = 0; establishments = 22), we will form a new data frame containing only the streets with a lot of restaurants.
#filtering out streets with many restaurants:
streets_many_rest = streets_by_est.query('establishments_z>0')
streets_many_rest.head()
Now we will move on to studying the distribution of seats.
streets_many_rest.describe()
#Calculating the 85th, 95th and 99th percentiles for the average number of seats per street.
np.percentile(streets_many_rest['avg_seats'], [85, 95, 99])
sns.histplot(streets_many_rest['avg_seats'], bins=10, kde=True, stat='density')
plt.title('Distribution of the number of seats in restaurants on the highly competitive streets')
plt.show()
The general distribution of the number of seats looks as follows: the average is 26; the middle 50% of restaurants have from 21 to 31 seats; the maximum is 100 seats. Only 5% of restaurants have more than 36 seats. There are no restaurants with fewer than 14 seats on the popular streets.
Now we will use the qcut() method to split the streets into 5 equal-sized groups.
pd.qcut(streets_many_rest['establishments'], 5)
#splitting streets into 5 categories (bin edges taken from the qcut output above)
def cat(est):
    if est <= 26:
        return '19-26'
    if est <= 42:
        return '27-42'
    if est <= 74:
        return '43-74'
    if est <= 118:
        return '75-118'
    return 'over 118'

streets_many_rest['establishments_number'] = streets_many_rest['establishments'].apply(cat)
streets_many_rest.head()
Now we have 5 groups of streets, and we will investigate the distribution of seats for each group.
plt.figure(figsize=(22,8))
sns.stripplot(x="establishments_number", y="avg_seats", data=streets_many_rest)
plt.title('Scatter plot of average seats per street-size group', size=14)
plt.show()
plt.figure(figsize=(22,8))
sns.violinplot(x="establishments_number", y="avg_seats", data=streets_many_rest, palette='rainbow')
plt.show()
There is no clear difference between the groups in the number of seats. The biggest difference is between the restaurants on the streets with 19-26 establishments (a distribution with a long right tail and more outliers; the largest restaurants are in this category) and the most popular streets with over 118 restaurants (a tighter distribution, from 22 to 42 seats).
The streets with 43-74 establishments tend to have restaurants with above-average capacity (28-34 seats).
est_seats = streets_many_rest.drop(['establishments_z'], axis=1)  #dropping the extra column
g = sns.pairplot(est_seats, hue='establishments_number')
g.fig.suptitle('Paired distributions of average number of seats on different streets', y=1.02)
plt.show()
#pair plot without the 3rd parameter
est_seats1 = streets_many_rest.drop(['establishments_z', 'establishments_number'], axis=1)
g = sns.pairplot(est_seats1)
g.fig.suptitle('Paired distributions of average number of seats on different streets', y=1.02)
plt.show()
The paired distributions confirm that the more establishments there are on a street, the closer they are to an average size.
The smallest and the largest restaurants are on the streets with 19-26 establishments.
This can be viewed as a general trend.
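This trend can be quantified (a sketch: if it holds, the spread of avg_seats should shrink as the street-size group grows):
#mean and spread of average seats within each street-size group
streets_many_rest.groupby('establishments_number')['avg_seats'].agg(['mean', 'std'])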
We have looked into the data on restaurants in LA and found the following:
- 75% of establishments are restaurants and 11% are fast food; bakeries are the least popular type of establishment
- 38% of establishments belong to a chain
- the market structure differs for chains and non-chains (see the charts above)
- lots of chains have only 1 establishment in LA
- a typical profile for a chain in LA would be 1-2 branches with 26 seats on average
- the average number of seats across all types of establishments varies from 18 to 29 seats
- based on the addresses of the establishments we have identified the most popular streets: SUNSET, WILSHIRE, PICO, WESTERN, FIGUEROA, OLYMPIC, VERMONT, MONICA, 3RD, HOLLYWOOD (see the chart above)
- there are 238 streets with only one establishment; these streets are not as competitive as the top ones, but they still have their customer base and formed demand
- we have defined the streets with lots of restaurants using the z-score breaking point (the mean value, about 22 establishments per street)
- on the popular streets a restaurant has 26 seats on average (the middle 50% have from 21 to 31 seats), the maximum is 100 seats, and only 5% of restaurants have over 36 seats
- the paired distributions show that the more establishments there are on a street, the closer they are to an average size; among the popular streets, the smallest and the largest restaurants are on the streets with 19-26 establishments (see the charts above)
- the average number of seats is higher in chain bakeries, cafes and fast food; for restaurants the average number of seats is approximately the same in chains and non-chains
General conclusion: the LA restaurant market is a highly competitive environment.
Based on the previous analysis we can offer 2 main strategies.
Strategy that requires more investment:
- open an original (non-chain) restaurant on one of the most popular streets of LA. These are areas with high effective demand and high foot traffic.
- type of establishment: restaurant. Most of the establishments there are restaurants, so the area has an already-formed demand and customers have certain expectations when visiting these streets.
- number of seats: 20-26, as this is the average for most types of establishments in the area. Most likely this is the optimal balance between occupancy and payback of the premises.
- non-chain, with an emphasis on the unique experience of the place.
Strategy that requires less investment:
- open a restaurant in a less popular area of LA with fewer restaurants. For example, it could be the university area (the dynamics, philosophy, vibe and open-mindedness of the area suit the idea of an innovative robot-run cafe without human labor) plus the expo area (a large flow of people and a dynamic, innovation-driven environment).
- a relatively low-competition area, with only 2 fast-food eateries.
- type of establishment: fast food. It is cheaper to operate, no high-class Michelin chefs are needed, and it suits the dynamics of the area. People are already used to a certain type of food and have similar expectations (fast food).
- develop a chain:
- additional income from selling franchise licenses,
- economies of scale
- number of seats: from 24 and up. Start near the lower point (the average for chains in LA is 26 seats), but there should be an opportunity to enlarge the capacity of the premises in the future. Presumably the rent in such an area is lower than on the most popular streets, so the restaurant should operate on a larger scale and benefit from economies of scale.
To make the final decision and choose a strategy we need additional information and thorough calculations. In the current circumstances of the Covid pandemic, the idea of a robot-run eatery with no people involved looks very timely.