Below is the market research of an open-source data on restaurants in LA with the purpose to open a small robot-run cafe in Los Angeles.
The main aim is to study the current market conditions.

Firstly we will preprocess the data on the restaurants and other eateries in LA.

Then we will study the data: choose the most popular restaurant streets and the least popular ones, study the current marker structure (establishment types and their shares), how number of seats vary depending on type and place of a restaurant.
Finally we will draw some conclusions.

In [1]:
import pandas as pd
from IPython.display import display
import plotly.express as px
from plotly import graph_objects as go
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
from scipy import stats as st
import numpy
import seaborn as sns
import re
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")

df = pd.read_csv('./rest_data_us.csv', sep=',')

Part 1. Data preprocessing

In [2]:
display(df.head())
id object_name address chain object_type number
0 11786 HABITAT COFFEE SHOP 3708 N EAGLE ROCK BLVD False Cafe 26
1 11787 REILLY'S 100 WORLD WAY # 120 False Restaurant 9
2 11788 STREET CHURROS 6801 HOLLYWOOD BLVD # 253 False Fast Food 20
3 11789 TRINITI ECHO PARK 1814 W SUNSET BLVD False Restaurant 22
4 11790 POLLEN 2100 ECHO PARK AVE False Restaurant 20
In [3]:
display(df.sample(6))
id object_name address chain object_type number
9384 21170 STARBUCKS #51056 1106 CORNWELL True Cafe 31
5843 17629 PENTHOUSE LOUNGE 1347 S ATLANTIC BLVD False Restaurant 81
4725 16511 THE GOOD PIZZA 7929 EMERSON AVE False Pizza 2
6422 18208 LA SERENATA 1842 E 1ST ST True Restaurant 44
8945 20731 LITTLE CEASARS # 5945 10714 S WESTERN AVE # 106A False Restaurant 4
3648 15434 ISAAN CLASSIC 2701 W SUNSET BLVD False Restaurant 42

Checking data types

In [4]:
df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9651 entries, 0 to 9650
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           9651 non-null   int64 
 1   object_name  9651 non-null   object
 2   address      9651 non-null   object
 3   chain        9648 non-null   object
 4   object_type  9651 non-null   object
 5   number       9651 non-null   int64 
dtypes: int64(2), object(4)
memory usage: 1.5 MB
In [5]:
df['object_name'].value_counts()
Out[5]:
THE COFFEE BEAN & TEA LEAF         47
SUBWAY                             31
DOMINO'S PIZZA                     15
KENTUCKY FRIED CHICKEN             14
WABA GRILL                         14
                                   ..
TASTE NOODLE HOUSE                  1
CAFE AT THE END OF THE UNIVERSE     1
WIMPY'S BURGER                      1
DEL TACO #988                       1
ANGELES                             1
Name: object_name, Length: 8672, dtype: int64
In [6]:
df['object_type'].value_counts()
Out[6]:
Restaurant    7255
Fast Food     1066
Cafe           435
Pizza          320
Bar            292
Bakery         283
Name: object_type, dtype: int64
In [7]:
df['chain'].value_counts()
Out[7]:
False    5972
True     3676
Name: chain, dtype: int64

Checking for duplicated data

In [8]:
#Checking for duplicated data
df.duplicated().sum()
Out[8]:
0
In [9]:
#stasistical summary of the number of seats
df['number'].describe()
Out[9]:
count    9651.000000
mean       43.695161
std        47.622874
min         1.000000
25%        14.000000
50%        27.000000
75%        46.000000
max       229.000000
Name: number, dtype: float64
In [10]:
df['number'].hist(bins=30, log=True)
plt.show()
In [11]:
#Calculatinug the 95th and 99th percentiles for number of places in a establishment. 
np.percentile(df['number'], [95, 99])
Out[11]:
array([148., 214.])
In [12]:
#statistical summary for categorical variables
df.describe(include=['object'])
Out[12]:
object_name address chain object_type
count 9651 9651 9648 9651
unique 8672 8517 2 6
top THE COFFEE BEAN & TEA LEAF 3607 TROUSDALE PKWY False Restaurant
freq 47 11 5972 7255
In [13]:
#looking into establiushments placed in the most popular address:
pop_adress = df.query('address =="3607 TROUSDALE PKWY"')
pop_adress
Out[13]:
id object_name address chain object_type number
457 12243 USC MAIN KITCHEN-BASEMENT 3607 TROUSDALE PKWY True Restaurant 227
462 12248 TRADITIONS PUB 3607 TROUSDALE PKWY False Restaurant 197
480 12266 PANDA EXPRESS 3607 TROUSDALE PKWY False Fast Food 40
858 12644 MORETON FIGS RESTAURANT/BAR 3607 TROUSDALE PKWY True Bar 107
863 12649 VERDE 3607 TROUSDALE PKWY False Restaurant 39
888 12674 CALIFORNIA PIZZA KITCHEN 3607 TROUSDALE PKWY True Pizza 44
2610 14396 THE COFFEE BEAN & TEA LEAF 3607 TROUSDALE PKWY True Cafe 43
3268 15054 THE HABIT BURGER GRILL 3607 TROUSDALE PKWY False Fast Food 43
6311 18097 LEMONADE 3607 TROUSDALE PKWY True Restaurant 138
7698 19484 LEMONADES 3607 TROUSDALE PKWY True Restaurant 111
7756 19542 COFFEE BEAN & TEA LEAF 3607 TROUSDALE PKWY True Cafe 31
In [14]:
df['object_name'].value_counts().count()
Out[14]:
8672

Checking for missing values

In [15]:
#Checking data for zeros:
for i in df.columns:
    print(i, len(df[df[i]==0]))
id 0
object_name 0
address 0
chain 5972
object_type 0
number 0
In [16]:
# defining columns with missing values 
missing_list=[]
for x in df:
    if len(df[df[x].isnull()])>0:
        missing_list.append(x)
print('The following columns have missing values: ', missing_list)
The following columns have missing values:  ['chain']
In [17]:
# calculating percentage of missing values in every column
missing_percentage=[]
for x in missing_list:
    missing_percentage.append([x,(len(df[df[x].isnull()])/len(df)*100)])
missing_percentage=pd.DataFrame(missing_percentage, columns=['column','missing_values_percentage%'])
missing_percentage.sort_values(by=['missing_values_percentage%'], ascending=False)
Out[17]:
column missing_values_percentage%
0 chain 0.031085
In [18]:
#cheking the rows withoud data in 'chain' column:
df.loc[df['chain'].isnull()]
Out[18]:
id object_name address chain object_type number
7408 19194 TAQUERIA LOS 3 CARNALES 5000 E WHITTIER BLVD NaN Restaurant 14
7523 19309 JAMMIN JIMMY'S PIZZA 1641 FIRESTONE BLVD NaN Pizza 1
8648 20434 THE LEXINGTON THEATER 129 E 3RD ST NaN Restaurant 35

We have defined that missing values are only in 'chain' column. The share of this data is low (0.03%). As it is lower than 2% we can drop these lines without scewing futher calculations.

In [19]:
# dropping the lines with missing values
df = df.dropna().reset_index(drop=True) 
#df=df[df.notnull().all(axis=1)]
In [20]:
#cheking if NaNs are dropped:
df[df.isnull().any(axis=1)] #displays at least one missing value in any column
Out[20]:
id object_name address chain object_type number

Changing data types

We have saved 0.3 Mb of the memory used initially by changing the data type.

In [21]:
#Changing data types from object to category to low the memory usage:

#from pandas.api.types import CategoricalDtype
df = pd.read_csv('./rest_data_us.csv', sep=',',
                  dtype={'object_name': 'str', 'address': 'str', 'object_type': 'category'}) 

df.info(memory_usage='deep')#checking how much is memory usage now. 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9651 entries, 0 to 9650
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   id           9651 non-null   int64   
 1   object_name  9651 non-null   object  
 2   address      9651 non-null   object  
 3   chain        9648 non-null   object  
 4   object_type  9651 non-null   category
 5   number       9651 non-null   int64   
dtypes: category(1), int64(2), object(3)
memory usage: 1.2 MB

Conclusion


The data for analysis contains 6 rows with 9651 entries.

There are 8672 unique names (out of 9651) of the Restaurants/cafes and 8517 unique adresses (out of 9651). The biggest chain (the most frequently encountered name) has 47 establishments; the name is 'THE COFFEE BEAN & TEA LEAF'. And the place with the biggest number of establishments (11 ones) has address as follows: 3607 TROUSDALE PKWY.

There are 6 types of the establishments in the data: Restaurant, Fast Food, Cafe, Pizza, Bar, Bakery .

Number of seats in the establishments vary from 1 to 229 with mean value of 44 seats. Middle 50% of establishments have from 14 to 46 seats. Not more than 5% of establishments have over 148 seats and only 1% have over 214 seats.

No duplicates have been found in data set.

Also the df contains information if the establishment is a chain or not. We had 3 establishments with missing values on this data and we dropped this lines without harm to the rest of the data as the percentage of missing values was less than 3%.

We have changed some data types from object to category to improve the memory usage (it dropped from 2.4 Mb to 1.8 Mb): columns 'object_type' and 'chain'.

Part 2. Data analysis

Proportions of the various types of establishments

As we have just figured it out in the previous part, there are 6 types of establisments in the region. Below is the graphical interpretation of the data.

In [22]:
#creating a table 
df_types = df.groupby(['object_type']).agg({'id': 'count'}).reset_index()
#renaming columns
df_types.columns = ['object_type' ,'number']
df_types.head()
Out[22]:
object_type number
0 Bakery 283
1 Bar 292
2 Cafe 435
3 Fast Food 1066
4 Pizza 320
In [23]:
#biulding a bar chart
sns.set_palette("Set2") 
plt.title('Number of different types of establishments')
ax = sns.barplot(x="object_type", y="number", data=df_types)
ax.set_xticklabels(df_types.object_type)
#for item in ax.get_xticklabels(): item.set_rotation(90)
for i, v in enumerate(df_types['number'].iteritems()):        
    ax.text(i ,v[1], "{:,}".format(v[1]), color='k', va ='bottom', rotation=45)
plt.tight_layout()
In [24]:
#pie chart for types of establishments
labels = df_types['object_type']
values = df_types['number']
fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.update_layout(title_text='Market Shares of different types of establishments')
fig.show()

We can see that the largest market share belongs to restaurants, they occupy 75% of the market. Fast food is the second large type but is 7 times smaller than the main leader with the market share of 11%. Other types (cafe, pizza, bar and bakery) have minor shares from 4.5% to 2.9% of the establishments.

Proportions of chain and nonchain establishments

As we have already figured that there just two types of this data: wether the establishment belongs to a certain chain or not. The market shares for these types look as follows:

In [25]:
df_chains = df.groupby(['chain']).agg({'id': 'count'}).reset_index()

df_chains.columns = ['chain' ,'number']
#df_chains['chain'] = df_chains['chain'].astype(str) #changing data type
#df_chains['chain'].replace({"False": "Not Chain", "True": "Is Chain"}, inplace=True) #changing value names
df_chains.head()
Out[25]:
chain number
0 False 5972
1 True 3676
In [26]:
#pie chart for types of establishments referring to a chain
labels = df_chains['chain']
values = df_chains['number']
fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.update_layout(title_text='Market Shares of Establishments')
fig.show()

The biggest market share belongs to establishments that do not refer to any specific chain and are independant, riginal and distinctive ones. Still chains occupy almost half of the market (38%).

Which type of establishment is typically a chain

We will create two data sets dor chain and non-chains and investigate the proportions.

In [27]:
#data set for non-chains
df_types_no_chains = df.query('chain == False').groupby(['object_type']).agg({'id': 'count'}).sort_values(by='id', ascending=False).reset_index()
df_types_no_chains
Out[27]:
object_type id
0 Restaurant 4961
1 Fast Food 461
2 Bar 215
3 Cafe 169
4 Pizza 166
5 Bakery 0
In [28]:
#data set for chains
df_types_chains = df.query('chain == True')

df_types_chains = df_types_chains.groupby(['object_type']).agg({'id': 'count'}).reset_index()
#renaming columns
df_types_chains.columns = ['object_type' ,'number']
df_types_chains.sort_values(by='number', ascending=False).reset_index(drop=True)
Out[28]:
object_type number
0 Restaurant 2292
1 Fast Food 605
2 Bakery 283
3 Cafe 266
4 Pizza 153
5 Bar 77
In [29]:
#visualizing the absolute values
sns.set_palette("Set2") 
plt.title('Number of different types of establishments (chains)')
ax = sns.barplot(x="object_type", y="number", data=df_types_chains)
ax.set_xticklabels(df_types_chains.object_type)
#for item in ax.get_xticklabels(): item.set_rotation(90)
for i, v in enumerate(df_types_chains['number'].iteritems()):        
    ax.text(i ,v[1], "{:,}".format(v[1]), color='k', va ='bottom', rotation=45)
plt.tight_layout()

In [30]:
#pie chart for market structure/ types of establishments: chais vs non-chains (visualizing the relative values)

from plotly.subplots import make_subplots

labels1 = df_types_chains['object_type']
labels2 = df_types_no_chains['object_type']
value1 = df_types_chains['number']
value2 = df_types_no_chains['id']

fig = make_subplots(2, 2, specs=[[{'type':'domain'},    {'type':'domain'}],
                           [{'type':'domain'}, {'type':'domain'}]], 
                    subplot_titles=['Chains', 'Non-chains'])
fig.add_trace(go.Pie(labels=labels1, values=value1, scalegroup='one',
                     name="Chains"), 1, 1)
fig.add_trace(go.Pie(labels=labels2, values=value2, scalegroup='one',
                     name="Non-chains"), 1, 2)

fig.update_layout(title_text='Market Shares of different types of establishments: chais vs non-chains')
fig.show()

An alternative charts is below.

In [31]:
#comparing absolute values
plt.figure(figsize=(10, 8))
splot=sns.countplot(x="object_type", hue="chain", data=df)
plt.xlabel('Type of establishments', size=14)
plt.title("Shares and number of different types of establishments: chais vs non-chains", size=18)
plt.ylabel('Number of establishments', size=14)
for p in splot.patches:
    splot.annotate(format(p.get_height(), '.0f'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   size=12,
                   xytext = (0, -12), 
                   textcoords = 'offset points')
    
plt.show()
In [32]:
#Calculating the percentage taking the whole market as 100%
plt.figure(figsize=(10, 8))
sns.set_palette("Set2")
plt.title('Shares and number of different types of establishments: chais vs non-chains', size=14)
ax = sns.countplot(x="object_type", hue="chain", data=df)
plt.ylabel('Number of establishments', size=12)
def with_hue(plot, feature, Number_of_categories, hue_categories):
    a = [p.get_height() for p in plot.patches]
    patch = [p for p in plot.patches]
    for i in range(Number_of_categories):
        #total = feature.value_counts().values[i]
        total = float(len(df)) 
        for j in range(hue_categories):
            percentage = '{:.1f}%'.format(100 * a[(j*Number_of_categories + i)]/total)
            x = patch[(j*Number_of_categories + i)].get_x() + patch[(j*Number_of_categories + i)].get_width() / 2 - 0.15
            y = patch[(j*Number_of_categories + i)].get_y() + patch[(j*Number_of_categories + i)].get_height() 
            ax.annotate(percentage, (x, y), size = 12)
    plt.show()
with_hue(ax, df.object_type, 6,2)    


Number of establishments that belong to different chains is lower than the absolule number of original establishments. Although if we look at relative values we can see that the market structure differs.

Restaurants hold the leading position both among chains and non-chains, although their market share among chains is significatly less than among non-chains (62% vs 83%).

Share of chain fast food establishments is twice as big as non-chains (16.5% vs 8%).

Share of chain bakeries and chain cafes are aproximately equal (7-8%). Wherease Non-chain bakery seems to be extinct in LA, and non-chain cafe is also a rare find (only 3% of the market).

Pizzas have aproximately the same share both in chains and non-chains(3-4%).

The least popular type of chain establishment is bar with only 2% of market share.

What characterizes chains: many establishments with a small number of seats or a few establishments with a lot of seats?

We will look separetly at distribution of number of establishments per chain and distribution of average number of seats per establishment. Then we will combine these distibutions to draw some conclusions.

In [33]:
df_chain = df.query('chain==True')
#df_chain = df_chain.drop(['address', 'chain', 'object_type'], axis=1)
#df_chain = df_chain.groupby('object_name')['id'].count()['number'].median()
df_chain = df_chain.groupby('object_name').agg({'id': ['count'], 'number':['median']}).reset_index()
df_chain.columns = ['chain_name','number_of_spots', 'seats_average']
df_chain.sample(5)
Out[33]:
chain_name number_of_spots seats_average
1396 MAPO RESTAURANT 1 48.0
2192 STARBUCKS COFFEE #6467 1 14.0
2604 VIP CHINESE RESTAURANT 1 6.0
2220 STICKY RICE 2 25.5
2554 TRIMANA RESTAURANT 1 15.0
In [34]:
#plotting first distribution
df_chain['number_of_spots'].hist(bins=30, log=True)
plt.title('Average number of establishments within a chain - distribution')
plt.xlabel('number of establishments')
plt.ylabel('density')
plt.show()

The histogram shows that most of the distribution lies within interval between 1 and 15 spots per chain. Also we can see that there are some outliers with over 30 establishments within a chain. We will check them up.

In [35]:
#Calculatinug the 95th and 99th percentiles for the number of establishments (setting out outliers). 
np.percentile(df_chain['number_of_spots'], [95, 99])
Out[35]:
array([3., 6.])

Not more than 5% of chains contain over 3 spots and only 1% of chains include over 6 establishments. Consequently chains with over 6 spots can be viewed as outliers.

In [36]:
df_chain.describe()
Out[36]:
number_of_spots seats_average
count 2733.000000 2733.000000
mean 1.345042 41.361325
std 1.489055 44.319385
min 1.000000 1.000000
25% 1.000000 14.000000
50% 1.000000 26.000000
75% 1.000000 45.000000
max 47.000000 229.000000
In [37]:
#counting outliers
df_chain.query('number_of_spots>6').count()
Out[37]:
chain_name         23
number_of_spots    23
seats_average      23
dtype: int64
In [38]:
#checking outliers
df_chain.query('number_of_spots>6').sort_values(by='number_of_spots', ascending=False)
Out[38]:
chain_name number_of_spots seats_average
2459 THE COFFEE BEAN & TEA LEAF 47 24.0
2226 SUBWAY 31 16.0
519 DOMINO'S PIZZA 15 12.0
2608 WABA GRILL 14 41.5
1089 KENTUCKY FRIED CHICKEN 14 31.5
2544 TRIMANA 13 33.0
1433 MCDONALD'S 13 100.0
2667 YOGURTLAND 12 18.0
1735 PAPA JOHN'S PIZZA 12 10.0
878 HONG KONG EXPRESS 12 17.0
2103 STARBUCKS 12 19.0
2320 SUBWAY SANDWICHES 11 17.0
2635 WINGSTOP 10 17.0
1347 LOUISIANA FRIED CHICKEN 10 9.0
633 EL POLLO LOCO 10 47.0
375 CHIPOTLE MEXICAN GRILL 10 36.0
163 BLUE BOTTLE COFFEE 9 27.0
1095 KFC 9 37.0
1024 JERSEY MIKE'S SUBS 8 22.0
288 CARL'S JR 8 59.0
117 BASKIN ROBBINS 8 24.0
1790 PINKBERRY 7 17.0
348 CHINA EXPRESS 7 18.0

We have got 23 chains with more than 6 establishments. The names of this chains are well known, no surprizes here.

In [39]:
df_chain.query('number_of_spots==1')
Out[39]:
chain_name number_of_spots seats_average
0 #1 CHINESE FAST FOOD 1 6.0
1 1810 RESTAURANT 1 45.0
2 333 1 180.0
3 3RD ST MARKET & DELI 1 7.0
4 4 CAMINOS RESTAURANT 1 18.0
... ... ... ...
2727 ZAHN EMERGENCY SHELTER 1 61.0
2729 ZEN BAKERY (RETAIL) 1 19.0
2730 ZERMENOS RESTAURANT INC 1 10.0
2731 ZION MARKET DELI & ISLAND 1 16.0
2732 ZIP SUSHI & IZAKAYA 1 98.0

2240 rows × 3 columns

There are 2240 chains with only one establishment.

In [40]:
#plotting distribution of average number of seats per establishment
sns.distplot(df_chain['seats_average'], bins=30)
plt.title('Distribution of average number of seats per establishment')
plt.xlabel('Average number of seats')
plt.ylabel('Density')
plt.show()
In [41]:
#plotting distribution of average number of seats per establishment excluding outliers
df_chain_no_outliers = df_chain.query('1<number_of_spots<6')
sns.distplot(df_chain_no_outliers['seats_average'], bins=30)
plt.title('Distribution of average number of seats per establishment (without outliers)')
plt.xlabel('Average number of seats')
plt.ylabel('Density')
plt.show()

Distribution looks almost the same.

In [42]:
#Calculatinug the 95th and 99th percentiles for the number of establishments (setting out outliers). 
np.percentile(df_chain['seats_average'], [95, 99])
Out[42]:
array([142.  , 204.68])

Not more than 5% of chains contain over 142 seats and only 1% of chains have over 205 seats.

In [43]:
#combining the distributions
sns.jointplot(x="number_of_spots", y="seats_average", data=df_chain)
plt.title('Paired distribution of number of seats per establishment and number of establishments per chain')
plt.show()
In [44]:
#distributions without outliers
sns.jointplot(x="number_of_spots", y="seats_average", data=df_chain_no_outliers)
plt.title('Paired distribution of number of seats per establishment and number of establishments per chain')
plt.show()

Conclusion


There are lots of chains on the market (38% of all establishments). But the chains are mostly small: only 23 chains (1% of them) have over 6 establishments, only 5% of chains have over 3 establishments. We have 2240 chains with only 1 establishment. It might be a labeling error in the data or a those chains have only one branch in LA.

Distribution of number of seats looks more normal. Middle 50% of establishments have from 14 to 45 seats, average number of seats is 26. Not more than 5% of chains contain over 142 seats.

The current situation on the market looks as follows: chains are small, most of them have only one establishment in LA; there are lots of average sized chain establishments (26 seats on average); only 2 chains have over 30 establishments. Typical caracteristic for a chain in LA would be: 1-2 branches with 26 seats on average.

Average number of seats for each type of restaurant. On average, which type of restaurant has the greatest number of seats?

Firstly we will have a look at the average number of seats in all establishments.

In [45]:
plt.title('Median number of seats for each type of establishments')

ax = sns.barplot(x="object_type", y="number", data=df) #mean value
for p in ax.patches:
    ax.annotate(format(p.get_height(), '.1f'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 9), 
                   textcoords = 'offset points')

Clearly the restaurants are the the largest establishments with 48 seats on average. The smallest are bakeries (average 22 seats) and cafes (25 seats).

Now we will check the median number of seats for all establishments (to get rid of the influence of outliers).

In [46]:
from numpy import median

#plt.figure(figsize=(8, 8))
plt.title('Median number of seats for each type of establishments')

ax = sns.barplot(x="object_type", y="number", data=df, estimator=median) #median value
for p in ax.patches:
    ax.annotate(format(p.get_height(), '.1f'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 9), 
                   textcoords = 'offset points')

As we have gotten rid of the influence of outliers, the average numbers for the leaders dropped: for restaurants it is now 29 seats on average. The bars have almost the same average number. The average number of seats for all types of establishments is from 18 to 29 seats.

Now we will check the difference in average number of seats in chains in comparison to non-chain establishments.

In [47]:
#plotting the bar chart for chains vs original establishments
plt.figure(figsize=(10, 8))
plt.title('Average number of seats for each type of establishments')

ax = sns.barplot(x="object_type", y="number", data=df, hue='chain', estimator=median) 

for p in ax.patches:
    ax.annotate(format(p.get_height(), '.1f'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 9), 
                   textcoords = 'offset points')
plt.xlabel("Types of establishments", size=14)
plt.ylabel("Number of seats", size=14)
plt.show()

Looking into restaurant data separetely:

In [48]:
df_rest = df.query('object_type=="Restaurant"')
sns.set(style="whitegrid")
sns.color_palette("deep")

plt.figure(figsize=(10, 3))  
plt.title('Average number of seats for restaurants')

ax = sns.barplot(x="number", y="object_type", data=df_rest, hue='chain', estimator=median) 
for p in ax.patches:
    width = p.get_width()
    plt.text(5+p.get_width(), p.get_y()+0.55*p.get_height(),
             '{:1.2f}'.format(width),
             ha='center', va='center')

plt.xlabel("Number of seats", size=14)
plt.ylabel("Type of restaurant", size=14)

plt.show()
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values

There is almost no difference in average number of seats for cafes and restaurants. Wherease for the rest of the type the differece is quite big. The most dramatic difference we can see in capacity of bakeries and bars. Chain bakeries are much bigger (by 18 seats on average), although chain bars are much smaller (by 12 seats on average).

Street names

In this section we are adding a column containing only street name, gettig rid of additional information in address of establishments.

In [49]:
#Filtering out the street name with usaddress library
#importing library
!pip install usaddress
import usaddress
Requirement already satisfied: usaddress in c:\users\hp\anaconda3\lib\site-packages (0.5.10)
Requirement already satisfied: python-crfsuite>=0.7 in c:\users\hp\anaconda3\lib\site-packages (from usaddress) (0.9.7)
Requirement already satisfied: probableparsing in c:\users\hp\anaconda3\lib\site-packages (from usaddress) (0.0.1)
Requirement already satisfied: future>=0.14 in c:\users\hp\anaconda3\lib\site-packages (from usaddress) (0.18.2)
In [50]:
def cleaning_check(raw):
    raw_address=usaddress.parse(raw)
    dict_address={}
    for i in raw_address:
        dict_address.update({i[1]:i[0]})
    #checking for normal case with street and street name type    
    if 'StreetName' in dict_address.keys() and 'StreetNamePostType' in dict_address.keys():
        clean_adress= str(dict_address['StreetName'])+" "+str(dict_address['StreetNamePostType'])
        return clean_adress
    else:
        return 'no street or street type'
In [51]:
df['street_check']=df.address.apply(cleaning_check).reset_index(drop=True)
In [52]:
df.head()
Out[52]:
id object_name address chain object_type number street_check
0 11786 HABITAT COFFEE SHOP 3708 N EAGLE ROCK BLVD False Cafe 26 ROCK BLVD
1 11787 REILLY'S 100 WORLD WAY # 120 False Restaurant 9 WORLD WAY
2 11788 STREET CHURROS 6801 HOLLYWOOD BLVD # 253 False Fast Food 20 HOLLYWOOD BLVD
3 11789 TRINITI ECHO PARK 1814 W SUNSET BLVD False Restaurant 22 SUNSET BLVD
4 11790 POLLEN 2100 ECHO PARK AVE False Restaurant 20 PARK AVE
In [53]:
#checking for lines that got no street name through cleaning_check function
df[df['street_check']=='no street or street type']
Out[53]:
id object_name address chain object_type number street_check
138 11924 CARL'S JR 308 WESTWOOD PLZ 505 True Restaurant 37 no street or street type
207 11993 CLIFTON'S CAFETERIA 648 S BROADWAY False Restaurant 2 no street or street type
209 11995 GOURMENT LA BAKERY 548 S BROADWAY True Bakery 27 no street or street type
257 12043 B TWENTY FOUR 2223 N BROADWAY False Restaurant 15 no street or street type
267 12053 THE HEIGHTS DELI AND BOTTLE SHOP 2927 N BROADWAY True Restaurant 11 no street or street type
... ... ... ... ... ... ... ...
9572 21358 LE MACARON 541 S SPRING # 127 True Restaurant 17 no street or street type
9575 21361 CHINESE TASTE FAST FOOD 1630 E CESAR E CHAVEZ True Restaurant 13 no street or street type
9587 21373 VALENCIA REST. GROUP #1 727 N BROADWAY # 120 False Restaurant 39 no street or street type
9590 21376 PIRATES MEXICAN FOOD 8511 S BROADWAY True Restaurant 5 no street or street type
9622 21408 LINCOLN KITCHEN & TAP 2118 N BROADWAY False Restaurant 76 no street or street type

250 rows × 7 columns

In [54]:
#parsing one of the problem addresses to investigate the root of the problem
usaddress.parse(df.iloc[1888]['address'])
Out[54]:
[('OLVERA', 'PlaceName'), ('ST', 'StateName'), ('E17', 'ZipCode')]
In [55]:
usaddress.parse(df.iloc[9622]['address'])
Out[55]:
[('2118', 'AddressNumber'),
 ('N', 'StreetNamePreDirectional'),
 ('BROADWAY', 'StreetName')]
In [56]:
usaddress.parse(df.iloc[1]['address'])
Out[56]:
[('100', 'AddressNumber'),
 ('WORLD', 'StreetName'),
 ('WAY', 'StreetNamePostType'),
 ('#', 'OccupancyIdentifier'),
 ('120', 'OccupancyIdentifier')]
In [57]:
def cleaning_final(raw):
    if raw.startswith('OLVERA'):
        clean_adress='OLVERA,Los Angeles,USA'
    elif raw.startswith('1033 1/2 LOS ANGELES'):
        clean_adress='1033 1/2 LOS ANGELES ST,Los Angeles,USA'
    else:
        raw_address=usaddress.parse(raw)
        dict_address={}
        for i in raw_address:
            dict_address.update({i[1]:i[0]})
        if 'StreetName' in dict_address.keys() and 'StreetNamePostType' in dict_address.keys():
            clean_adress= str(dict_address['StreetName'])+" "+str(dict_address['StreetNamePostType'])
            return clean_adress
        else:
            clean_adress1= str(dict_address['StreetName'])
        return clean_adress1
   # return clean_adress
In [58]:
df['street_name']=df.address.apply(cleaning_final).reset_index(drop=True)
df.sample(10)
Out[58]:
id object_name address chain object_type number street_check street_name
1060 12846 HAIFA RESTAURANT 8717 W PICO BLVD True Restaurant 38 PICO BLVD PICO BLVD
2307 14093 SUPERBA FOOD & BREAD 1900 LINCOLN BLVD True Restaurant 102 LINCOLN BLVD LINCOLN BLVD
3227 15013 BURGER KING #11024 5609 W SUNSET BLVD STE C True Fast Food 40 SUNSET BLVD SUNSET BLVD
8223 20009 CANDILEJAS CAFE 3425 E 1ST ST STE 104B False Restaurant 3 1ST ST 1ST ST
2525 14311 BAMBOO EXPRESS 600 S ALVARADO ST # 107-8 False Fast Food 28 ALVARADO ST ALVARADO ST
3136 14922 CHIPOTLE MEXICAN GRILL 5047 EAGLE ROCK BLVD True Restaurant 37 ROCK BLVD ROCK BLVD
2646 14432 ARTUR'S GARDEN 1240 HOOPER AVE False Restaurant 13 HOOPER AVE HOOPER AVE
4875 16661 MCDONALD'S #4844 1210 S SOTO ST True Restaurant 61 SOTO ST SOTO ST
7196 18982 CAFE AMERICANO 5770 MELROSE AVE STE 101 False Restaurant 27 MELROSE AVE MELROSE AVE
108 11894 JINYA RAMEN EXPRESS 6801 HOLLYWOOD BLVD # 315 False Fast Food 6 HOLLYWOOD BLVD HOLLYWOOD BLVD
In [59]:
#check for corrections in the problem addresses
df.query('street_name == "OLVERA ST"')
Out[59]:
id object_name address chain object_type number street_check street_name
2533 14319 EL PASEO INN 11 E OLVERA ST False Restaurant 91 OLVERA ST OLVERA ST
In [60]:
#check for corrections in the problem addresses
df.query('street_name == "LOS ANGELES ST"')
Out[60]:
id object_name address chain object_type number street_check street_name
In [61]:
#dropping extra column:
df = df.drop('street_check', axis=1).reset_index(drop=True)
df.head()
Out[61]:
id object_name address chain object_type number street_name
0 11786 HABITAT COFFEE SHOP 3708 N EAGLE ROCK BLVD False Cafe 26 ROCK BLVD
1 11787 REILLY'S 100 WORLD WAY # 120 False Restaurant 9 WORLD WAY
2 11788 STREET CHURROS 6801 HOLLYWOOD BLVD # 253 False Fast Food 20 HOLLYWOOD BLVD
3 11789 TRINITI ECHO PARK 1814 W SUNSET BLVD False Restaurant 22 SUNSET BLVD
4 11790 POLLEN 2100 ECHO PARK AVE False Restaurant 20 PARK AVE
In [62]:
df['street_name'].value_counts()
Out[62]:
SUNSET BLVD      403
WILSHIRE BLVD    397
PICO BLVD        370
WESTERN AVE      367
FIGUEROA ST      332
                ... 
ALAVARADO ST       1
PORTIA ST          1
FETTERLY AVE       1
BARRY AVE          1
MANZANITA ST       1
Name: street_name, Length: 529, dtype: int64

We have got a list of 529 unique street names and formed a separate column 'street_name'.

In [63]:
df['street2'] = (df['address']
                 .replace('(^[0-9]+)','',regex=True)
                 .replace('[0-9 #]+$','',regex=True)
                 .replace('[-#.&,]','',regex=True)
                 .replace('["\\/]','',regex=True)
                 .replace('(^| ).(( ).)*( |$)','',regex=True)
                 .replace('( ){2}','',regex=True)
                 .replace('^ *','',regex=True)
                 .replace(['FLR', 'FL', 'STE', 'SC', 'UNIT', 'APT', 'CPCE', 'MEZZ', 'LOBBY', 'LBBY', 'ROOM', 'LOWR', 'BLDG', 'AU', '101C', '2B'],'',regex=True)
                )
df['street2'].value_counts()
Out[63]:
SUNSET BLVD             336
PICO BLVD               331
FIGUEROA ST             270
WERN AVE                265
WILSHIRE BLVD           241
                       ... 
12HUNTINGTON DRSOUTH      1
PICO BLVD  10B            1
VENICE BLVD  C            1
GRANDE VISTA AVE          1
FETTERLY AVE              1
Name: street2, Length: 1016, dtype: int64
In [64]:
#Check up for the long street name
df.query('street2=="ECHO PARK AVE"')
Out[64]:
id object_name address chain object_type number street_name street2
4 11790 POLLEN 2100 ECHO PARK AVE False Restaurant 20 PARK AVE ECHO PARK AVE
801 12587 LITTLE CAESAR 1321 ECHO PARK AVE True Restaurant 16 PARK AVE ECHO PARK AVE
811 12597 VALERIE 1665 ECHO PARK AVE False Restaurant 6 PARK AVE ECHO PARK AVE
816 12602 CHANGO 1559 ECHO PARK AVE False Restaurant 34 PARK AVE ECHO PARK AVE
2151 13937 BEACON ECHO PARK 751 ECHO PARK AVE False Restaurant 29 PARK AVE ECHO PARK AVE
2886 14672 LOS BURRITOS 1312 ECHO PARK AVE True Restaurant 15 PARK AVE ECHO PARK AVE
2917 14703 THE HOLLOWAY 1320 ECHO PARK AVE False Restaurant 128 PARK AVE ECHO PARK AVE
4502 16288 GLOWING JUICES 1557 ECHO PARK AVE False Restaurant 3 PARK AVE ECHO PARK AVE
5585 17371 FIX COFFEE 2100 ECHO PARK AVE False Cafe 29 PARK AVE ECHO PARK AVE
5941 17727 CHINATOWN EXPRESS 1411 ECHO PARK AVE #101 True Fast Food 7 PARK AVE ECHO PARK AVE
6316 18102 THE WHISPERER 1325 ECHO PARK AVE False Restaurant 34 PARK AVE ECHO PARK AVE
6825 18611 SQUARE ONE AT THE BOATHOUSE 751 ECHO PARK AVE False Restaurant 23 PARK AVE ECHO PARK AVE
In [65]:
#As I have chosen to proceed with the 1st method (column 'street_name') I will drop the dropping extra column:
df = df.drop('street2', axis=1).reset_index(drop=True)
df.head()
Out[65]:
id object_name address chain object_type number street_name
0 11786 HABITAT COFFEE SHOP 3708 N EAGLE ROCK BLVD False Cafe 26 ROCK BLVD
1 11787 REILLY'S 100 WORLD WAY # 120 False Restaurant 9 WORLD WAY
2 11788 STREET CHURROS 6801 HOLLYWOOD BLVD # 253 False Fast Food 20 HOLLYWOOD BLVD
3 11789 TRINITI ECHO PARK 1814 W SUNSET BLVD False Restaurant 22 SUNSET BLVD
4 11790 POLLEN 2100 ECHO PARK AVE False Restaurant 20 PARK AVE

Top ten streets by number of restaurants


[back to top](#top)

[previous part](#8)

We will define the most popular streets for all establishments firstly and then the most popular ones for restaurants.

In [66]:
streets_by_est = df.groupby('street_name').agg({'id': ['count'], 'number': ['median']}).reset_index()

#renaming columns
streets_by_est.columns = ['street_name','establishments', 'avg_seats' ]
streets_by_est['avg_seats'] = streets_by_est['avg_seats'].astype(int)
streets_by_est.sort_values(by='establishments', ascending=False).head(10)
Out[66]:
street_name establishments avg_seats
446 SUNSET BLVD 403 34
511 WILSHIRE BLVD 397 36
390 PICO BLVD 370 27
499 WESTERN AVE 367 29
211 FIGUEROA ST 332 25
370 OLYMPIC BLVD 309 29
469 VERMONT AVE 286 28
347 MONICA BLVD 264 22
25 3RD ST 261 26
263 HOLLYWOOD BLVD 253 34
In [67]:
#saving top 10 streets by number of establishments to a separate data frame
top_10_streets = streets_by_est.query('establishments>=253').sort_values(by='establishments', ascending=False)
top_10_streets
Out[67]:
street_name establishments avg_seats
446 SUNSET BLVD 403 34
511 WILSHIRE BLVD 397 36
390 PICO BLVD 370 27
499 WESTERN AVE 367 29
211 FIGUEROA ST 332 25
370 OLYMPIC BLVD 309 29
469 VERMONT AVE 286 28
347 MONICA BLVD 264 22
25 3RD ST 261 26
263 HOLLYWOOD BLVD 253 34

In [68]:
plt.figure(figsize=(9,5))

# plot barh chart 
ax = sns.barplot(top_10_streets.street_name, top_10_streets.establishments, palette='mako') 
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.set(xlabel="Streets", ylabel='Number of establishments')

ax.set_xticklabels(top_10_streets.street_name)
for item in ax.get_xticklabels(): item.set_rotation(90)
for i, v in enumerate(top_10_streets["establishments"].iteritems()):        
    ax.text(i ,v[1], "{:,}".format(v[1]), color='k', va ='bottom', rotation=45)

plt.title('Top 10 streets with the biggest number of establishments')
plt.show()


We got 10 the most popular streets with the biggest number of establishments.

Now we will check the most popular streets for restaurants.

In [69]:
top_10_streets['street_name'].unique()
Out[69]:
array(['SUNSET BLVD', 'WILSHIRE BLVD', 'PICO BLVD', 'WESTERN AVE',
       'FIGUEROA ST', 'OLYMPIC BLVD', 'VERMONT AVE', 'MONICA BLVD',
       '3RD ST', 'HOLLYWOOD BLVD'], dtype=object)
In [70]:
streets_by_restaurants = df.query('object_type=="Restaurant"').groupby('street_name').agg({'id': ['count'], 'number': ['median']}).reset_index()
streets_by_restaurants.columns = ['street_name','establishments', 'avg_seats' ] #renaming columns
streets_by_restaurants['avg_seats'] = streets_by_restaurants['avg_seats'].astype(int)
streets_by_restaurants.sort_values(by='establishments', ascending=False).head(10)
Out[70]:
street_name establishments avg_seats
475 WILSHIRE BLVD 318 37
417 SUNSET BLVD 296 38
465 WESTERN AVE 291 34
363 PICO BLVD 283 29
344 OLYMPIC BLVD 257 31
194 FIGUEROA ST 233 25
436 VERMONT AVE 216 32
22 3RD ST 207 28
323 MONICA BLVD 204 24
244 HOLLYWOOD BLVD 197 38


We have got the same list of the streets with a slight difference in ranging.

We can assume that this is the list of the most popular and luxury streets in LA where most of establishments are restaurants.

These streets represent the most highly competitive region in LA.

Streets that only have one restaurant.

In [71]:
#streets with only one establishment
one_establishment_str = streets_by_est.query('establishments==1').count()
one_establishment_str
Out[71]:
street_name       238
establishments    238
avg_seats         238
dtype: int64
In [72]:
#streets with one restaurant:
one_rest_str = streets_by_restaurants.query('establishments==1').count()
one_rest_str
Out[72]:
street_name       227
establishments    227
avg_seats         227
dtype: int64

According to our data there are 238 streets with only one establishment and 227 streets where the only establishment is a restaurant. In other words if there is an only establishment on the street, most likety is will be a restaurant.

Streets with a lot of restaurants: distribution of the number of seats. Trends.


Firstly we will look at the distribution of number of establishments in the streets to define "streets with a lot of restaurants".

In [73]:
streets_by_est[['establishments', 'avg_seats']].describe()
Out[73]:
establishments avg_seats
count 529.000000 529.000000
mean 18.226843 45.763705
std 51.111709 46.346306
min 1.000000 2.000000
25% 1.000000 19.000000
50% 2.000000 28.000000
75% 9.000000 47.000000
max 403.000000 229.000000
In [74]:
#Calculatinug the percentiles for number of establishments per street. 
np.percentile(streets_by_est['establishments'], [85, 95, 99])
Out[74]:
array([ 23.  ,  89.2 , 302.56])
In [75]:
#poltting distribution
plt.figure(figsize=(9,5))
sns.distplot(streets_by_est['establishments']) 
 
plt.title('number of establishments per street')
plt.show()

We can see that middle 50% of streets have 1-9 restaurants. Only 15% of streets have over 23 restaurants and only 5% of streets have over 90 restaurants. To define to point that means "a lot" we will perfom additional calculations.


We will define how popular is the street in relation to the mean value of establishments registrated there. To to that we will calculate z-score for the distribution as a numerical measurement that describes a value's relationship to the mean of a group of values. If a Z-score is 0, it indicates that the number of establishments in the street is identical to the mean value.

In [76]:
#counting z-score for number of establishments to define distribution 
streets_by_est['establishments_z']=(streets_by_est['establishments']-streets_by_est['establishments'].mean())/streets_by_est['establishments'].std()
streets_by_est= streets_by_est.sort_values(by='establishments', ascending=False).reset_index(drop=True)
streets_by_est
Out[76]:
street_name establishments avg_seats establishments_z
0 SUNSET BLVD 403 34 7.528082
1 WILSHIRE BLVD 397 36 7.410692
2 PICO BLVD 370 27 6.882438
3 WESTERN AVE 367 29 6.823743
4 FIGUEROA ST 332 25 6.138968
... ... ... ... ...
524 GAREY ST 1 122 -0.337043
525 GARDNER ST 1 91 -0.337043
526 FRANCISCO ST 1 133 -0.337043
527 FORMOSA AVE 1 144 -0.337043
528 vine ST 1 162 -0.337043

529 rows × 4 columns

Below there is a chart with a breaking point. There are much more streets to the both sides, this one is just an illustation of the idea of z-score ranging.

In [77]:
#distinguishing z-score by colors: red for negative values, blue for positive ones
streets_by_est_z = streets_by_est.query('-0.1 <=establishments_z<=0.1').sort_values(by='establishments', ascending=False).reset_index(drop=True)
streets_by_est_z['colors'] = ['red' if x<0 else 'blue' for x in streets_by_est_z['establishments_z']]
plt.figure(figsize=(14,10))
plt.hlines(y=streets_by_est_z.street_name, xmin=0, xmax=streets_by_est_z.establishments_z, colors=streets_by_est_z.colors, alpha=0.4, linewidth=10)
plt.title('Defining streets with a lot of restaurants')
plt.xlabel("z-score", size=14)
plt.show()

Now when we have defined the breaking point (Z=0; establishments = 22), we will form a new data frame containing only streets with a lot of restaurants.

In [78]:
#filtering out streets with many restaurants: 
streets_many_rest = streets_by_est.query('establishments_z>0')
streets_many_rest.head()
Out[78]:
street_name establishments avg_seats establishments_z
0 SUNSET BLVD 403 34 7.528082
1 WILSHIRE BLVD 397 36 7.410692
2 PICO BLVD 370 27 6.882438
3 WESTERN AVE 367 29 6.823743
4 FIGUEROA ST 332 25 6.138968

Now we will move to study of seats distribution.

In [79]:
streets_many_rest.describe()
Out[79]:
establishments avg_seats establishments_z
count 92.000000 92.000000 92.000000
mean 90.445652 27.010870 1.412960
std 93.343436 10.092424 1.826263
min 19.000000 14.000000 0.015127
25% 31.500000 21.750000 0.259689
50% 54.500000 26.000000 0.709684
75% 97.750000 32.000000 1.555870
max 403.000000 101.000000 7.528082
In [80]:
#Calculatinug the percentiles for number of seats per restaurant. 
np.percentile(streets_many_rest['avg_seats'], [85, 95, 99])
Out[80]:
array([33.35, 36.  , 49.13])
In [81]:
sns.distplot(streets_many_rest['avg_seats'], bins=10)
plt.title('Distribution of number of seats in restaurants on the highly competitive streets')
plt.show()

General distrubution of seats number looks as follows: average number is 26, middle 50% of restaurants have from 21 to 31 seats, maximum number is 100 seats. Only 5% of restaurants have more than 36 seats. There are no restaurants in the popular streets with less than 14 seats.

Now we will use qcut() method to split the streets into 5 equal groups.

In [82]:
pd.qcut(streets_many_rest['establishments'], 5)
Out[82]:
0     (118.0, 403.0]
1     (118.0, 403.0]
2     (118.0, 403.0]
3     (118.0, 403.0]
4     (118.0, 403.0]
           ...      
87    (18.999, 26.2]
88    (18.999, 26.2]
89    (18.999, 26.2]
90    (18.999, 26.2]
91    (18.999, 26.2]
Name: establishments, Length: 92, dtype: category
Categories (5, interval[float64]): [(18.999, 26.2] < (26.2, 42.4] < (42.4, 74.4] < (74.4, 118.0] < (118.0, 403.0]]
In [83]:
#splitting streets into 5 categories

def cat(est):  
    if est <= 26:
        return '19-26'
    if est > 26 and est <= 42:
        return '27-42'
    if est > 42 and est <= 74:
        return '43-74'
    if est > 74 and est <= 118:
        return '75-118'
    return 'over 118'

streets_many_rest['establishments_number'] = streets_many_rest['establishments'].apply(cat)
streets_many_rest.head()
 
Out[83]:
street_name establishments avg_seats establishments_z establishments_number
0 SUNSET BLVD 403 34 7.528082 over 118
1 WILSHIRE BLVD 397 36 7.410692 over 118
2 PICO BLVD 370 27 6.882438 over 118
3 WESTERN AVE 367 29 6.823743 over 118
4 FIGUEROA ST 332 25 6.138968 over 118

Now we have 5 groups of streets and we will investigate distribution of seats for every group.

In [84]:
plt.figure(figsize=(22,8))
sns.stripplot(x="establishments_number", y="avg_seats", data=streets_many_rest) 

plt.title('Scatter plot for seats per different number of restaurants per street', size=14)
plt.show()
In [85]:
plt.figure(figsize=(22,8))
sns.violinplot(x="establishments_number", y="avg_seats", data=streets_many_rest, palette='rainbow')
plt.show()


There is no clear difference between the groups in the niumber of seats. We can only say that the biggest diference is between restaurants on the streets with 19-26 establishments distribution with a big tail to the right, more outliers, the largest restaurants are in this category) and the most popular streets with over 118 restaurant (distribution is smaller, from 22 to 42 seats).

The streets with 43-74 establishments tend to have restaurants with capasity above average (28-34).

In [86]:
est_seats = streets_many_rest.drop(['establishments_z'], axis=1) #dropping extra column
sns.pairplot(est_seats, hue='establishments_number')

plt.title('Paired distributions of average number of seats on different street')
plt.show()
In [87]:
#pair plot without 3rd parameter
est_seats1 = streets_many_rest.drop(['establishments_z', 'establishments_number'], axis=1)
sns.pairplot(est_seats1)

plt.title('Paired distributions of average number of seats on different street')
plt.show()


Paired distribution confirms that the more establishments there are in the street the closer they are to an average size.

The smallest and the largest restaurants are in the streets with 19-26 establishments.

This can be viewed as a general trend.

Overall conclusion and recommendations on restaurant type and number of seats. Possibility of developing a chain.

We have looked into the data on restaurants in LA and figured out the following:

  • 75% of establishments are restaurants, 11% fast_food; bakery is the least popular type of establishments
  • 38% of establishments belong to a chain
  • structure of market for chain and non-chains differs see pic
  • lots of chains have only 1 establishment in LA
  • a typical characteristic for a chain in LA would be: 1-2 branches with 26 seats on average.
  • an average number of seats for all types of establishments varies from 18 to 29 seats
  • based on the addresses of the establishments we have figured out the most popular streets: SUNSET, 'WILSHIRE', 'PICO', 'WESTERN', 'FIGUEROA', 'OLYMPIC', 'VERMONT', 'MONICA', '3RD', 'HOLLYWOOD.see chart
  • there are 185 streets with only one establishment. These are the streets that are not as competitive as the top ones but still they have their customer base and formed demand.
  • we have defined streets with lots of restaurants based on the average number of restaurants (which is 21 restaurant per street).
  • On average there are 26 seats in a restaurant that is placed in the popular streets (middle 50% of them have from 14 to 31 seats), maximum number of seats is 44. Only 1% of restaurants have over 40 seats.
  • Paired distribution shows that the more establishments there are in the street the closer they are to an average size. The smallest and the largest restaurants are in the streets with 22-30 establishments (among the popular streets) chart.
  • average number of seats is higher in chain bakeries, cafes and fast food, for the restaurants the average number of seats is approximately the same.

General conclusion. LA restaurant market is a highly competitive medium.
Based on the previous analysis we can offer 2 main strategies.

  1. Stategy that requires more investments

    • open original (non-chain) restaurant on one of the most popular streets of LA. These are areas with highly effective demand and high foot traffic.
    • type of the establishment - restaurant. Most of the establiushments there are restaurants so the area has already formed demand and customers have certain expectations visiting this streats.
    • number of seats 20-26 as this is an average number for most types of establishments in the area. Most likely this is the optimal rate of occupasy and payback of the premises.
    • no-chain, highlight on the unique experience of the place.
  2. Strategy that requires less investments

    • open a restauranton a less popular area of LA with fewer restaurants. For example it could be the university area (dynamics, philosophy, vibes, open-mindnes of the area suits the idea of an innovative robot-run cafe without man labor) + expo area (large people flow, dynamics, innovations-inspired and driven medium).
    • relatively low competitive area, only 2 fast-food eateries.
    • type of the establishment - fast-food. It is cheaper to operate, no high-class Michelin chefs needed, suits the dynamics of the area. People already got used to a sertain type of food and have similar expectations (fast food)
    • develop a chain:
      • additional income for selling franchise license,
      • economies of scale
    • number of seats from 24 and up. Start at the lower point (24 seats is an average number for chains in LA), but there should be an opportunity to enlarge the capasity of the premises in the future. Supposedly rental fare in the such area is lower than in the top popular streets. AThus the restaurant should operate on a large scale and we could use economies of scale.

To make final desisions and choose the strategy we need some additional information:

  1. rental fare in the chosen area,
  2. average check amount and its dynamics,
  3. average attendance of the restaurants in the area
  4. what plans do athourities have for developing the area for the next 5-10 years (erecting new buildings, reconsructions, building new roads, interchanges)
  5. calculating break-even point and other key metrics

In order to choose the final strategy we need additional thorough calulations. In current circumstances with Covid pandemic the idea of a robot-run eatery without people involved looks like a very timely one.

In [ ]: