Below is market research based on open-source data on restaurants in LA, with the purpose of opening a small robot-run cafe in Los Angeles.
The main aim is to study the current market conditions.
First we will preprocess the data on restaurants and other eateries in LA.
Then we will study the data: identify the most and the least popular restaurant streets, study the current market structure (establishment types and their shares), and examine how the number of seats varies depending on the type and location of a restaurant.
Finally we will draw some conclusions.
Part 1. Data preprocessing
Part 2. Data analysis
Proportions of the various types of establishments
Proportions of chain and non-chain establishments
Which type of establishment is typically a chain
What characterizes chains: many establishments with a small number of seats or a few establishments with a lot of seats?
Average number of seats for each type of restaurant
Street names
Top ten streets by number of restaurants
Streets that only have one restaurant
Streets with a lot of restaurants: distribution of the number of seats. Trends.
Overall conclusion and recommendations on restaurant type and number of seats. Possibility of developing a chain.
import pandas as pd
from IPython.display import display
import plotly.express as px
from plotly import graph_objects as go
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
from scipy import stats as st
import seaborn as sns
import re
import sys
import warnings

if not sys.warnoptions:
    warnings.simplefilter("ignore")
df = pd.read_csv('./rest_data_us.csv', sep=',')
display(df.head())
display(df.sample(6))
df.info(memory_usage='deep')
df['object_name'].value_counts()
df['object_type'].value_counts()
df['chain'].value_counts()
#Checking for duplicated data
df.duplicated().sum()
#statistical summary of the number of seats
df['number'].describe()
df['number'].hist(bins=30, log=True)
plt.show()
#Calculating the 95th and 99th percentiles for the number of seats in an establishment.
np.percentile(df['number'], [95, 99])
#statistical summary for categorical variables
df.describe(include=['object'])
#looking into establishments at the most popular address:
pop_address = df.query('address == "3607 TROUSDALE PKWY"')
pop_address
df['object_name'].value_counts().count()
#Checking data for zeros:
for i in df.columns:
    print(i, len(df[df[i] == 0]))
# defining columns with missing values
missing_list = []
for x in df:
    if len(df[df[x].isnull()]) > 0:
        missing_list.append(x)
print('The following columns have missing values: ', missing_list)
# calculating the percentage of missing values in every column
missing_percentage = []
for x in missing_list:
    missing_percentage.append([x, (len(df[df[x].isnull()]) / len(df) * 100)])
missing_percentage = pd.DataFrame(missing_percentage, columns=['column', 'missing_values_percentage%'])
missing_percentage.sort_values(by=['missing_values_percentage%'], ascending=False)
#checking the rows without data in the 'chain' column:
df.loc[df['chain'].isnull()]
We have determined that missing values occur only in the 'chain' column. The share of this data is low (0.03%). As it is far below 2%, we can drop these rows without skewing further calculations.
# dropping the lines with missing values
df = df.dropna().reset_index(drop=True)
#df=df[df.notnull().all(axis=1)]
#checking that NaNs are dropped:
df[df.isnull().any(axis=1)] #displays at least one missing value in any column
Next we reduce memory usage by changing some data types from object to category; the check below shows the effect.
#Changing data types from object to category to lower memory usage.
#Note: we convert in place rather than re-reading the CSV, which would undo the dropna above.
df['object_type'] = df['object_type'].astype('category')
df['chain'] = df['chain'].astype('category')
df.info(memory_usage='deep')  #checking how much memory is used now
The data for analysis contains 6 columns with 9651 entries.
There are 8672 unique names (out of 9651) of restaurants/cafes and 8517 unique addresses (out of 9651). The biggest chain (the most frequently encountered name) is 'THE COFFEE BEAN & TEA LEAF' with 47 establishments, and the address with the largest number of establishments (11 of them) is 3607 TROUSDALE PKWY.
There are 6 types of establishments in the data: Restaurant, Fast Food, Cafe, Pizza, Bar, Bakery.
The number of seats varies from 1 to 229 with a mean value of 44 seats. The middle 50% of establishments have from 14 to 46 seats. No more than 5% of establishments have over 148 seats and only 1% have over 214 seats.
No duplicates have been found in the data set.
The data also indicates whether an establishment belongs to a chain. We had 3 establishments with missing values here and dropped these rows without harm to the rest of the data, as the share of missing values was only about 0.03%.
We have changed the data types of the 'object_type' and 'chain' columns from object to category to reduce memory usage (it dropped from 2.4 Mb to 1.8 Mb).
As we figured out in the previous part, there are 6 types of establishments in the region. Below is a graphical interpretation of the data.
#creating a table
df_types = df.groupby(['object_type']).agg({'id': 'count'}).reset_index()
#renaming columns
df_types.columns = ['object_type' ,'number']
df_types.head()
#building a bar chart
sns.set_palette("Set2")
plt.title('Number of different types of establishments')
ax = sns.barplot(x="object_type", y="number", data=df_types)
for i, v in df_types['number'].items():
    ax.text(i, v, "{:,}".format(v), color='k', va='bottom', rotation=45)
plt.tight_layout()
#pie chart for types of establishments
labels = df_types['object_type']
values = df_types['number']
fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.update_layout(title_text='Market Shares of different types of establishments')
fig.show()
We can see that the largest market share belongs to restaurants: they occupy 75% of the market. Fast food is the second-largest type but is about 7 times smaller than the leader, with a market share of 11%. The other types (cafe, pizza, bar and bakery) have minor shares, from 4.5% down to 2.9% of the establishments.
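As a quick cross-check of the pie chart, the exact shares can be computed directly from the df_types table built above (a one-line sketch):
#exact market shares (%) behind the pie chart
(df_types.set_index('object_type')['number'] / df_types['number'].sum() * 100).round(1)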
As we have already seen, there are just two categories here: whether the establishment belongs to a chain or not. The market shares for these two groups look as follows:
df_chains = df.groupby(['chain']).agg({'id': 'count'}).reset_index()
df_chains.columns = ['chain' ,'number']
#df_chains['chain'] = df_chains['chain'].astype(str) #changing data type
#df_chains['chain'].replace({"False": "Not Chain", "True": "Is Chain"}, inplace=True) #changing value names
df_chains.head()
#pie chart for types of establishments referring to a chain
labels = df_chains['chain']
values = df_chains['number']
fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.update_layout(title_text='Market Shares of Establishments')
fig.show()
The biggest market share belongs to establishments that do not belong to any chain and are independent, original and distinctive. Still, chains occupy a substantial share of the market (38%).
We will create two data sets, for chains and non-chains, and investigate the proportions.
#data set for non-chains
df_types_no_chains = df.query('chain == False').groupby(['object_type']).agg({'id': 'count'}).sort_values(by='id', ascending=False).reset_index()
df_types_no_chains
#data set for chains
df_types_chains = df.query('chain == True')
df_types_chains = df_types_chains.groupby(['object_type']).agg({'id': 'count'}).reset_index()
#renaming columns
df_types_chains.columns = ['object_type' ,'number']
df_types_chains.sort_values(by='number', ascending=False).reset_index(drop=True)
#visualizing the absolute values
sns.set_palette("Set2")
plt.title('Number of different types of establishments (chains)')
ax = sns.barplot(x="object_type", y="number", data=df_types_chains)
for i, v in df_types_chains['number'].items():
    ax.text(i, v, "{:,}".format(v), color='k', va='bottom', rotation=45)
plt.tight_layout()
#pie charts for market structure / types of establishments: chains vs non-chains (visualizing the relative values)
from plotly.subplots import make_subplots
labels1 = df_types_chains['object_type']
labels2 = df_types_no_chains['object_type']
value1 = df_types_chains['number']
value2 = df_types_no_chains['id']
fig = make_subplots(1, 2, specs=[[{'type': 'domain'}, {'type': 'domain'}]],
                    subplot_titles=['Chains', 'Non-chains'])
fig.add_trace(go.Pie(labels=labels1, values=value1, scalegroup='one',
                     name="Chains"), 1, 1)
fig.add_trace(go.Pie(labels=labels2, values=value2, scalegroup='one',
                     name="Non-chains"), 1, 2)
fig.update_layout(title_text='Market shares of different types of establishments: chains vs non-chains')
fig.show()
An alternative chart is below.
#comparing absolute values
plt.figure(figsize=(10, 8))
splot = sns.countplot(x="object_type", hue="chain", data=df)
plt.xlabel('Type of establishments', size=14)
plt.title("Shares and number of different types of establishments: chains vs non-chains", size=18)
plt.ylabel('Number of establishments', size=14)
for p in splot.patches:
    splot.annotate(format(p.get_height(), '.0f'),
                   (p.get_x() + p.get_width() / 2., p.get_height()),
                   ha='center', va='center',
                   size=12,
                   xytext=(0, -12),
                   textcoords='offset points')
plt.show()
#Calculating the percentage taking the whole market as 100%
plt.figure(figsize=(10, 8))
sns.set_palette("Set2")
plt.title('Shares and number of different types of establishments: chains vs non-chains', size=14)
ax = sns.countplot(x="object_type", hue="chain", data=df)
plt.ylabel('Number of establishments', size=12)

def with_hue(plot, feature, number_of_categories, hue_categories):
    a = [p.get_height() for p in plot.patches]
    patch = [p for p in plot.patches]
    for i in range(number_of_categories):
        total = float(len(df))  #the whole market is taken as 100%
        for j in range(hue_categories):
            percentage = '{:.1f}%'.format(100 * a[j * number_of_categories + i] / total)
            x = patch[j * number_of_categories + i].get_x() + patch[j * number_of_categories + i].get_width() / 2 - 0.15
            y = patch[j * number_of_categories + i].get_y() + patch[j * number_of_categories + i].get_height()
            plot.annotate(percentage, (x, y), size=12)

with_hue(ax, df.object_type, 6, 2)
plt.show()
The absolute number of chain establishments is lower than the absolute number of original ones. But if we look at relative values we can see that the market structure differs.
Restaurants hold the leading position both among chains and non-chains, although their market share among chains is significantly lower than among non-chains (62% vs 83%).
The share of chain fast food establishments is twice as big as among non-chains (16.5% vs 8%).
The shares of chain bakeries and chain cafes are approximately equal (7-8%), whereas non-chain bakeries are almost extinct in LA, and a non-chain cafe is also a rare find (only 3% of the market).
Pizza places have approximately the same share among chains and non-chains (3-4%).
The least popular type of chain establishment is the bar, with only 2% of the market share.
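The same structure can be read off as a table (a sketch; normalize='columns' in pd.crosstab gives the per-type shares within chains and within non-chains separately):
#per-type shares (%) within chains and within non-chains
(pd.crosstab(df['object_type'], df['chain'], normalize='columns') * 100).round(1)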
We will look separately at the distribution of the number of establishments per chain and at the distribution of the average number of seats per establishment. Then we will combine these distributions to draw some conclusions.
df_chain = df.query('chain==True')
#df_chain = df_chain.drop(['address', 'chain', 'object_type'], axis=1)
#df_chain = df_chain.groupby('object_name')['id'].count()['number'].median()
df_chain = df_chain.groupby('object_name').agg({'id': ['count'], 'number':['median']}).reset_index()
df_chain.columns = ['chain_name','number_of_spots', 'seats_average']
df_chain.sample(5)
#plotting the first distribution
df_chain['number_of_spots'].hist(bins=30, log=True)
plt.title('Number of establishments per chain - distribution')
plt.xlabel('number of establishments')
plt.ylabel('number of chains (log scale)')
plt.show()
The histogram shows that most of the distribution lies within the interval between 1 and 15 spots per chain. We can also see some outliers with over 30 establishments within a chain. We will examine them.
#Calculating the 95th and 99th percentiles for the number of establishments per chain (setting outlier thresholds).
np.percentile(df_chain['number_of_spots'], [95, 99])
Not more than 5% of chains contain over 3 spots and only 1% of chains include over 6 establishments. Consequently chains with over 6 spots can be viewed as outliers.
df_chain.describe()
#counting outliers
df_chain.query('number_of_spots>6').count()
#checking outliers
df_chain.query('number_of_spots>6').sort_values(by='number_of_spots', ascending=False)
We have 23 chains with more than 6 establishments. The names of these chains are well known, so no surprises here.
df_chain.query('number_of_spots==1')
There are 2240 chains with only one establishment.
#plotting the distribution of the average number of seats per establishment
sns.histplot(df_chain['seats_average'], bins=30, kde=True, stat='density')
plt.title('Distribution of average number of seats per establishment')
plt.xlabel('Average number of seats')
plt.ylabel('Density')
plt.show()
#plotting the distribution of the average number of seats per establishment, excluding outliers
df_chain_no_outliers = df_chain.query('1 < number_of_spots < 6')
sns.histplot(df_chain_no_outliers['seats_average'], bins=30, kde=True, stat='density')
plt.title('Distribution of average number of seats per establishment (without outliers)')
plt.xlabel('Average number of seats')
plt.ylabel('Density')
plt.show()
Distribution looks almost the same.
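To back this impression up quantitatively (a sketch; st is scipy.stats as imported at the top), a two-sample Kolmogorov-Smirnov test can compare the two seat distributions; a large p-value is consistent with them looking the same:
#two-sample KS test: seats_average with vs without outlier chains
ks_stat, p_value = st.ks_2samp(df_chain['seats_average'], df_chain_no_outliers['seats_average'])
print('KS statistic: {:.3f}, p-value: {:.3f}'.format(ks_stat, p_value))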
#Calculating the 95th and 99th percentiles for the average number of seats per establishment.
np.percentile(df_chain['seats_average'], [95, 99])
No more than 5% of chains have over 142 seats on average, and only 1% of chains have over 205 seats.
#combining the distributions
g = sns.jointplot(x="number_of_spots", y="seats_average", data=df_chain)
g.fig.suptitle('Paired distribution of seats per establishment and establishments per chain', y=1.02)
plt.show()
#distributions without outliers
g = sns.jointplot(x="number_of_spots", y="seats_average", data=df_chain_no_outliers)
g.fig.suptitle('Paired distribution of seats per establishment and establishments per chain (without outliers)', y=1.02)
plt.show()
There are lots of chains on the market (38% of all establishments), but the chains are mostly small: only 23 chains (1% of them) have over 6 establishments, and only 5% of chains have over 3 establishments. There are 2240 chains with only 1 establishment; this might be a labeling error in the data, or those chains may have only one branch in LA.
The distribution of the number of seats looks closer to normal. The middle 50% of establishments have from 14 to 45 seats, and the average number of seats is 26. No more than 5% of chains have over 142 seats on average.
The current market situation looks as follows: chains are small, and most of them have only one establishment in LA; there are lots of average-sized chain establishments (26 seats on average); only 2 chains have over 30 establishments. A typical profile for a chain in LA would be 1-2 branches with 26 seats on average.
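As a sanity check of this profile, the medians can be read directly from the df_chain table built above (a sketch; medians are robust to the heavy right tail seen earlier):
#deriving the "typical chain" profile with robust statistics
print('Median establishments per chain:', df_chain['number_of_spots'].median())
print('Median of average seats per establishment:', df_chain['seats_average'].median())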
Firstly we will have a look at the average number of seats in all establishments.
plt.title('Average (mean) number of seats for each type of establishment')
ax = sns.barplot(x="object_type", y="number", data=df)  #the default estimator is the mean
for p in ax.patches:
    ax.annotate(format(p.get_height(), '.1f'),
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center',
                xytext=(0, 9),
                textcoords='offset points')
Clearly restaurants are the largest establishments, with 48 seats on average. The smallest are bakeries (22 seats on average) and cafes (25 seats).
Now we will check the median number of seats for all establishments (to get rid of the influence of outliers).
from numpy import median
plt.title('Median number of seats for each type of establishment')
ax = sns.barplot(x="object_type", y="number", data=df, estimator=median)  #median value
for p in ax.patches:
    ax.annotate(format(p.get_height(), '.1f'),
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center',
                xytext=(0, 9),
                textcoords='offset points')
With the influence of outliers removed, the numbers for the leaders dropped: restaurants now have a median of 29 seats. The figure for bars is almost unchanged. The median number of seats across all types of establishments ranges from 18 to 29.
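The medians behind the chart can also be listed directly (a one-line sketch):
#median number of seats per establishment type
df.groupby('object_type')['number'].median().sort_values(ascending=False)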
Now we will check the difference in average number of seats in chains in comparison to non-chain establishments.
#plotting the bar chart for chains vs original establishments
plt.figure(figsize=(10, 8))
plt.title('Median number of seats for each type of establishment')
ax = sns.barplot(x="object_type", y="number", data=df, hue='chain', estimator=median)
for p in ax.patches:
    ax.annotate(format(p.get_height(), '.1f'),
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center',
                xytext=(0, 9),
                textcoords='offset points')
plt.xlabel("Types of establishments", size=14)
plt.ylabel("Number of seats", size=14)
plt.show()
Looking into the restaurant data separately:
df_rest = df.query('object_type=="Restaurant"')
sns.set(style="whitegrid")
sns.color_palette("deep")
plt.figure(figsize=(10, 3))
plt.title('Median number of seats for restaurants')
ax = sns.barplot(x="number", y="object_type", data=df_rest, hue='chain', estimator=median)
for p in ax.patches:
    width = p.get_width()
    plt.text(5 + p.get_width(), p.get_y() + 0.55 * p.get_height(),
             '{:1.2f}'.format(width),
             ha='center', va='center')
plt.xlabel("Number of seats", size=14)
plt.ylabel("Type of restaurant", size=14)
plt.show()
There is almost no difference in the median number of seats between chain and non-chain cafes and restaurants, whereas for the other types the difference is quite big. The most dramatic differences are in the capacity of bakeries and bars: chain bakeries are much bigger (by 18 seats on average), while chain bars are much smaller (by 12 seats on average).
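The same comparison can be read as a table (a sketch; unstack puts the chain and non-chain medians side by side):
#median seats per type: chains vs non-chains
df.groupby(['object_type', 'chain'])['number'].median().unstack('chain')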
In this section we add a column containing only the street name, getting rid of the additional information in the establishments' addresses.
#Filtering out the street name with usaddress library
#importing library
!pip install usaddress
import usaddress
def cleaning_check(raw):
    raw_address = usaddress.parse(raw)
    dict_address = {}
    for i in raw_address:
        dict_address.update({i[1]: i[0]})
    #checking for the normal case with a street name and a street type
    if 'StreetName' in dict_address.keys() and 'StreetNamePostType' in dict_address.keys():
        clean_address = str(dict_address['StreetName']) + " " + str(dict_address['StreetNamePostType'])
        return clean_address
    else:
        return 'no street or street type'

df['street_check'] = df.address.apply(cleaning_check).reset_index(drop=True)
df.head()
#checking for lines that got no street name through cleaning_check function
df[df['street_check']=='no street or street type']
#parsing some of the problem addresses (and a normal one for comparison) to investigate the root of the problem
usaddress.parse(df.iloc[1888]['address'])
usaddress.parse(df.iloc[9622]['address'])
usaddress.parse(df.iloc[1]['address'])
def cleaning_final(raw):
    #special cases found above are handled explicitly
    if raw.startswith('OLVERA'):
        clean_address = 'OLVERA,Los Angeles,USA'
    elif raw.startswith('1033 1/2 LOS ANGELES'):
        clean_address = '1033 1/2 LOS ANGELES ST,Los Angeles,USA'
    else:
        raw_address = usaddress.parse(raw)
        dict_address = {}
        for i in raw_address:
            dict_address.update({i[1]: i[0]})
        if 'StreetName' in dict_address.keys() and 'StreetNamePostType' in dict_address.keys():
            clean_address = str(dict_address['StreetName']) + " " + str(dict_address['StreetNamePostType'])
        else:
            clean_address = str(dict_address['StreetName'])
    return clean_address

df['street_name'] = df.address.apply(cleaning_final).reset_index(drop=True)
df.sample(10)
#check for corrections in the problem addresses
df.query('street_name == "OLVERA ST"')
#check for corrections in the problem addresses
df.query('street_name == "LOS ANGELES ST"')
#dropping extra column:
df = df.drop('street_check', axis=1).reset_index(drop=True)
df.head()
df['street_name'].value_counts()
We have got a list of 529 unique street names and formed a separate column 'street_name'.
df['street2'] = (df['address']
.replace('(^[0-9]+)','',regex=True)
.replace('[0-9 #]+$','',regex=True)
.replace('[-#.&,]','',regex=True)
.replace('["\\/]','',regex=True)
.replace('(^| ).(( ).)*( |$)','',regex=True)
.replace('( ){2}','',regex=True)
.replace('^ *','',regex=True)
.replace(['FLR', 'FL', 'STE', 'SC', 'UNIT', 'APT', 'CPCE', 'MEZZ', 'LOBBY', 'LBBY', 'ROOM', 'LOWR', 'BLDG', 'AU', '101C', '2B'],'',regex=True)
)
df['street2'].value_counts()
#Check up for the long street name
df.query('street2=="ECHO PARK AVE"')
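Before settling on one method, we can check how often the two street-extraction approaches agree (a rough sketch; this is an exact string comparison after stripping whitespace, so minor formatting differences count as disagreement):
#share of rows where the two street-extraction methods agree exactly
(df['street_name'].str.strip() == df['street2'].str.strip()).mean()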
#As I have chosen to proceed with the 1st method (column 'street_name'), I will drop the extra column:
df = df.drop('street2', axis=1).reset_index(drop=True)
df.head()
We will define the most popular streets for all establishments first, and then the most popular ones for restaurants.
streets_by_est = df.groupby('street_name').agg({'id': ['count'], 'number': ['median']}).reset_index()
#renaming columns
streets_by_est.columns = ['street_name','establishments', 'avg_seats' ]
streets_by_est['avg_seats'] = streets_by_est['avg_seats'].astype(int)
streets_by_est.sort_values(by='establishments', ascending=False).head(10)
#saving the top 10 streets by number of establishments to a separate data frame
top_10_streets = streets_by_est.sort_values(by='establishments', ascending=False).head(10)
top_10_streets
plt.figure(figsize=(9, 5))
#plotting the bar chart
ax = sns.barplot(x=top_10_streets.street_name, y=top_10_streets.establishments, palette='mako')
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.set(xlabel="Streets", ylabel='Number of establishments')
for item in ax.get_xticklabels():
    item.set_rotation(90)
for i, v in enumerate(top_10_streets["establishments"]):
    ax.text(i, v, "{:,}".format(v), color='k', va='bottom', rotation=45)
plt.title('Top 10 streets with the biggest number of establishments')
plt.show()
We have got the 10 most popular streets with the biggest number of establishments.
Now we will check the most popular streets for restaurants.
top_10_streets['street_name'].unique()
streets_by_restaurants = df.query('object_type=="Restaurant"').groupby('street_name').agg({'id': ['count'], 'number': ['median']}).reset_index()
streets_by_restaurants.columns = ['street_name','establishments', 'avg_seats' ] #renaming columns
streets_by_restaurants['avg_seats'] = streets_by_restaurants['avg_seats'].astype(int)
streets_by_restaurants.sort_values(by='establishments', ascending=False).head(10)
We have got the same list of streets, with slight differences in ranking.
We can assume that this is the list of the most popular and upscale streets in LA, where most establishments are restaurants.
These streets represent the most highly competitive areas in LA.
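The claim that the two lists match can be verified directly (a sketch comparing the two top-10 street sets):
#do the top 10 streets by all establishments and by restaurants coincide?
top10_rest_streets = (streets_by_restaurants
                      .sort_values(by='establishments', ascending=False)
                      .head(10)['street_name'])
set(top_10_streets['street_name']) == set(top10_rest_streets)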
#streets with only one establishment
one_establishment_str = streets_by_est.query('establishments==1').count()
one_establishment_str
#streets with one restaurant:
one_rest_str = streets_by_restaurants.query('establishments==1').count()
one_rest_str
According to our data, there are 238 streets with only one establishment and 227 streets with exactly one restaurant. In other words, if a street has only one establishment, it will most likely be a restaurant.
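A rough estimate of that share (a sketch; note it is approximate, since a street with exactly one restaurant may also host establishments of other types):
#approximate share of single-establishment streets whose venue is a restaurant
n_single = (streets_by_est['establishments'] == 1).sum()
n_single_rest = (streets_by_restaurants['establishments'] == 1).sum()
print('{:.0%}'.format(n_single_rest / n_single))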
First we will look at the distribution of the number of establishments per street to define what counts as a "street with a lot of restaurants".
streets_by_est[['establishments', 'avg_seats']].describe()
#Calculating the 85th, 95th and 99th percentiles for the number of establishments per street.
np.percentile(streets_by_est['establishments'], [85, 95, 99])
#plotting the distribution
plt.figure(figsize=(9, 5))
sns.histplot(streets_by_est['establishments'], kde=True, stat='density')
plt.title('Number of establishments per street')
plt.show()
We can see that the middle 50% of streets have 1-9 establishments. Only 15% of streets have over 23 establishments and only 5% have over 90. To define the point that means "a lot" we will perform additional calculations.
We will measure how popular a street is relative to the mean number of establishments registered there. To do that, we will calculate the z-score for the distribution: a numerical measurement of a value's relationship to the mean of a group of values. If the z-score is 0, the number of establishments on the street is identical to the mean value.
#calculating the z-score for the number of establishments per street
streets_by_est['establishments_z']=(streets_by_est['establishments']-streets_by_est['establishments'].mean())/streets_by_est['establishments'].std()
streets_by_est= streets_by_est.sort_values(by='establishments', ascending=False).reset_index(drop=True)
streets_by_est
Below is a chart around the breaking point. There are many more streets on both sides; this chart just illustrates the idea of z-score ranking.
#distinguishing z-score by colors: red for negative values, blue for positive ones
streets_by_est_z = streets_by_est.query('-0.1 <=establishments_z<=0.1').sort_values(by='establishments', ascending=False).reset_index(drop=True)
streets_by_est_z['colors'] = ['red' if x<0 else 'blue' for x in streets_by_est_z['establishments_z']]
plt.figure(figsize=(14,10))
plt.hlines(y=streets_by_est_z.street_name, xmin=0, xmax=streets_by_est_z.establishments_z, colors=streets_by_est_z.colors, alpha=0.4, linewidth=10)
plt.title('Defining streets with a lot of restaurants')
plt.xlabel("z-score", size=14)
plt.show()
Now that we have defined the breaking point (z = 0; establishments = 22), we will form a new data frame containing only the streets with a lot of restaurants.
#filtering out streets with many restaurants:
streets_many_rest = streets_by_est.query('establishments_z>0')
streets_many_rest.head()
Now we will move on to studying the distribution of seats.
streets_many_rest.describe()
#Calculating the 85th, 95th and 99th percentiles for the average number of seats per street.
np.percentile(streets_many_rest['avg_seats'], [85, 95, 99])
sns.histplot(streets_many_rest['avg_seats'], bins=10, kde=True, stat='density')
plt.title('Distribution of the number of seats in restaurants on the highly competitive streets')
plt.show()
The general distribution of the number of seats looks as follows: the average is 26; the middle 50% of restaurants have from 21 to 31 seats; the maximum is 100 seats. Only 5% of restaurants have more than 36 seats. There are no restaurants with fewer than 14 seats on the popular streets.
Now we will use the qcut() method to split the streets into 5 equal-sized groups.
pd.qcut(streets_many_rest['establishments'], 5)
#splitting streets into 5 categories (bin edges taken from the qcut output above)
def cat(est):
    if est <= 26:
        return '19-26'
    if est <= 42:
        return '27-42'
    if est <= 74:
        return '43-74'
    if est <= 118:
        return '75-118'
    return 'over 118'

streets_many_rest['establishments_number'] = streets_many_rest['establishments'].apply(cat)
streets_many_rest.head()
Now we have 5 groups of streets, and we will investigate the distribution of seats for each group.
plt.figure(figsize=(22,8))
sns.stripplot(x="establishments_number", y="avg_seats", data=streets_many_rest)
plt.title('Scatter plot of average seats per street-size group', size=14)
plt.show()
plt.figure(figsize=(22,8))
sns.violinplot(x="establishments_number", y="avg_seats", data=streets_many_rest, palette='rainbow')
plt.show()
There is no clear difference between the groups in the number of seats. The biggest difference is between the restaurants on the streets with 19-26 establishments (a distribution with a long right tail and more outliers; the largest restaurants are in this category) and the most popular streets with over 118 restaurants (a tighter distribution, from 22 to 42 seats).
The streets with 43-74 establishments tend to have restaurants with above-average capacity (28-34 seats).
est_seats = streets_many_rest.drop(['establishments_z'], axis=1)  #dropping the extra column
g = sns.pairplot(est_seats, hue='establishments_number')
g.fig.suptitle('Paired distributions of average number of seats on different streets', y=1.02)
plt.show()
#pair plot without the 3rd parameter
est_seats1 = streets_many_rest.drop(['establishments_z', 'establishments_number'], axis=1)
g = sns.pairplot(est_seats1)
g.fig.suptitle('Paired distributions of average number of seats on different streets', y=1.02)
plt.show()
The paired distributions confirm that the more establishments there are on a street, the closer they are to an average size.
The smallest and the largest restaurants are on the streets with 19-26 establishments.
This can be viewed as a general trend.
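This trend can be quantified (a sketch: if it holds, the spread of avg_seats should shrink as the street-size group grows):
#mean and spread of average seats within each street-size group
streets_many_rest.groupby('establishments_number')['avg_seats'].agg(['mean', 'std'])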
We have looked into the data on restaurants in LA and found the following:
- 75% of establishments are restaurants and 11% are fast food; bakeries are the least popular type of establishment
- 38% of establishments belong to a chain
- the market structure differs for chains and non-chains (see the charts above)
- lots of chains have only 1 establishment in LA
- a typical profile for a chain in LA would be 1-2 branches with 26 seats on average
- the average number of seats across all types of establishments varies from 18 to 29 seats
- based on the addresses of the establishments we have identified the most popular streets: SUNSET, WILSHIRE, PICO, WESTERN, FIGUEROA, OLYMPIC, VERMONT, MONICA, 3RD, HOLLYWOOD (see the chart above)
- there are 238 streets with only one establishment; these streets are not as competitive as the top ones, but they still have their customer base and formed demand
- we have defined the streets with lots of restaurants using the z-score breaking point (the mean value, about 22 establishments per street)
- on the popular streets a restaurant has 26 seats on average (the middle 50% have from 21 to 31 seats), the maximum is 100 seats, and only 5% of restaurants have over 36 seats
- the paired distributions show that the more establishments there are on a street, the closer they are to an average size; among the popular streets, the smallest and the largest restaurants are on the streets with 19-26 establishments (see the charts above)
- the average number of seats is higher in chain bakeries, cafes and fast food; for restaurants the average number of seats is approximately the same in chains and non-chains
General conclusion: the LA restaurant market is a highly competitive environment.
Based on the previous analysis we can offer 2 main strategies.
Strategy that requires more investment:
- open an original (non-chain) restaurant on one of the most popular streets of LA. These are areas with high effective demand and high foot traffic.
- type of establishment: restaurant. Most of the establishments there are restaurants, so the area has an already-formed demand and customers have certain expectations when visiting these streets.
- number of seats: 20-26, as this is the average for most types of establishments in the area. Most likely this is the optimal balance between occupancy and payback of the premises.
- non-chain, with an emphasis on the unique experience of the place.
Strategy that requires less investment:
- open a restaurant in a less popular area of LA with fewer restaurants. For example, it could be the university area (the dynamics, philosophy, vibe and open-mindedness of the area suit the idea of an innovative robot-run cafe without human labor) plus the expo area (a large flow of people and a dynamic, innovation-driven environment).
- a relatively low-competition area, with only 2 fast-food eateries.
- type of establishment: fast food. It is cheaper to operate, no high-class Michelin chefs are needed, and it suits the dynamics of the area. People are already used to a certain type of food and have similar expectations (fast food).
- develop a chain:
- additional income from selling franchise licenses,
- economies of scale
- number of seats: from 24 and up. Start near the lower point (the average for chains in LA is 26 seats), but there should be an opportunity to enlarge the capacity of the premises in the future. Presumably the rent in such an area is lower than on the most popular streets, so the restaurant should operate on a larger scale and benefit from economies of scale.
To make the final decision and choose a strategy we need additional information and thorough calculations. In the current circumstances of the Covid pandemic, the idea of a robot-run eatery with no people involved looks very timely.