Restaurant Recommendation Challenge

Authored by: Vinitha Racharla

Hello everyone! Today let us learn about the Restaurant Recommendation Challenge: what it actually is and how it can benefit customers.

This blog is authored by Vinitha Racharla, a student at the University of Hyderabad.

INTRODUCTION:

Recommendation systems help users or customers find the items they require. In this era, there are lots of items available on the web. A recommendation system makes it easy for the customer to find or select an item they are interested in. Given certain data or information about the user, the system predicts and recommends appropriate items to the user.

        Here, the problem is about classifying a restaurant for a customer, that is, we need to find whether a given restaurant can be recommended to the customer or not. Every time we want to eat, the major concern is what to eat and "where to eat". This system makes it easy for the user or customer to select a restaurant they like, given the kind of cuisine they prefer, the rating of the restaurant, the ambience and many other features.

         The impact of solving this problem is huge, because it makes it easy for a user to select a restaurant with all the features (ambience, cost, ratings, etc.) they want. It also improves the restaurant's business, since the restaurant can change or improve its offerings according to its customers' needs.

References:

  1. Pragya Shrestha & Salu Khadka, August 2017, "Restaurant Recommendation and Classification Model". The analysis was based on the Yelp dataset available on Kaggle; in this paper the classification was done using the restaurant reviews.
  2. Theo Jeremiah, Nov 29, 2019, "How to Build a Restaurant Recommendation System Using Latent Factor Collaborative Filtering". Here too, the analysis was based on the Yelp restaurant data.
  3. Molly Liebeskind, Mar 27, 2020, "A Simple Approach to Building a Recommendation System", leveraging the Surprise package to build a collaborative filtering recommender in Python.
  4. D. S. Gaikwad, Anup V. Deshpande, Nikhil N. Nalwar, Manoj V. Katkar, Aniket V. Salave, "Food Recommendation System".

PHASE-1:  LITERATURE SURVEY AND  DATA ACQUISITION

Dataset:

        Here, we have the whole dataset needed for the restaurant recommendation or classification problem, obtained from Kaggle ( https://www.kaggle.com/mrmorj/restaurant-recommendation-challenge ). The dataset contains different files: train_full, test_full, train_locations, test_locations, train_customers, test_customers, orders and vendors.
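Below is a minimal sketch of loading these files with pandas for a first look. The file names follow the dataset description above, but the .csv extension and the paths are assumptions; adjust them to wherever the Kaggle files were extracted.

```python
# Minimal sketch: load the challenge files with pandas (paths are assumptions).
import pandas as pd

train_customers = pd.read_csv("train_customers.csv")
train_locations = pd.read_csv("train_locations.csv")
orders = pd.read_csv("orders.csv")
vendors = pd.read_csv("vendors.csv")
train_full = pd.read_csv("train_full.csv")

print(train_full.shape)                     # rows and columns of the merged file
print(train_full.dtypes.value_counts())     # how many object / int / float columns
```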

Features in the dataset:

  • Customer_id: The id of the customer.
  • Gender: The gender of the customer. About 92% of the rows are male and the remaining 8% are female.
  • Status_x and verified_x: The status of the customer account and whether it is verified.
  • Created_at_x and updated_at_x: The date and time when the account was created and when it was last updated.
  • location_number and location_type: The number of the customer's location and its type (work, home, other). Most customers have more than one or two locations, and as per the analysis, home is the most common location type.
  • Latitude_x and longitude_x: The latitude and longitude of the customer's location.
  • Id: The id of the vendor.
  • Latitude_y and longitude_y: The latitude and longitude of the vendor, analogous to the customer's coordinates.
  • Vendor_category_en: The category of the vendor (restaurant, sweets and bakes, etc.).
  • Vendor_category_id: The id of the vendor's category.
  • Delivery charge: The cost of delivery. Most vendors charge 0.7 for delivery and the others charge nothing; nearly 60% of the orders are charged for delivery.
  • Is_open: Whether the restaurant is open; 1 if it is open, otherwise 0.
  • Openingtime: The timings when the restaurant is available.
  • Preparation time: The time it takes to prepare an order. The analysis shows that most vendors take approximately 10-15 minutes to prepare an item.
  • Discount_percentage: The percentage of discount given to the customer. From the analysis, most orders do not get any discount.
  • Rank: The rank of the restaurant, consisting mostly of the values 1 and 11.
  • Vendor_rating: The rating of the vendor.
  • The availability timings of the vendor, given for each specific day.
  • Vendor_tag_name: The different items which are available.
  • cid x loc_num x vendor: The customer id, location number and vendor number combined.
  • Payment_mode: The mode of payment used. Approximately 77% of the customers preferred the second payment mode and around 16% preferred the first.
  • Dob: The year of birth. This column contains many null values; around 32,000 rows are null.

The dataset contains four data types: integer, Boolean, float and string.

The whole dataset is about 4 GB in size. On a machine with a large amount of RAM the data can be processed easily, but on a machine with low RAM, such as 4 GB or 8 GB, it is hard to process.

      We have tools like Dask which can be used to process large amounts of data. The dataset also contains a large number of null values, so certain techniques have to be used to fill them.

     Pandas can also be used to process the data. As the data is in the form of tables, pandas can be used to perform operations on it.
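As an illustration of the point about Dask, here is a small sketch (the file name and this particular usage are assumptions, not the author's exact code) that reads the large training file lazily and counts nulls without loading everything into memory at once.

```python
# Minimal sketch: process the large CSV with Dask instead of plain pandas.
import dask.dataframe as dd

ddf = dd.read_csv("train_full.csv", assume_missing=True)  # lazy, chunked read

# Nothing is computed until .compute() is called.
null_counts = ddf.isnull().sum().compute()
print(null_counts.sort_values(ascending=False).head(10))  # columns with most nulls
```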

Data Acquisition:

         For this project, we acquired the data from Kaggle, which hosts an ocean of datasets. We currently live in an era of information and are surrounded by heaps of data in many forms: blogs, reviews, websites like Kaggle, and even research papers. Internet usage has increased drastically since the 2000s, the flow of information has increased, and this has made it easier to collect or acquire data. Websites and apps like Instagram, Facebook and even WhatsApp help people share information with one another.

          So, data can be acquired in many ways, for example by scraping websites or from platforms like Kaggle.

Performance Metric:

The F1 score is one of the appropriate key metrics (performance metrics) that can be used for a binary classification problem. It is defined as the harmonic mean of precision and recall.

            F1 Score = 2 * (precision * recall) / (precision + recall)

Accuracy could also be used for this problem, but since the data is imbalanced, accuracy is not a suitable metric for this dataset.

Pros: F1 Score can be used for imbalanced sets also.

Cons: F1 score is not interpretable.

F1 Score is mainly used in binary classification problems.
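As a small illustration of the metric, the sketch below computes precision, recall and the F1 score with scikit-learn on toy labels (the labels are made up purely for illustration).

```python
# Minimal sketch: F1 score as the harmonic mean of precision and recall.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
print("precision:", p, "recall:", r)
print("F1 by hand:", 2 * p * r / (p + r))
print("F1 sklearn:", f1_score(y_true, y_pred))
```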

The biggest real-world constraint here is that many of the features have null values. Null values make it harder to solve the problem and meet the requirements.

The most important requirement is low latency together with good recommendation accuracy. The recommendation should be generated very quickly, within seconds, and it should meet the customer's requirements.

PHASE-2 : EDA AND FEATURE EXTRACTION

Dataset level and output variable analysis:

              The dataset available for the problem is in the form of a dataframe. The shape of the dataset is (5802400, 73), i.e. there are 5,802,400 rows and 73 columns. Out of the 73 columns, 45 are of object datatype, 12 are of integer type and the remaining 16 are of float type.

             In the dataset, the output variable is 'target'. The target variable is Boolean: if the restaurant is recommended to the customer the target is 1, otherwise it is 0. The target (output) variable is very imbalanced.

             Approximately 95% or more of the rows have a target value of 0 and only about 5% have a target value of 1, which shows how imbalanced the dataset is. That is why we do not use accuracy as our key metric; instead, we use the F1 score or log loss as the performance metric.
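A quick way to verify the shape, the dtypes and the target imbalance described above is sketched below (the file path and the column name "target" are taken from the description; treat them as assumptions).

```python
# Minimal sketch: check shape, dtypes and target imbalance.
import pandas as pd

train_full = pd.read_csv("train_full.csv")          # path is an assumption

print(train_full.shape)                             # expected to be (5802400, 73)
print(train_full.dtypes.value_counts())             # object / int / float counts

# Percentage of 0s and 1s in the output variable.
print(train_full["target"].value_counts(normalize=True) * 100)
```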

Univariate Feature Analysis:

             Univariate feature analysis is simply the analysis of the data one variable at a time. As the name suggests, 'uni' means one, so it means analyzing each variable or feature in the dataset individually.

             The main objective of univariate analysis is to summarize the data and analyze the pattern of each feature in the dataset.

There are many ways of performing univariate analysis, and some of them are:

  1. Statistical methods, such as finding the central tendency, finding the dispersion, the max and min values, the count, and the 25th, 50th and 75th percentiles, and so on.
  2. Data visualization.

          The analysis becomes much easier to understand with visualization techniques; it is the most commonly recommended way of analyzing a feature.

We have several libraries, like seaborn and matplotlib, for visualizing the data in the dataset.

        For univariate analysis, we can use the mean or the median to measure the central tendency, and the mean absolute deviation, the median absolute deviation or the IQR (interquartile range) to measure the dispersion.

But since the data is skewed and contains outliers, the mean and the mean absolute deviation are not good choices, because both are affected by outliers. So the better methods to use are the median and the median absolute deviation.

          MEDIAN: The median is the middle value of the dataset when the values are sorted in ascending or descending order.

                 Therefore, if N is the number of values in the data:

  • Median = the middle value if N is odd, i.e. the (N+1)/2-th value.
  • Median = the arithmetic mean of the two middle values if N is even, i.e. the mean of the (N/2)-th and the (N/2 + 1)-th values.

Median, Pros: It is not affected by outliers.

                Cons: It only looks at the middle of the series, i.e. it does not take all the information in the data into account.

The median absolute deviation is defined as the median of the absolute deviations of the data points from the median.

          Median Absolute Deviation (MAD) = Median[ |Xi - M| ], where Xi is a data point and M is the median of the data points.

MAD, Pros:

  • It is robust
  • Not affected by the outliers
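To make the median and MAD definitions above concrete, here is a small sketch on made-up numbers (the values are illustrative, not from the dataset).

```python
# Minimal sketch: median and median absolute deviation (MAD) of a numeric column.
import pandas as pd

s = pd.Series([10, 12, 15, 11, 13, 14, 90])   # 90 acts as an outlier

median = s.median()
mad = (s - median).abs().median()             # MAD = median(|Xi - M|)

print("median:", median)                      # 13.0, not pulled towards the outlier
print("MAD:", mad)                            # 2.0
```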

There are many different types of plots available in these libraries for visualization:

  • Dist plot: A dist plot gives the histogram of a given variable. It is one example of univariate analysis.
  • Rug plot: Instead of a full distribution, it draws a small dash mark for every data point. It is also an example of univariate analysis.
  • Count plot: It gives the count, in simple words the number of occurrences, of each value of a categorical variable.

The rug plot is not used very often; the dist plot and the count plot are the ones commonly used for univariate analysis.

         All these plots are available in a library or module called seaborn.
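A short sketch of these univariate plots is given below. The column names and values are illustrative; note that seaborn's older distplot has been replaced by histplot in recent versions, which is what the sketch uses.

```python
# Minimal sketch: dist plot (histogram) and count plot with seaborn.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "location_type": ["Home", "Home", "Work", "Other", "Home", "Work"],
    "vendor_rating": [4.3, 4.0, 3.8, 4.5, 4.1, 3.9],
})

sns.histplot(df["vendor_rating"], kde=True)   # histogram of a numeric feature
plt.show()

sns.countplot(x="location_type", data=df)     # occurrences of each category
plt.show()
```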

Many other different plots also can be used such as:

  1. Bar Charts
  2. Pie Charts
  3. Histograms
  4. Frequency polygons etc…

The pie chart above is for the target variable; it shows that only 1% of the values are 1 and the remaining 99% are 0. This shows how imbalanced the dataset is.

The histogram above is of the feature location_type. There are three values in the location_type column: work, home and other. We can observe from the histogram that most of the values are home, followed by work.

The plot above is a pie chart of the feature 'gender': 92% of the values are male and 8% are female.

Multivariate Feature Analysis:

   Multivariate analysis is defined as the analysis of data involving more than two features or variables, while the analysis of exactly two variables is called bivariate analysis.

     The main objective of multivariate analysis is to find the correlations and patterns among the features in the dataset.

There are many techniques that can be used for multivariate analysis; one of the best is correlation analysis.

         Correlation is a way of understanding the relationship between variables. In simple terms, it shows how the features are related. For any pair of features, we have:

  • Positive correlation: If the correlation between two variables is positive, the variables move or change in the same direction.
  • Neutral correlation: If the correlation is 0 (neutral), the variables do not have any relationship between them.
  • Negative correlation: If the correlation is negative, the variables move or change in opposite directions.

Generally, the most common measure of correlation, i.e. of the linear relationship between two variables, is Pearson's correlation coefficient.

        It is defined as:

            Pearson's correlation coefficient(X, Y) = cov(X, Y) / ( std(X) * std(Y) )

where cov(X, Y) is the covariance between X and Y, X and Y are the two variables, and std(X) and std(Y) are their standard deviations.

Advantages:

  • It helps to determine the strength and direction of the relationship, i.e. whether the variables move in the same or the opposite direction. We can also tell how strong the relationship is from the value of the coefficient obtained.

Disadvantages:

  • It can only be computed for measurable variables, that is, variables which are numeric.
  • It does not reveal the causation behind the linear relationship between the variables; in simple words, it does not say which variable causes the other.

One of the best visualization techniques is the heatmap. There are some other techniques, which include:

  1. Joint plot: It can be used for bivariate analysis. We can also see the linear relationship between two variables from the scatter plot it produces.
  2. Pair plot: This plot provides pairwise scatter plots for the given variables, along with histograms of each variable.
  3. Heat map: Without a visualization technique, it becomes hard to go through every correlation value. A heatmap lets us understand the correlation data easily.

If we want to group some features according to their relationships or similarities, a "cluster map" can be really helpful for visualizing and understanding the data.
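A minimal sketch of the correlation heatmap, assuming the merged training data has been loaded into a DataFrame (the file path is an assumption), is shown below.

```python
# Minimal sketch: Pearson correlation matrix of numeric features as a heatmap.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

train_full = pd.read_csv("train_full.csv")                 # path is an assumption

corr = train_full.select_dtypes(include="number").corr(method="pearson")

plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="viridis")    # lighter cells = higher correlation
plt.show()
```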

The plot above is a joint plot between the target variable and the delivery_charge feature. We can see from the plot that whatever the delivery charge is, 0 or 0.7, the target variable takes both values 0 and 1. This suggests that the delivery charge feature does not affect the target variable.

The plot above, between the target variable and the serving_distance feature, shows that whatever the serving distance is, the target variable takes both values 1 and 0.

Above is a heat map showing the correlation between the features in the dataset. The lightest colors indicate high correlation and the darkest colors indicate very low correlation.

From the plot above we can observe that the target variable takes the value 0 for all location numbers, but it takes the value 1 only up to location number 20; beyond 20 the target is only 0.

Encoding:

                Encoding is simply converting a feature from one form to another. In the dataset, out of 73 columns there are 45 categorical features, i.e. 45 variables of object datatype.

        Since the models we use only understand numeric data, we have to convert the categorical data into numeric data, either integer or float.

        There are two types of categorical data:

  1. Ordinal: Ordinal data is data which can be ordered or arranged, for example excellent, good, bad.
  2. Nominal: Nominal data is data which cannot be ordered or arranged, for example male/female or India, USA, UK.

The different methods that can be used to encode the data are:

  1. Ordinal data:
     • Target guided ordinal encoding: In this method, the categories are ranked based on the mean of their target variable.
     • Label encoding: This method assigns a rank number based on the importance of the category.
  2. Nominal data:
     • One hot encoding.
     • Mean encoding: This method converts each category to the mean of its target (output) variable.

One Hot Encoding:

                This method creates dummy variables for the categorical variables.

To avoid the dummy variable trap we drop one of the columns, so if a variable has 'n' categories we get 'n-1' columns.

        If a categorical variable has many categories, we take only the most frequent categories and perform one hot encoding on them.

           For example, out of n categories we take the most frequent ones, say 9, and one hot encoding is done only on those 9 categories.

       Advantages: It makes the data easy and useful for machine learning models.

Disadvantages: It can lead to the curse of dimensionality.
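Below is a minimal sketch of one hot encoding with pandas, keeping only the most frequent categories of a high-cardinality column. The column names and the choice of 9 top categories are illustrative assumptions.

```python
# Minimal sketch: one hot encoding, with top-n encoding for a high-cardinality column.
import pandas as pd

df = pd.read_csv("train_full.csv")                         # path is an assumption

# High-cardinality column: encode only the 9 most frequent categories.
top_categories = df["vendor_tag_name"].value_counts().index[:9]
for cat in top_categories:
    df[f"vendor_tag_{cat}"] = (df["vendor_tag_name"] == cat).astype(int)

# Low-cardinality columns: get_dummies with drop_first=True avoids the
# dummy variable trap (n categories -> n-1 columns).
df = pd.get_dummies(df, columns=["gender", "location_type"], drop_first=True)
```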

Methods for filling the missing values:

  In the given dataset, there are many null values. These null values have to be filled in to get a good model.

    We have many methods that can be used to fill the missing values, such as:

  • Deleting the column with missing data: In this technique we simply remove the columns from the dataset that contain null values. This method is not recommended, because the dropped column may be an important feature that should not be deleted.
  • Deleting the rows with missing data: In this method we remove the rows that contain null values. This usually works better than deleting a whole column, because a column may carry important information, so deleting the column is not recommended.
  • Filling the missing values (by imputation).
  • Imputation using an additional column: This method adds an extra indicator column and imputes the missing value using an imputer function.
  • Filling with a regression model: In this method, the missing values are predicted using a model like KNN.

Filling the missing values (by imputation):

        This technique uses various methods to impute the missing values in the dataset. Some of the methods are:

  1. Filling the data using the mean or median.
  2. Filling the data using the mode.
  3. Filling a numerical missing value with 0 or any other number not present in the dataset: This method is not very recommendable because it may introduce outliers which could affect our models.
  4. Filling with a new categorical value: This method can only be applied to categorical variables and needs a lot of domain knowledge.

Among these methods, the best ones to use here are imputation with the mean, median or mode.

Mean, median imputation is used for numerical features.

Mode imputation is used for categorical features.
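A small sketch of median and mode imputation with pandas follows (the column names are illustrative assumptions).

```python
# Minimal sketch: median imputation for numeric columns, mode for categorical ones.
import pandas as pd

df = pd.read_csv("train_full.csv")                         # path is an assumption

# Numerical feature: fill with the median, which is robust to outliers.
df["dob"] = df["dob"].fillna(df["dob"].median())

# Categorical feature: fill with the mode (the most frequent value).
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])
```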

PHASE-3:  MODELING AND ERROR ANALYSIS

Introduction:

       The problem we have is a restaurant recommendation challenge. The task is to classify whether a specified restaurant should be recommended to a customer. If the target variable is 0 the restaurant is not recommended, and if the target variable is 1 the specified restaurant is recommended to the customer.

            The dataset provided for our challenge is highly imbalanced: the target variable consists of approximately 99% zeros and only 1% ones.

            Since the challenge is a binary classification problem, we can use baseline models like logistic regression, with the F1 score as our performance metric.

Baseline models:

          A baseline model is simply a model which uses machine learning, randomness or statistics to predict the target variable in the dataset. Baseline models should be simple, and simple models are less likely to overfit. A baseline model should also be interpretable.

   Since the problem is a binary classification problem, logistic regression can be used to predict the target values.

Logistic Regression:

         Logistic regression is a model used for analytics and modeling; it is a predictive model that helps to predict the target variable. Logistic regression, also called the logit model, fits best for problems that involve two possible values for the prediction, i.e. binary classification problems.

           This model can also be applied to image detection, i.e. deciding whether an object is present in an image or not. For example, given an image, logistic regression can be used to predict whether a cat is present or not.

          Logistic regression is used in many fields, like machine learning, and it is also widely used in medical fields. The model can handle binary, ordinal or multinomial outcomes. Binary logistic regression deals with situations where the predicted variable has two possible values; for this problem, it is whether a specific restaurant is recommended to the customer or not.

                  Multinomial logistic regression deals with outcomes of more than two types that are not ordered, whereas ordinal logistic regression deals with ordered outcomes.

Assumptions:

  • The main and important assumption of logistic regression is that the relationship between the explanatory variables and the logit of the response variable is linear.
  • The response variable is binary: logistic regression assumes that there are only two possible outcomes.
  • The observations are independent: it assumes that the observations are independent of each other, i.e. they are not related to each other in any way.
  • There is no multicollinearity among the explanatory variables: there should not be any collinearity between the variables.
  • There are no extreme outliers: logistic regression assumes that there are no extreme outliers in the dataset.
  • The sample size is sufficiently large: it assumes that the sample size is large enough to draw the conclusions we need.

Advantages:

  • It is easy to implement and interpret, and it is very efficient to train.
  • It does not make any assumption about the distribution of the classes.
  • It can easily be extended to multiple classes using multinomial logistic regression.
  • It can provide good accuracy.

            Disadvantages:

  • It may overfit if the number of observations is less than the number of features.
  • The assumption of a linear relationship between the dependent and independent (logit) variables is its major limitation.
  • It can only be used to predict discrete target variables; it does not predict continuous variables.
  • Problems which are not linearly separable cannot be solved well using logistic regression, because its decision surface is linear.
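Below is a minimal sketch of a logistic regression baseline evaluated with the F1 score. A synthetic, heavily imbalanced dataset stands in for the real encoded features here, so the numbers are purely illustrative.

```python
# Minimal sketch: logistic regression baseline on an imbalanced binary problem.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score

# Stand-in for the real encoded/imputed features (99% negatives, 1% positives).
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.99, 0.01], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("F1:", f1_score(y_test, y_pred))
```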

     

Random Forest:

Random forests, also called random decision forests, are ensemble models used for classification, regression and other tasks. An ensemble simply combines multiple models. For a classification task, the output is the class predicted by the most trees; for a regression-like task, the output is the mean or average of the trees' predictions.

Random forests perform very well compared to single decision trees. Random forests can also help rank the variables, i.e. we can learn the importance of the variables in the dataset.

The properties of random forests:

  • Variable importance: Random forests can be used to rank the importance of the variables in a classification or regression problem.

  • Relationship to nearest neighbors: A relationship between random forests and k-nearest neighbors has been established; both can be viewed as weighted neighborhood schemes.

        There is also a kind of random forest called a kernel random forest, which establishes a connection between random forests and kernel methods.

          Since random forest uses an ensembling technique, it helps to look at the two types of ensemble methods:

  1. Bagging.
  2. Boosting: It builds models sequentially, combining them so that we get good accuracy. Algorithms which use boosting include AdaBoost and XGBoost.

  Bagging:

        In bagging, subsets of the training data are created with replacement, and the final output is obtained by majority voting. Random forest uses bagging.

         Bagging is also known as bootstrap aggregation. Bagging selects samples from the dataset, and each sample is called a bootstrap sample. The models are then built from these bootstrap samples, which are drawn from the original dataset with replacement; this is known as row sampling.

         The step of row sampling with replacement is called the bootstrap. Each model is then trained independently on its sample and produces its own result. All the results are combined, and the final output is obtained by majority vote. The step where the results are combined is called aggregation.

Important features or properties of random forest:

  • Diversity: Not all variables or features are considered while making each tree, and each tree is independent of the others.
  • Immune to the curse of dimensionality: As not all the features are considered for each tree, the feature space is reduced.
  • Parallelization: Each tree is built independently from different attributes, so we can make full use of the CPU.
  • Train-test split: There is less need for a separate validation split, since about 30% of the data (the out-of-bag samples) is already not seen by each tree.
  • Stability: Since the result is based on majority voting, the model is stable.

Advantages of Random Forests:

  • It can be used in both classification and regression problems.
  • The problem of overfitting is solved since the output is based on majority voting or average.
  • Even if the data contains missing values, the model still works reasonably well.
  • Since each tree is independent of the others, it has the property of parallelization.
  • It is highly stable.
  • It maintains diversity, since not all the attributes are considered while making each decision tree.
  • It is immune to the curse of dimensionality.
  • We need not hold out a separate validation split, since the out-of-bag samples can serve that purpose.

Disadvantages:

  • Random forest is highly complex when compared to decision trees.
  • Training time is longer when compared to other models because of its complexity.
  • For some complex problems, accuracy is lower than that of gradient boosted trees.
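A minimal sketch of fitting a random forest and reading its variable importances follows; as before, a synthetic imbalanced dataset stands in for the real features.

```python
# Minimal sketch: random forest classifier with F1, log loss and feature importances.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, log_loss

X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.99, 0.01], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

print("F1:", f1_score(y_test, rf.predict(X_test)))
print("log loss:", log_loss(y_test, rf.predict_proba(X_test)))

# Variable importance: rank the features used by the forest.
print(rf.feature_importances_)
```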

        Performance metric:

                  Since the dataset is very imbalanced, accuracy cannot be used as a performance metric. Instead F1 score and log loss can be used as our performance metric.

         F1 score:

                    The F1 score is one of the appropriate key metrics (performance metrics) that can be used for a binary classification problem. It is defined as the harmonic mean of precision and recall.

            F1 Score = 2*(precision*recall) / (precision+recall)

Pros:

       F1 Score can be used for imbalanced sets also.

Cons:

        F1 score is not interpretable.

Log loss:

          Log loss is a classification metric based on probabilities. For any given model, the lower the log loss, the better the predictions.

       Log loss is the negative average of the log of the corrected predicted probabilities for each instance. Log loss is also called binary cross entropy.

              Log loss = -(1/N) * Σ [ yi * log(pi) + (1 - yi) * log(1 - pi) ]

              where N is the number of instances, yi is the true label of instance i and pi is the predicted probability that instance i belongs to class 1.

Pros:

  • It leads to better probabilistic estimation

Cons:

  • Data with long-tailed distributions are modeled poorly.
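The sketch below shows the log loss computed both by hand from the formula above and with scikit-learn, on made-up probabilities.

```python
# Minimal sketch: log loss (binary cross entropy) by hand and with scikit-learn.
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 0])
p_pred = np.array([0.9, 0.2, 0.6, 0.4])   # predicted probability of class 1

manual = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
print("manual: ", manual)
print("sklearn:", log_loss(y_true, p_pred))
```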

PHASE-4 : ADVANCED MODELING AND FEATURE ENGINEERING

Introduction:

           The problem is all about recommending a restaurant to the customer. It is a classification task: given a restaurant, we need to decide whether the restaurant can be recommended to the customer. If it can be recommended the target variable is 1, and if not the target variable is 0.

        After all the exploratory data analysis and featurization, a model is fit to the train data and predictions are then made on the test data.

Performance metrics used:

          Performance metrics like the F1 score and log loss are used to analyze the performance of the models.

Models used for classification:

  1. Logistic Regression
  2. Random Forest
  3. Decision Tree
  4. Support vector classifier
  5. Gradient boosting classifier

There are also many more models that can be used, like KNN, etc. A sketch of training and comparing these models is given below.
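Here is a minimal sketch of training the listed models in one loop and comparing their F1 score and log loss. A synthetic imbalanced dataset again stands in for the real features, and the hyperparameters are illustrative only.

```python
# Minimal sketch: compare several classifiers on F1 score and log loss.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import f1_score, log_loss

X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.99, 0.01], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=8),
    "Random Forest": RandomForestClassifier(n_estimators=200, n_jobs=-1),
    "Gradient Boosting": GradientBoostingClassifier(),
    "SVC": SVC(probability=True),   # probability=True enables predict_proba
}

for name, model in models.items():
    model.fit(X_train, y_train)
    f1 = f1_score(y_test, model.predict(X_test))
    ll = log_loss(y_test, model.predict_proba(X_test))
    print(f"{name}: F1={f1:.3f}, log loss={ll:.3f}")
```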

  1. Logistic Regression

    Logistic regression is an analysis method for classifying an outcome variable given input variables.

       Assumptions:

  • The main and important assumption of logistic regression is that the relationship between the explanatory variables and the logit of the response variable is linear.
  • The response variable is binary: logistic regression assumes that there are only two possible outcomes.
  • The observations are independent: it assumes that the observations are independent of each other, i.e. they are not related to each other in any way.
  • There is no multicollinearity among the explanatory variables: there should not be any collinearity between the variables.
  • There are no extreme outliers: logistic regression assumes that there are no extreme outliers in the dataset.

Analysis of performance of the model:

Confusion matrix, F1 score, accuracy score, log loss:

Disadvantages:

  • It may overfit if the number of observations is less than the number of features.
  • The assumption of a linear relationship between the dependent and independent (logit) variables is its major limitation.
  • It can only be used to predict discrete target variables; it does not predict continuous variables.
  • Problems which are not linearly separable cannot be solved well using logistic regression, because its decision surface is linear.

2. Decision Tree:

             A decision tree is a tool that uses a tree-like model to make decisions about the output variable given the input variables.

Assumptions:

  • At the beginning, the whole training dataset is considered as the root.
  • A statistical method is used for ordering the attributes as the root node or internal nodes.

Analysis of performance of the model:

Confusion matrix, F1 score, accuracy score:

Disadvantages:

  • Unstable nature: A small change in the data can lead to a major change in the structure of the decision tree.
  • Less effective at predicting a continuous outcome: this model is less effective if the outcome variable to be predicted is continuous, as it tends to lose information when categorizing variables.

3. Random Forest:

                    Random forests, also called random decision forests, are ensemble models used for classification, regression and other tasks. An ensemble simply combines multiple models. For a classification task, the output is the class predicted by the most trees; for a regression-like task, the output is the mean or average of the trees' predictions.

                      Random forests perform very well compared to single decision trees. Random forests can also help rank the variables, i.e. we can learn the importance of the variables in the dataset.

Important features or properties of random forest:

  • Diversity: Not all variables or features are considered while making each tree, and each tree is independent of the others.
  • Immune to the curse of dimensionality: As not all the features are considered for each tree, the feature space is reduced.
  • Parallelization: Each tree is built independently from different attributes, so we can make full use of the CPU.
  • Train-test split: There is less need for a separate validation split, since about 30% of the data (the out-of-bag samples) is already not seen by each tree.
  • Stability: Since the result is based on majority voting, the model is stable.

Analysis of performance of the model:

Confusion matrix, F1 score, accuracy score, log loss:

Disadvantages:

  • Random forest is highly complex when compared to decision trees.
  • Training time is longer when compared to other models because of its complexity.
  • For some complex problems, accuracy is lower than that of gradient boosted trees.

4. Gradient boosting classifier:

        It is an ensemble of machine learning models: it combines many weak models to obtain one strong model for prediction.

Analysis of performance of the model:

Confusion matrix, F1 score, accuracy score, log loss:

Advantages:

  • Lots of flexibility.
  • Often requires little data pre-processing.
  • Handles missing data.

Disadvantages:

  • It may overfit, since the models keep improving to reduce the errors.
  • It is computationally expensive.
  • It is less interpretable in nature.

5. SVM (Support Vector Machines):

                It is a supervised machine learning algorithm that is used for both classification and regression problems.

Analysis of performance of the model:

Confusion matrix, F1 score, accuracy score, log loss:

Pros:

  • Works well when there is a clear margin of separation.
  • Effective in high dimensional spaces.
  • Effective in cases where the number of dimensions is greater than the number of samples.
  • It is also memory efficient, since it uses a subset of the training points called support vectors.

Cons:

  • Does not perform well for large datasets, as the required training time is high.
  • Does not perform well when the target classes in the dataset overlap.
  • Probability estimates require an expensive five-fold cross validation.

             Since the dataset is imbalanced, the above models predict 0 in most cases. To prevent this, weighted logistic regression can be used, giving a higher weight to the class which is less frequent.

Weighted Logistic Regression:

        Weighted logistic regression simply gives more weight to the class which is less frequent. It uses the values of y and adjusts the weights inversely proportional to the class frequencies.
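A minimal sketch of weighted logistic regression is given below; in scikit-learn, class_weight="balanced" adjusts the weights inversely proportional to the class frequencies, which matches the idea described above (the synthetic data is only a stand-in for the real features).

```python
# Minimal sketch: weighted logistic regression with balanced class weights.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score

X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.99, 0.01], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

wlr = LogisticRegression(class_weight="balanced", max_iter=1000)
wlr.fit(X_train, y_train)

y_pred = wlr.predict(X_test)
print(confusion_matrix(y_test, y_pred))   # the minority class is now predicted too
print("F1:", f1_score(y_test, y_pred))
```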

Analysis of performance of the model:

Confusion matrix and F1 score:

PHASE 5: DEPLOYMENT AND PRODUCTIONIZATION

               The cloud platform I used for deploying the system is Streamlit. Streamlit is one of the fastest ways to build and share data apps.

               Streamlit is an open-source app framework used to create web pages for machine learning and data science. It offers a free tier where one can deploy their system.

           Spyder is the IDE (Integrated Development Environment) I used for writing the deployment code. It is very easy to use the streamlit library in Spyder compared to other environments.

        The streamlit library is very easy to use for creating web apps and web pages.
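Below is a minimal sketch of what such a Streamlit page could look like; the feature names, the saved model file "model.pkl" and the exact inputs are assumptions for illustration, not the deployed app itself. It would be run with `streamlit run app.py`.

```python
# Minimal sketch: a Streamlit form that feeds user inputs to a saved model.
import pickle

import streamlit as st

st.title("Restaurant Recommendation")

delivery_charge = st.number_input("Delivery charge", value=0.0)
serving_distance = st.number_input("Serving distance", value=5.0)
vendor_rating = st.number_input("Vendor rating", value=4.0)
is_open = st.selectbox("Is the vendor open?", [1, 0])

if st.button("Submit"):
    with open("model.pkl", "rb") as f:          # trained model saved earlier
        model = pickle.load(f)
    features = [[delivery_charge, serving_distance, vendor_rating, is_open]]
    prediction = model.predict(features)[0]
    st.write("Recommended" if prediction == 1 else "Not recommended")
```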

Pros:

  • It is free and open source.
  • It can build apps with just dozens of lines of Python code using a simple API.
  • It works with TensorFlow, Keras, NumPy, seaborn and many more.

Cons:

  • Lack of design flexibility.
  • It may run into speed issues if the application is very large.

Some of the important features from the dataset are selected and used for deploying the system.

 Below is a snapshot of a working demo of the system.

The 10 features above are some of the most important features in the dataset and are used for training and deployment.

    Entering values in the respective boxes and then clicking the submit button gives the output, i.e. whether the restaurant is recommended or not.

System Architecture:

The system latency is very good; it provides the output within a few seconds.
