Big Mart Sales Prediction

DataMunglers

19/12/2017

Summary of Big Mart Dataset

## Observations: 8,523
## Variables: 12
## $ Item_Identifier           <fctr> FDA15, DRC01, FDN15, FDX07, NCD19, ...
## $ Item_Weight               <dbl> 9.300, 5.920, 17.500, 19.200, 8.930,...
## $ Item_Fat_Content          <fctr> Low Fat, Regular, Low Fat, Regular,...
## $ Item_Visibility           <dbl> 0.016047301, 0.019278216, 0.01676007...
## $ Item_Type                 <fctr> Dairy, Soft Drinks, Meat, Fruits an...
## $ Item_MRP                  <dbl> 249.8092, 48.2692, 141.6180, 182.095...
## $ Outlet_Identifier         <fctr> OUT049, OUT018, OUT049, OUT010, OUT...
## $ Outlet_Establishment_Year <int> 1999, 2009, 1999, 1998, 1987, 2009, ...
## $ Outlet_Size               <fctr> Medium, Medium, Medium, , High, Med...
## $ Outlet_Location_Type      <fctr> Tier 1, Tier 3, Tier 1, Tier 3, Tie...
## $ Outlet_Type               <fctr> Supermarket Type1, Supermarket Type...
## $ Item_Outlet_Sales         <dbl> 3735.1380, 443.4228, 2097.2700, 732....

Data Manipulation

Reorganize Item Fat Content Column

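# Count each Item_Fat_Content label and its share of all rows.
# Inside the pipe, `.` is the whole data frame, so nrow(.) is the total row count.
train %>%
  group_by(Item_Fat_Content) %>%
  summarise(Count = n(), Perc = round(n() / nrow(.) * 100, 2)) %>%
  arrange(desc(Count))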
## # A tibble: 5 x 3
##   Item_Fat_Content Count  Perc
##             <fctr> <int> <dbl>
## 1          Low Fat  5089 59.71
## 2          Regular  2889 33.90
## 3               LF   316  3.71
## 4              reg   117  1.37
## 5          low fat   112  1.31
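
The table suggests that `LF` and `low fat` are variant spellings of `Low Fat`, and `reg` of `Regular`. A minimal base R sketch of such a clean-up follows, assuming these are the only merges intended. The small `train` data frame and the `fat_map` lookup are made up for illustration and stand in for the real data.

```r
# Toy stand-in for the real training data (values are illustrative only).
train <- data.frame(
  Item_Fat_Content = factor(c("Low Fat", "Regular", "LF", "reg", "low fat"))
)

# Map each variant spelling to its canonical label.
fat_map <- c(
  "Low Fat" = "Low Fat", "LF" = "Low Fat", "low fat" = "Low Fat",
  "Regular" = "Regular", "reg" = "Regular"
)

# Index the lookup vector by the original labels, then rebuild the factor
# so that only the two canonical levels remain.
train$Item_Fat_Content <- factor(
  unname(fat_map[as.character(train$Item_Fat_Content)]),
  levels = c("Low Fat", "Regular")
)

table(train$Item_Fat_Content)
```

Applied to the counts in the table above, this merge would give 5,517 Low Fat rows (5,089 + 316 + 112) and 3,006 Regular rows (2,889 + 117).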

Distribution of Item Fat Content Column After Cleaning

Create Two New Columns From Item Identifier

For example, FDA15 is split into FD and 15.

## [1] FDA15 DRC01 FDN15 FDX07 NCD19 FDP36
## 1559 Levels: DRA12 DRA24 DRA59 DRB01 DRB13 DRB24 DRB25 DRB48 DRC01 ... NCZ54
## 
##    DR    FD    NC 
##  1317 10201  2686
##  [1] 15  1 15  7 19 36 10 10 17 28  7  3 32 46 32 49 42 49 11  2

Distribution of the new string column and some values from the new numerical column.
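
The split of `Item_Identifier` can be sketched in base R as follows. This is an illustration, not necessarily the code used for the slides. It assumes every identifier follows the five-character layout seen in the output (two letters, one letter, two digits); the vector `ids` is a hypothetical name.

```r
# First six identifiers from the output above.
ids <- c("FDA15", "DRC01", "FDN15", "FDX07", "NCD19", "FDP36")

# Characters 1-2 give the category code; characters 4-5 give the number.
# Assumes the fixed 5-character layout described in the lead-in.
Item_Identifier_Str2 <- factor(substr(ids, 1, 2))
Item_Identifier_Num  <- as.numeric(substr(ids, 4, 5))

table(Item_Identifier_Str2)
Item_Identifier_Num
```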

Looking at the Item Type Column by the New String Column

Data Visualization

Looking at Outlet Identifier Sales by Outlet Type

Item Outlet Sales Distribution

Item Outlet Sales vs Item Type

Looking at Scatter Plots between Numerical Data Columns and Outlet Sales

Converting Categorical Columns to Numerical Columns

## Observations: 8,523
## Variables: 20
## $ Item_Weight                     <dbl> 9.300, 5.920, 17.500, 19.200, ...
## $ Item_Visibility                 <dbl> 0.016047301, 0.019278216, 0.01...
## $ Item_MRP                        <dbl> 249.8092, 48.2692, 141.6180, 1...
## $ Outlet_Size_                    <int> 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, ...
## $ Outlet_Size_High                <int> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, ...
## $ Outlet_Size_Medium              <int> 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, ...
## $ Outlet_Size_Small               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ `Outlet_Location_Type_Tier 1`   <int> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...
## $ `Outlet_Location_Type_Tier 2`   <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ `Outlet_Location_Type_Tier 3`   <int> 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, ...
## $ `Outlet_Type_Grocery Store`     <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...
## $ `Outlet_Type_Supermarket Type1` <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, ...
## $ `Outlet_Type_Supermarket Type2` <int> 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, ...
## $ `Outlet_Type_Supermarket Type3` <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...
## $ Item_Outlet_Sales               <dbl> 3735.1380, 443.4228, 2097.2700...
## $ Item_Identifier_Str2_DR         <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Item_Identifier_Str2_FD         <int> 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, ...
## $ Item_Identifier_Str2_NC         <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...
## $ Item_Identifier_Num             <dbl> 15, 1, 15, 7, 19, 36, 10, 10, ...
## $ Outlet_Age                      <dbl> 14, 4, 14, 15, 26, 4, 26, 28, ...
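
The conversion can be sketched in base R as follows: each categorical level becomes a 0/1 indicator column named `<column>_<level>`, and `Outlet_Age` is derived from the establishment year. The reference year 2013 is an assumption inferred from the output (1999 gives 14, 2009 gives 4). The toy `outlets` data frame and the helper names are hypothetical.

```r
# Toy rows modelled on records shown in the output above.
outlets <- data.frame(
  Outlet_Size = factor(c("Medium", "", "High", "Small")),
  Outlet_Establishment_Year = c(1999L, 1998L, 1987L, 2009L)
)

# One 0/1 indicator column per factor level, named <column>_<level>.
# The empty level "" (missing size) becomes the column "Outlet_Size_".
size_levels <- levels(outlets$Outlet_Size)
dummies <- sapply(size_levels, function(lv) as.integer(outlets$Outlet_Size == lv))
colnames(dummies) <- paste0("Outlet_Size_", size_levels)

encoded <- as.data.frame(dummies)

# Assumption: age is measured relative to 2013, which matches the output
# above (1999 -> 14, 2009 -> 4).
encoded$Outlet_Age <- 2013 - outlets$Outlet_Establishment_Year

encoded
```

Keeping an indicator for every level, as the output above does, makes the columns of each group sum to 1. This matters for a linear model with an intercept, where one indicator per group is redundant.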

Correlation Matrix of the Converted Dataset

Modeling

Summary of Linear Regression on Bulk Dataset

## 
## Call:
## lm(formula = Item_Outlet_Sales ~ ., data = splitted_train_simple)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3827.6  -672.3   -90.3   573.5  7927.6 
## 
## Coefficients: (4 not defined because of singularities)
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      2.184e+03  3.867e+02   5.648 1.70e-08 ***
## Item_Weight                      6.432e-02  3.438e+00   0.019   0.9851    
## Item_Visibility                 -2.525e+02  3.154e+02  -0.801   0.4234    
## Item_MRP                         1.556e+01  2.339e-01  66.517  < 2e-16 ***
## Outlet_Size_                    -1.346e+02  5.447e+01  -2.472   0.0135 *  
## Outlet_Size_High                 4.615e+02  3.023e+02   1.527   0.1269    
## Outlet_Size_Medium               3.940e+00  6.631e+01   0.059   0.9526    
## Outlet_Size_Small                       NA         NA      NA       NA    
## `Outlet_Location_Type_Tier 1`    1.649e+02  1.839e+02   0.897   0.3699    
## `Outlet_Location_Type_Tier 2`    8.156e+01  1.201e+02   0.679   0.4969    
## `Outlet_Location_Type_Tier 3`           NA         NA      NA       NA    
## `Outlet_Type_Grocery Store`     -3.532e+03  2.107e+02 -16.762  < 2e-16 ***
## `Outlet_Type_Supermarket Type1` -1.930e+03  3.484e+02  -5.540 3.16e-08 ***
## `Outlet_Type_Supermarket Type2` -2.458e+03  3.038e+02  -8.091 7.12e-16 ***
## `Outlet_Type_Supermarket Type3`         NA         NA      NA       NA    
## Item_Identifier_Str2_DR          2.824e+01  5.838e+01   0.484   0.6286    
## Item_Identifier_Str2_FD          5.130e+01  3.753e+01   1.367   0.1717    
## Item_Identifier_Str2_NC                 NA         NA      NA       NA    
## Item_Identifier_Num              3.372e+00  8.418e-01   4.006 6.26e-05 ***
## Outlet_Age                      -2.942e+01  1.239e+01  -2.375   0.0176 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1123 on 5950 degrees of freedom
## Multiple R-squared:  0.5631, Adjusted R-squared:  0.562 
## F-statistic: 511.2 on 15 and 5950 DF,  p-value: < 2.2e-16
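
The note "4 not defined because of singularities" is what `lm` reports when predictors are exactly collinear. Here each of the four categorical groups (`Outlet_Size_`, `Outlet_Location_Type_`, `Outlet_Type_`, `Item_Identifier_Str2_`) has an indicator for every level, so the indicators in a group sum to 1 and duplicate the intercept. The last indicator of each group comes back as `NA`. The sketch below reproduces the effect on made-up data; `toy`, `fit_all` and `fit_drop` are hypothetical names.

```r
set.seed(42)  # reproducible toy data, not the Big Mart data
grp <- factor(rep(c("DR", "FD", "NC"), each = 20))
toy <- data.frame(
  y  = rnorm(60) + as.integer(grp),
  DR = as.integer(grp == "DR"),
  FD = as.integer(grp == "FD"),
  NC = as.integer(grp == "NC")
)

# All three indicators plus the intercept: DR + FD + NC equals 1 in every
# row, so the last indicator cannot be estimated and is reported as NA.
fit_all <- lm(y ~ ., data = toy)
coef(fit_all)

# Dropping one indicator per group removes the redundancy; NC becomes the
# reference level absorbed by the intercept.
fit_drop <- lm(y ~ DR + FD, data = toy)
coef(fit_drop)

# Both fits give the same predictions.
all.equal(unname(fitted(fit_all)), unname(fitted(fit_drop)))
```

The `NA` rows therefore do not change the fitted values. They only indicate that one level per group acts as the baseline.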

Regression Formulas on the Numerical Dataset (Predictors With p < 0.05)

## Item_Outlet_Sales ~ Item_MRP + Outlet_Size_ + `Outlet_Type_Grocery Store` + 
##     `Outlet_Type_Supermarket Type1` + `Outlet_Type_Supermarket Type2` + 
##     Item_Identifier_Num + Outlet_Age
## log10(Item_Outlet_Sales) ~ Item_MRP + Outlet_Size_ + Outlet_Size_High + 
##     `Outlet_Location_Type_Tier 1` + `Outlet_Location_Type_Tier 2` + 
##     `Outlet_Type_Grocery Store` + `Outlet_Type_Supermarket Type1` + 
##     `Outlet_Type_Supermarket Type2` + Item_Identifier_Num + Outlet_Age
## sqrt(Item_Outlet_Sales) ~ Item_MRP + Outlet_Size_ + `Outlet_Type_Grocery Store` + 
##     `Outlet_Type_Supermarket Type1` + `Outlet_Type_Supermarket Type2` + 
##     Item_Identifier_Num + Outlet_Age
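
The terms in the first formula match the predictors with p < 0.05 in the regression summary. One way to build such a formula is sketched below: read the p-values from the coefficient table and keep the terms under the threshold. This is an assumption about the procedure, not the code behind the slides. The toy data are constructed so that `x2` has exactly no effect.

```r
# Toy data built so that x1 drives y and x2 has no effect at all
# (the noise term is orthogonal to the intercept and both predictors).
toy <- data.frame(
  x1 = 1:8,
  x2 = c(-1, 1, -1, 1, -1, 1, -1, 1)
)
noise <- c(1, 1, -1, -1, -1, -1, 1, 1)
toy$y <- 2 * toy$x1 + noise

full_fit <- lm(y ~ ., data = toy)

# Coefficient table: one row per estimated term, p-values in "Pr(>|t|)".
coefs <- summary(full_fit)$coefficients
keep  <- rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]
keep  <- setdiff(keep, "(Intercept)")

# Rebuild a formula from the surviving terms and refit.
reduced_formula <- reformulate(keep, response = "y")
reduced_fit     <- lm(reduced_formula, data = toy)
reduced_formula
```

For the transformed models, the same selection would start from a full fit with `log10(Item_Outlet_Sales)` or `sqrt(Item_Outlet_Sales)` as the response, which would explain why the log10 formula keeps a different set of terms.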

Formulas of RMSE (Root Mean Square Error) and MAE (Mean Absolute Error)

## [1] "MAE"
## function(actual, predicted){mean(abs(actual - predicted))}
## [1] "RMSE"
## function(actual, predicted)  {sqrt(mean((actual - predicted)^2))}
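
A small worked example of the two functions on made-up numbers. RMSE squares the errors before averaging, so it weights large errors more heavily and is never smaller than MAE.

```r
MAE  <- function(actual, predicted) mean(abs(actual - predicted))
RMSE <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Made-up values for illustration; the errors are 1, 0 and -2.
actual    <- c(3, 5, 2)
predicted <- c(2, 5, 4)

MAE(actual, predicted)   # (1 + 0 + 2) / 3 = 1
RMSE(actual, predicted)  # sqrt((1 + 0 + 4) / 3), about 1.29
```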

Decision Tree of the Data Not Used in the 3 Models (p >= 0.05)

Summary of Models
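
A hedged sketch of how the three response forms (plain, log10, sqrt) could be compared on held-out rows follows. It uses synthetic data, not the Big Mart data, and a 70/30 split, which matches the roughly 5,966 training rows implied by the residual degrees of freedom in the regression summary. Predictions from the transformed models are back-transformed so that RMSE and MAE are on the original sales scale. All object names are hypothetical.

```r
set.seed(123)  # synthetic data, not the Big Mart data
n <- 500
Item_MRP <- runif(n, 30, 270)
Item_Outlet_Sales <- (5 + 0.15 * Item_MRP + rnorm(n, sd = 2))^2
toy <- data.frame(Item_MRP, Item_Outlet_Sales)

# 70/30 split into fitting and hold-out rows.
train_idx <- sample(n, size = 0.7 * n)
fit_rows  <- toy[train_idx, ]
test_rows <- toy[-train_idx, ]

MAE  <- function(actual, predicted) mean(abs(actual - predicted))
RMSE <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# The three response forms: plain, log10 and sqrt.
fits <- list(
  linear = lm(Item_Outlet_Sales ~ Item_MRP, data = fit_rows),
  log10  = lm(log10(Item_Outlet_Sales) ~ Item_MRP, data = fit_rows),
  sqrt   = lm(sqrt(Item_Outlet_Sales) ~ Item_MRP, data = fit_rows)
)

# Inverse of each response transformation.
back <- list(
  linear = function(p) p,
  log10  = function(p) 10^p,
  sqrt   = function(p) p^2
)

# Score every model on the hold-out rows, on the original sales scale.
results <- do.call(rbind, lapply(names(fits), function(nm) {
  pred <- back[[nm]](predict(fits[[nm]], newdata = test_rows))
  data.frame(
    model = nm,
    RMSE  = RMSE(test_rows$Item_Outlet_Sales, pred),
    MAE   = MAE(test_rows$Item_Outlet_Sales, pred)
  )
}))
results
```

Comparing errors on the transformed scale would not be meaningful across models, which is why each prediction is mapped back before scoring.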