Our Objective

The main purpose of this project is to predict the store-wide sales of each store for the following year. In addition, finding meaningful insights and clustering the stores based on their percentages of total monthly sales are objectives of this project.

About the Data

We found a data set on Kaggle named Retail Data Analytics, which contains the sales data of a retail company.

There are historical sales data for 45 stores located in different regions - each store contains a number of departments. The company also runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Within the Excel Sheet, there are 3 Tabs – Stores, Features and Sales.

Stores – Anonymized information about the 45 stores, indicating the type and size of each store.

Features – Contains additional data related to the store, department, and regional activity for the given dates:

  • Store - the store number
  • Date - the week
  • Temperature - average temperature in the region
  • Fuel_Price - cost of fuel in the region
  • MarkDown1-5 - anonymized data related to promotional markdowns. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA
  • CPI - the consumer price index
  • Unemployment - the unemployment rate
  • IsHoliday - whether the week is a special holiday week

Sales – Historical sales data, covering 2010-02-05 to 2012-11-01. Within this tab there are the following fields:

  • Store - the store number
  • Dept - the department number
  • Date - the week
  • Weekly_Sales - sales for the given department in the given store
  • IsHoliday - whether the week is a special holiday week [1]
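As noted above, holiday weeks are weighted five times higher in the evaluation, which corresponds to a weighted mean absolute error. Below is a minimal sketch of such a metric; wmae is a hypothetical helper and is not used in the analysis that follows.

# Weighted MAE: holiday weeks get weight 5, all other weeks weight 1 (a sketch)
wmae <- function(actual, predicted, is_holiday) {
  w <- ifelse(is_holiday, 5, 1)
  sum(w * abs(actual - predicted)) / sum(w)
}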

Exploratory Analysis

Loading Libraries

library(lubridate)
library(tidyverse)
library(data.table)
library(stringr)
library(ggplot2)
library(plotly)
library(corrplot)
library(xgboost)
library(DiagrammeR)

We loaded the data set from 3 CSV files.

Data Loading

  # Note: read.csv2() assumes a decimal comma, so the numeric columns
  # (Temperature, Fuel_Price, CPI, ...) are parsed as factors here;
  # we convert them to numerics later.
  features_data_set <- read.csv2("Features data set.csv", header = TRUE, sep = ",")
  sales_data_set <- read.csv2("sales data-set.csv", header = TRUE, sep = ",")
  stores_data_set <- read.csv2("stores data-set.csv", header = TRUE, sep = ",")
  str(features_data_set)
## 'data.frame':    8190 obs. of  12 variables:
##  $ Store       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Date        : Factor w/ 182 levels "01/02/2013","01/03/2013",..: 25 67 109 151 26 68 110 152 8 50 ...
##  $ Temperature : Factor w/ 4178 levels "-2.06","-6.08",..: 1066 864 934 1308 1300 1998 1797 1601 2315 2579 ...
##  $ Fuel_Price  : Factor w/ 1011 levels "2.472","2.513",..: 16 10 3 12 43 58 88 95 87 121 ...
##  $ MarkDown1   : Factor w/ 4023 levels "-16.93","-2781.45",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ MarkDown2   : Factor w/ 2715 levels "-0.01","-0.05",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ MarkDown3   : Factor w/ 2885 levels "-0.2","-0.73",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ MarkDown4   : Factor w/ 3405 levels "0.22","0.41",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ MarkDown5   : Factor w/ 4045 levels "-185.17","-37.02",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ CPI         : Factor w/ 2505 levels "126.064","126.0766452",..: 1124 1143 1148 1150 1154 1157 1139 1115 1096 1084 ...
##  $ Unemployment: Factor w/ 404 levels "10.064","10.115",..: 286 286 286 286 286 286 286 286 252 252 ...
##  $ IsHoliday   : logi  FALSE TRUE FALSE FALSE FALSE FALSE ...

There are 8190 observations and 12 variables in the Features data set. It describes the internal and external conditions around each store for each week. Note that Temperature, Fuel_Price, CPI and the other numeric columns were read as factors because of the decimal convention mentioned above; we convert them to numerics below.

str(sales_data_set)
## 'data.frame':    421570 obs. of  5 variables:
##  $ Store       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Dept        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Date        : Factor w/ 143 levels "01/04/2011","01/06/2012",..: 20 53 86 119 21 54 87 120 6 39 ...
##  $ Weekly_Sales: Factor w/ 359464 levels "-0.02","-0.04",..: 140272 237401 220814 101950 120302 114885 122532 148007 272542 225936 ...
##  $ IsHoliday   : logi  FALSE TRUE FALSE FALSE FALSE FALSE ...

There are 421570 observations and 5 variables in the Sales data set. It gives detailed information about the weekly sales of each store and department.

str(stores_data_set)
## 'data.frame':    45 obs. of  3 variables:
##  $ Store: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Type : Factor w/ 3 levels "A","B","C": 1 1 2 1 2 1 2 1 2 2 ...
##  $ Size : int  151315 202307 37392 205863 34875 202505 70713 155078 125833 126512 ...

There are 45 observations, one per store. The data shows the types and sizes of the stores. We merged the 3 data sets via the store number.

Data Type Conversion

We needed month and year fields in our data sets. Therefore, we split the dates by character position (the Date strings are formatted dd/mm/yyyy).

features_data_set$Year <- substr(features_data_set$Date, 7, 10)
features_data_set$Month <- substr(features_data_set$Date, 4, 5)
features_data_set$Day <- substr(features_data_set$Date, 1, 2)

sales_data_set$Year <- substr(sales_data_set$Date, 7, 10)
sales_data_set$Month <- substr(sales_data_set$Date, 4, 5)
sales_data_set$Day <- substr(sales_data_set$Date, 1, 2)

sales_data_set$Weekly_Sales <- as.character(sales_data_set$Weekly_Sales)
sales_data_set$Weekly_Sales <- as.numeric(sales_data_set$Weekly_Sales)

Looking at the Sales of the Departments in 3 Years

YearSales <- sales_data_set %>% group_by(Year,Dept) %>% summarise(YearSales = sum(Weekly_Sales)) %>% arrange(desc(YearSales))



ggplot(head(YearSales, 60), aes(Year, YearSales)) +
  geom_col() + facet_wrap(~Dept)

We looked at the yearly sales of the departments with the highest totals. The highest yearly sales occurred in 2011, the only complete calendar year in the data (2010 and 2012 are partial years).

Analyzing the store sizes

SalesStore <- left_join(sales_data_set, stores_data_set, by = "Store")

ggplot(SalesStore, aes(Type, Size)) +
  geom_point()

We wanted to know if there is a relationship between store types and sizes. As you can see, most type A stores are larger than the others.

Looking at the relationship between Store Sizes & Weekly Sales

plot(SalesStore$Size,SalesStore$Weekly_Sales, main = "Size vs Sales", xlab = "Store Size", ylab = "Weekly Sales")

monthsales <- SalesStore %>% group_by(Month) %>% summarise(monthlysales = sum(Weekly_Sales))
monthsales$monthlysales <- as.numeric(monthsales$monthlysales)


qplot(x = Month, y = monthlysales, data = monthsales)

Then it was time to see the monthly sales trend of the stores. The highest sales occurred in April and July.

deptSalesdata <- sales_data_set %>% group_by(Dept) %>% summarise(deptSales = sum(Weekly_Sales)) %>% arrange(desc(deptSales))
deptSalesdata$Dept<-as.factor(deptSalesdata$Dept)

deptSalesdata<-data.frame(deptSalesdata)




ggplot(deptSalesdata, aes(x = Dept, y = deptSales, fill = Dept)) +
  geom_bar(fill = "#56B4E6", stat = "identity") +
  scale_x_discrete(name = "Departments") +
  theme(axis.text.x = element_text(angle = 90)) +
  ggtitle('Sales of the Departments')

There are 99 departments. We wanted to see how the sales are distributed across them. The departments with the highest sales are…….

features_data_set$Temperature<-as.numeric(as.vector(features_data_set$Temperature))
features_data_set$Unemployment<-as.numeric(as.vector(features_data_set$Unemployment))
features_data_set$Fuel_Price<-as.numeric(as.vector(features_data_set$Fuel_Price))

sales_data_set$Weekly_Sales<-as.numeric(as.vector(sales_data_set$Weekly_Sales))

features_temp_m <- features_data_set %>% group_by(Month) %>% summarise(ort_temp=mean(Temperature))


sales_m <- sales_data_set %>%group_by(Month) %>% summarise(ort_sa=mean(Weekly_Sales))

temp_sales <- inner_join(sales_m,features_temp_m,by="Month")

ggplot(temp_sales, aes(x = Month, y = ort_temp, size = ort_sa)) +
  geom_point(shape = 21,colour = "#000000", fill = "#40b8d0")

We wanted to see how external conditions affect sales. We used the temperature data from the Features set. The highest sales occurred when the temperature was at its lowest and its highest.

features_Unem_m <- features_data_set %>% group_by(Month) %>% summarise(avg_une=mean(Unemployment))

Unem_sales <- inner_join(sales_m,features_Unem_m,by="Month")




ggplot(Unem_sales, aes(x = Month, y = avg_une, size = ort_sa)) +
  geom_point(shape = 21,colour = "#000000", fill = "#40b8d0")

features_data_set$CPI<-as.numeric(as.vector(features_data_set$CPI))
features_CPI_ <- features_data_set %>% group_by(Month) %>% summarise(avg_cpi=mean(CPI))

cpi_sales <- inner_join(sales_m,features_CPI_,by="Month")




ggplot(cpi_sales, aes(x = Month, y = avg_cpi, size = ort_sa)) +
  geom_point(shape = 21,colour = "#000000", fill = "#40b8d0")

PCA

SalesStorepca <- left_join(sales_data_set, stores_data_set, by = "Store")

alldatapca <- inner_join(SalesStorepca, features_data_set , by = c("Store", "Year", "Month", "Day")) 

selectcolpca <- alldatapca %>% select( Size, CPI ,Unemployment ,Fuel_Price, Weekly_Sales, Temperature)

selectcolpca$CPI <- as.numeric(as.character(selectcolpca$CPI))
selectcolpca$Unemployment <- as.numeric(as.character(selectcolpca$Unemployment))
selectcolpca$Fuel_Price <- as.numeric(as.character(selectcolpca$Fuel_Price))
selectcolpca$Weekly_Sales <- as.numeric(as.character(selectcolpca$Weekly_Sales))
selectcolpca$Temperature <- as.numeric(as.character(selectcolpca$Temperature))

prout <- prcomp(selectcolpca, center = TRUE, scale. = TRUE)

prvar <- prout$sdev^2

pve <- prvar / sum(prvar)

plot(pve, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")

# Plot cumulative proportion of variance explained

plot(cumsum(pve), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")

The plots show the variability of each principal component and the proportion of total variance it explains:

  • If we use 1 PC, 22% of the total variance can be explained.
  • If we use 2 PCs, 43% of the total variance can be explained.
  • If we use 3 PCs, 62% of the total variance can be explained.
  • If we use 4 PCs, 79% of the total variance can be explained.
  • If we use 5 PCs, 91% of the total variance can be explained.
  • If we use 6 PCs, 100% of the total variance can be explained.
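These figures can also be read off directly from the prcomp object; a quick check:

# Cumulative proportion of variance explained, rounded
round(cumsum(pve), 2)
summary(prout)  # reports the same cumulative proportions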

Correlation Analysis

alldata <- inner_join(SalesStore, features_data_set , by = c("Store", "Year", "Month", "Day")) 

selectcol <- alldata %>% select( Size, CPI ,Unemployment ,Fuel_Price, Weekly_Sales, Temperature)


selectcol$CPI <- as.numeric(as.character(selectcol$CPI))
selectcol$Unemployment <- as.numeric(as.character(selectcol$Unemployment))
selectcol$Fuel_Price <- as.numeric(as.character(selectcol$Fuel_Price))
selectcol$Weekly_Sales <- as.numeric(as.character(selectcol$Weekly_Sales))
selectcol$Temperature <- as.numeric(as.character(selectcol$Temperature))

matrixdata <- as.matrix(selectcol)

corrplot(cor(selectcol) ,method = "circle")

We can clearly see that Size has a stronger relationship with Weekly_Sales than the other features do.
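The exact values behind the circles can be printed as well; a small sketch:

# Correlation matrix rounded to two decimals
round(cor(selectcol), 2)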

Looking at each store's percentage of total monthly sales

Sales<-data.table(sales_data_set)
Features<-data.table(features_data_set)
Stores<-data.table(stores_data_set)

Sales<-Sales[,list(Store,Dept,Date,Weekly_Sales)]

setkey(Sales,Store,Date)
setkey(Features,Store,Date)

Sales<-Features[Sales]

setkey(Sales,Store)
setkey(Stores,Store)

Sales<-Stores[Sales]

#str(Sales)
#summary(Sales)

head(Sales)
##    Store Type   Size       Date Temperature Fuel_Price MarkDown1 MarkDown2
## 1:     1    A 151315 01/04/2011       59.17      3.524        NA        NA
## 2:     1    A 151315 01/04/2011       59.17      3.524        NA        NA
## 3:     1    A 151315 01/04/2011       59.17      3.524        NA        NA
## 4:     1    A 151315 01/04/2011       59.17      3.524        NA        NA
## 5:     1    A 151315 01/04/2011       59.17      3.524        NA        NA
## 6:     1    A 151315 01/04/2011       59.17      3.524        NA        NA
##    MarkDown3 MarkDown4 MarkDown5      CPI Unemployment IsHoliday Year
## 1:        NA        NA        NA 214.8372        7.682     FALSE 2011
## 2:        NA        NA        NA 214.8372        7.682     FALSE 2011
## 3:        NA        NA        NA 214.8372        7.682     FALSE 2011
## 4:        NA        NA        NA 214.8372        7.682     FALSE 2011
## 5:        NA        NA        NA 214.8372        7.682     FALSE 2011
## 6:        NA        NA        NA 214.8372        7.682     FALSE 2011
##    Month Day Dept Weekly_Sales
## 1:    04  01    1     20398.09
## 2:    04  01    2     46991.58
## 3:    04  01    3      8734.19
## 4:    04  01    4     34451.90
## 5:    04  01    5     23598.55
## 6:    04  01    6      3249.27

We also prepared the data with the data.table package, which is helpful for manipulating large data sets faster than dplyr.
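For comparison, the two keyed joins above correspond roughly to the following dplyr pipeline; this is a sketch only, and Sales_dplyr is a hypothetical name not used later:

# Roughly equivalent dplyr version of the data.table joins above
Sales_dplyr <- sales_data_set %>%
  select(Store, Dept, Date, Weekly_Sales) %>%
  left_join(features_data_set, by = c("Store", "Date")) %>%
  left_join(stores_data_set, by = "Store")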

Holiday Sales in Weekly Sales

ggplot(Sales, aes(x =Date , y =Weekly_Sales , fill = IsHoliday)) +
 geom_point(size = 2, shape = 23, color = "#4ABEFF")

We wanted to see how the sales in holiday weeks compare with the rest.

Looking at the number of stores by type

ggplot(Sales, aes(Type, fill = Type)) +
 geom_bar() +
 xlab("Type of Store") + ylab("Count of Store") + facet_grid(~ Year)

We had a look at the counts of the different store types. Type A stores are the most common.

YearlySales<-Sales[,sum(Weekly_Sales,na.rm = TRUE),.(Store,Type,Size)]

setnames(YearlySales,"V1","Yearly_Sales")

ggplot(YearlySales,aes(x=Size,y=Yearly_Sales)) +
 geom_point()+
 geom_smooth(method=lm,color="RED",se = FALSE)+
 scale_x_continuous(waiver()) + scale_y_continuous(waiver())

We expected a strong relationship between the sales and the store sizes, and we checked it with this graph. As you can see, the relationship is positive and strong.
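The strength of this relationship can also be quantified with a correlation coefficient; a one-line sketch (output not shown):

# Pearson correlation between store size and total sales
cor(YearlySales$Size, YearlySales$Yearly_Sales)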

Sales[,Month:=as.numeric(substring(as.character(Date),4,5))]
Sales[,Year:=as.numeric(substring(as.character(Date),7,10))]
Sales$Date<-dmy(Sales$Date)

Sales[,Week:=week(Date)]
Sales[,YearWeek:=as.numeric(ifelse(Week<10,paste(Year,"0",Week,sep=""),paste(Year,Week,sep="")))]
Sales[,SizeC:=ifelse(Type=="A",1,ifelse(Type=="B",2,3))]

We created Year, Month, Week, YearWeek and SizeC columns by using the data.table package.
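An equivalent, slightly more compact way to build the YearWeek column, sketched here as an alternative (it produces the same values):

# Zero-pad the week number with sprintf instead of ifelse/paste
Sales[, YearWeek := as.numeric(sprintf("%d%02d", as.integer(Year), as.integer(Week)))]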

MonthlySales<-Sales[,sum(Weekly_Sales,na.rm = TRUE),.(Store,Month)]

setnames(MonthlySales,"V1","Monthly_Sales")

MonthlySales[,TotalSales:=sum(Monthly_Sales,na.rm = TRUE),.(Month)]

MonthlySales[,SalesPercantage:=Monthly_Sales*1.0/TotalSales]

summary(MonthlySales[,SalesPercantage])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.004148 0.011775 0.020600 0.022222 0.030277 0.046391

We created the MonthlySales table to see the monthly sales over the three years and to calculate each store's percentage of the total monthly sales.
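By construction, the store percentages should sum to 1 within each month; a quick sanity check:

# Each month's SalesPercantage values should sum to 1 across the 45 stores
MonthlySales[, sum(SalesPercantage), by = Month]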

Clustering the stores according to their monthly sales with k-means

We wanted to cluster the stores into groups that share the same sales patterns.

First, we looked at each store's percentage of the total sales in the related month. Then we visualized the clustering.

Clusno <- 5  # the number of clusters

# build the store x month matrix with the dcast function
CM = dcast.data.table(MonthlySales, Store ~ Month, value.var = "SalesPercantage")

S <- colnames(CM)
CM <- data.frame(CM)
CM[is.na(CM)] = 0    # fill NAs with 0
colnames(CM) <- S
CM <- data.table(CM)

head(CM)
##    Store           1           2           3           4           5
## 1:     1 0.033685490 0.034296378 0.034381160 0.033427863 0.033215731
## 2:     2 0.041689678 0.042855724 0.041703736 0.040708568 0.040735639
## 3:     3 0.008887591 0.009148494 0.008936444 0.008553617 0.008462286
## 4:     4 0.046157386 0.046072825 0.044761484 0.043421522 0.043781330
## 5:     5 0.006880421 0.006764307 0.006863682 0.007001230 0.006866335
## 6:     6 0.031829294 0.032476623 0.034275101 0.033896024 0.032960172
##              6           7           8           9          10          11
## 1: 0.032603054 0.031855466 0.032815005 0.033401891 0.033291832 0.032324483
## 2: 0.040681090 0.039530651 0.040181861 0.039947257 0.040663416 0.040842843
## 3: 0.008370371 0.008119139 0.008188921 0.008501163 0.008614132 0.008645907
## 4: 0.043081218 0.043044542 0.044210832 0.045114239 0.045680079 0.045114709
## 5: 0.006695853 0.006456076 0.006519913 0.006863876 0.006715309 0.007007696
## 6: 0.034720069 0.034801612 0.032636899 0.031545752 0.031298558 0.032979546
##             12
## 1: 0.031047761
## 2: 0.041548796
## 3: 0.008398704
## 4: 0.044399801
## 5: 0.006519450
## 6: 0.034087565
set.seed(7)
CM[, clusno := kmeans(CM[, c(2:ncol(CM)), with = F], Clusno)$cluster]

clusters <- CM[, list(Store, clusno)]  # cluster assignments from k-means

setkey(clusters, Store)

setkey(MonthlySales, Store)

MonthlySales <- clusters[MonthlySales]  # merge the clusters into MonthlySales

SalesP <- dcast.data.table(MonthlySales, Month ~ Store, value.var = "SalesPercantage")  # store-month matrix filled with the sales percentages
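The number of clusters was fixed at 5 by inspection. One way to sanity-check that choice is an elbow plot of the total within-cluster sum of squares; a minimal sketch, assuming the 12 month columns of CM sit in columns 2 to 13:

# Elbow check for the choice of k (a sketch, not part of the original analysis)
wss <- sapply(1:10, function(k) {
  set.seed(7)
  kmeans(CM[, 2:13, with = FALSE], centers = k, nstart = 10)$tot.withinss
})
plot(1:10, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")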

MonthlySales$Month <- factor(MonthlySales$Month)
MonthlySales$Store <- factor(MonthlySales$Store)
MonthlySales$clusno <- factor(MonthlySales$clusno)

# Month, Store and clusno were converted to factors.
# Plot reference lines across each facet:

referenceLines <- MonthlySales  # copy used as the "underlayer" in every facet
colnames(referenceLines)[2] <- "groupVar"
zp <- ggplot(MonthlySales,
             aes(x = Month, y = SalesPercantage))
zp <- zp + geom_line(data = referenceLines,  # plotting the "underlayer"
                     aes(x = Month, y = SalesPercantage, group = groupVar),
                     colour = "BLUE", alpha = 1/2, size = 1/2)
zp <- zp + geom_line(size = 1)  # drawing the "overlayer"
zp <- zp + facet_wrap(~ Store)
zp <- zp + theme_bw()
ggplotly(zp)
plot <- ggplot(MonthlySales, aes(x = Month, y = SalesPercantage, color = clusno, group = Store)) +
  geom_line()
ggplotly(plot)

In this graph every line refers to a store and shows its monthly sales trend. Lines with the same colour belong to the same cluster.
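To see how many stores fall into each cluster, a quick count over the clusters table (a sketch):

# Number of stores per cluster
clusters[, .N, by = clusno]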

Forecasting the Weekly Sales with XGBoost

We reloaded the raw files with read.csv, which assumes a decimal point (unlike read.csv2), so the numeric columns are parsed correctly this time, and we rebuilt the merged Sales table.

Features <- read.csv("Features data set.csv", header = TRUE, sep = ",")
Sales <- read.csv("sales data-set.csv", header = TRUE, sep = ",")
Stores <- read.csv("stores data-set.csv", header = TRUE, sep = ",")

Sales<-data.table(Sales)
Features<-data.table(Features)
Stores<-data.table(Stores)

Sales<-Sales[,list(Store,Dept,Date,Weekly_Sales)]

setkey(Sales,Store,Date)
setkey(Features,Store,Date)

Sales<-Features[Sales]

setkey(Sales,Store)
setkey(Stores,Store)

Sales<-Stores[Sales]

str(Sales)
## Classes 'data.table' and 'data.frame':   421570 obs. of  16 variables:
##  $ Store       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Type        : Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Size        : int  151315 151315 151315 151315 151315 151315 151315 151315 151315 151315 ...
##  $ Date        : Factor w/ 143 levels "01/04/2011","01/06/2012",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Temperature : num  59.2 59.2 59.2 59.2 59.2 ...
##  $ Fuel_Price  : num  3.52 3.52 3.52 3.52 3.52 ...
##  $ MarkDown1   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ MarkDown2   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ MarkDown3   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ MarkDown4   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ MarkDown5   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ CPI         : num  215 215 215 215 215 ...
##  $ Unemployment: num  7.68 7.68 7.68 7.68 7.68 ...
##  $ IsHoliday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Dept        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Weekly_Sales: num  20398 46992 8734 34452 23599 ...
##  - attr(*, "sorted")= chr "Store"
##  - attr(*, ".internal.selfref")=<externalptr>
summary(Sales)
##      Store      Type            Size                Date       
##  Min.   : 1.0   A:215478   Min.   : 34875   23/12/2011:  3027  
##  1st Qu.:11.0   B:163495   1st Qu.: 93638   25/11/2011:  3021  
##  Median :22.0   C: 42597   Median :140167   16/12/2011:  3013  
##  Mean   :22.2              Mean   :136728   09/12/2011:  3010  
##  3rd Qu.:33.0              3rd Qu.:202505   17/02/2012:  3007  
##  Max.   :45.0              Max.   :219622   30/12/2011:  3003  
##                                             (Other)   :403489  
##   Temperature       Fuel_Price      MarkDown1          MarkDown2       
##  Min.   : -2.06   Min.   :2.472   Min.   :    0.27   Min.   :  -265.8  
##  1st Qu.: 46.68   1st Qu.:2.933   1st Qu.: 2240.27   1st Qu.:    41.6  
##  Median : 62.09   Median :3.452   Median : 5347.45   Median :   192.0  
##  Mean   : 60.09   Mean   :3.361   Mean   : 7246.42   Mean   :  3334.6  
##  3rd Qu.: 74.28   3rd Qu.:3.738   3rd Qu.: 9210.90   3rd Qu.:  1926.9  
##  Max.   :100.14   Max.   :4.468   Max.   :88646.76   Max.   :104519.5  
##                                   NA's   :270889     NA's   :310322    
##    MarkDown3           MarkDown4          MarkDown5             CPI       
##  Min.   :   -29.10   Min.   :    0.22   Min.   :   135.2   Min.   :126.1  
##  1st Qu.:     5.08   1st Qu.:  504.22   1st Qu.:  1878.4   1st Qu.:132.0  
##  Median :    24.60   Median : 1481.31   Median :  3359.4   Median :182.3  
##  Mean   :  1439.42   Mean   : 3383.17   Mean   :  4629.0   Mean   :171.2  
##  3rd Qu.:   103.99   3rd Qu.: 3595.04   3rd Qu.:  5563.8   3rd Qu.:212.4  
##  Max.   :141630.61   Max.   :67474.85   Max.   :108519.3   Max.   :227.2  
##  NA's   :284479      NA's   :286603     NA's   :270138                    
##   Unemployment    IsHoliday            Dept        Weekly_Sales   
##  Min.   : 3.879   Mode :logical   Min.   : 1.00   Min.   : -4989  
##  1st Qu.: 6.891   FALSE:391909    1st Qu.:18.00   1st Qu.:  2080  
##  Median : 7.866   TRUE :29661     Median :37.00   Median :  7612  
##  Mean   : 7.960                   Mean   :44.26   Mean   : 15981  
##  3rd Qu.: 8.572                   3rd Qu.:74.00   3rd Qu.: 20206  
##  Max.   :14.313                   Max.   :99.00   Max.   :693099  
## 

Creating new variables (Month, Year)

Sales[,Month:=as.numeric(substring(as.character(Date),4,5))]
Sales[,Year:=as.numeric(substring(as.character(Date),7,10))]

Converting the Date variable to the Date type with the lubridate package

Sales$Date<-dmy(Sales$Date)
str(Sales$Date)
##  Date[1:421570], format: "2011-04-01" "2011-04-01" "2011-04-01" "2011-04-01" "2011-04-01" ...

Creating new variables (Week, YearWeek, SizeC)

Sales[,Week:=week(Date)]
Sales[,YearWeek:=as.numeric(ifelse(Week<10,paste(Year,"0",Week,sep=""),paste(Year,Week,sep="")))]
Sales[,SizeC:=ifelse(Type=="A",1,ifelse(Type=="B",2,3))]
ggplot(Sales[Type=="A"& Store==4,],
      aes(x = YearWeek, y = Weekly_Sales, color = factor(Type))) +
 geom_smooth(size = 2) + facet_wrap(~Store) +
 theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(Sales[Type=="B" & Store==10,],
      aes(x = YearWeek, y = Weekly_Sales, color = factor(Type))) +
 geom_smooth(size = 5) + facet_wrap(~Store) +
 theme(axis.text.x = element_text(angle = 90))

ggplot(Sales[Type=="C"& Store==30,],
      aes(x = YearWeek, y = Weekly_Sales, color = factor(Type))) +
 geom_smooth(size = 2) + facet_wrap(~Store) +
 theme(axis.text.x = element_text(angle = 90, hjust = 1))

Aggregating the sales to the week and store level

SalesWeeklybyStores<-Sales[,sum(Weekly_Sales),.(Year,Month,Week,YearWeek,Store,Type,Size,Temperature,Fuel_Price,CPI,Unemployment,IsHoliday)]

Converting the IsHoliday variable from logical to binary

SalesWeeklybyStores[,IsHoliday:=ifelse(IsHoliday=="FALSE",0,1)]

setnames(SalesWeeklybyStores,"V1","WeeklySales")

str(SalesWeeklybyStores)
## Classes 'data.table' and 'data.frame':   6435 obs. of  13 variables:
##  $ Year        : num  2011 2012 2011 2010 2012 ...
##  $ Month       : num  4 6 7 10 3 4 7 9 12 2 ...
##  $ Week        : int  14 22 27 40 9 14 27 36 49 5 ...
##  $ YearWeek    : num  201114 201222 201127 201040 201209 ...
##  $ Store       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Type        : Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Size        : int  151315 151315 151315 151315 151315 151315 151315 151315 151315 151315 ...
##  $ Temperature : num  59.2 78 85.5 71.9 61 ...
##  $ Fuel_Price  : num  3.52 3.5 3.52 2.6 3.63 ...
##  $ CPI         : num  215 222 215 212 221 ...
##  $ Unemployment: num  7.68 7.14 7.96 7.84 7.35 ...
##  $ IsHoliday   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ WeeklySales : num  1495065 1624478 1488538 1453330 1688421 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Creating new features with lags of sales, fuel price and temperature

setkey(SalesWeeklybyStores,Store,YearWeek)
SalesWeeklybyStores[, Weekly_Sales_lag1:=dplyr::lag(WeeklySales,1),.(Store)]
SalesWeeklybyStores[, Fuel_Price_lag1:=dplyr::lag(Fuel_Price,1),.(Store)]
SalesWeeklybyStores[, Temperature_lag1:=dplyr::lag(Temperature,1),.(Store)]

Setting the train and test data sets

trainset = SalesWeeklybyStores[YearWeek<=201215 &YearWeek>201014,]
testset = SalesWeeklybyStores[YearWeek>201215,]
testset[, id := .I]

summary(trainset) 
##       Year          Month             Week          YearWeek     
##  Min.   :2010   Min.   : 1.000   Min.   : 1.00   Min.   :201015  
##  1st Qu.:2010   1st Qu.: 4.000   1st Qu.:14.00   1st Qu.:201041  
##  Median :2011   Median : 7.000   Median :27.00   Median :201116  
##  Mean   :2011   Mean   : 6.566   Mean   :27.13   Mean   :201104  
##  3rd Qu.:2011   3rd Qu.:10.000   3rd Qu.:40.00   3rd Qu.:201142  
##  Max.   :2012   Max.   :12.000   Max.   :53.00   Max.   :201215  
##      Store    Type          Size         Temperature       Fuel_Price   
##  Min.   : 1   A:2332   Min.   : 34875   Min.   : -2.06   Min.   :2.513  
##  1st Qu.:12   B:1802   1st Qu.: 70713   1st Qu.: 45.68   1st Qu.:2.903  
##  Median :23   C: 636   Median :126512   Median : 60.52   Median :3.308  
##  Mean   :23            Mean   :130288   Mean   : 59.19   Mean   :3.308  
##  3rd Qu.:34            3rd Qu.:202307   3rd Qu.: 73.17   3rd Qu.:3.675  
##  Max.   :45            Max.   :219622   Max.   :100.14   Max.   :4.294  
##       CPI         Unemployment      IsHoliday        WeeklySales     
##  Min.   :126.1   Min.   : 4.125   Min.   :0.00000   Min.   : 209986  
##  1st Qu.:132.1   1st Qu.: 7.082   1st Qu.:0.00000   1st Qu.: 554929  
##  Median :182.6   Median : 7.943   Median :0.00000   Median : 954629  
##  Mean   :170.8   Mean   : 8.141   Mean   :0.07547   Mean   :1049687  
##  3rd Qu.:211.7   3rd Qu.: 8.622   3rd Qu.:0.00000   3rd Qu.:1417454  
##  Max.   :225.3   Max.   :14.313   Max.   :1.00000   Max.   :3818686  
##  Weekly_Sales_lag1 Fuel_Price_lag1 Temperature_lag1
##  Min.   : 209986   Min.   :2.513   Min.   : -2.06  
##  1st Qu.: 554929   1st Qu.:2.886   1st Qu.: 45.64  
##  Median : 955465   Median :3.290   Median : 60.45  
##  Mean   :1050483   Mean   :3.297   Mean   : 59.14  
##  3rd Qu.:1418001   3rd Qu.:3.664   3rd Qu.: 73.16  
##  Max.   :3818686   Max.   :4.294   Max.   :100.14

Creating the performance measurement function, RMPSE (root mean square percentage error)

RMPSE <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  # labels and preds are on the log(sales + 1) scale; undo the transform
  elab <- exp(as.numeric(labels)) - 1
  epreds <- exp(as.numeric(preds)) - 1
  # root mean square percentage error
  err <- sqrt(mean((epreds / elab - 1)^2))
  return(list(metric = "RMPSE", value = err))
}
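As a quick sanity check of the metric on made-up numbers (all values below are hypothetical):

# exp(log(x + 1)) - 1 recovers x, so the percentage errors here are
# 10%, -5% and 10%, giving sqrt(mean(c(0.1, -0.05, 0.1)^2))
labels <- log(c(100, 200, 300) + 1)
preds  <- log(c(110, 190, 330) + 1)
sqrt(mean(((exp(preds) - 1) / (exp(labels) - 1) - 1)^2))  # about 0.087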

Sampling a 600-row validation set from the training data

nrow(SalesWeeklybyStores)
## [1] 6435
h<-sample(nrow(trainset),600)

Creating the xgboost DMatrix objects. We added 1 to the sales before taking logs because log(0) is -Inf.

# Note: data.matrix(trainset) keeps every column, including WeeklySales
# itself (the source of the label)
dval<-xgb.DMatrix(data=data.matrix(trainset[h,]),label=log(trainset$WeeklySales+1)[h])
dtrain<-xgb.DMatrix(data=data.matrix(trainset[-h,]),label=log(trainset$WeeklySales+1)[-h])
watchlist<-list(val=dval,trainset=dtrain)

Setting the parameters

param <- list(  objective           = "reg:linear",
                booster = "gbtree",
                eta                 = 0.25, # 0.05, #0.1,
                max_depth           = 5, 
                subsample           = 0.7, 
                colsample_bytree    = 0.7 

)

Finding the best seed number for maximum accuracy

selectseed<-function(a,b){

bestseed2 = data.table(s = 201, x = 1.1, y = 1.1)

for (s in a:b){

  set.seed(s)

xgbmodel <- xgb.train(   params         = param,
                    data                = dtrain,
                    nrounds             = 30, #10, #20, #30, #100, # changed from 30
                    verbose             = 0,
                    early.stop.round    = 25,
                    watchlist           = watchlist,
                    maximize            = FALSE,
                    feval=RMPSE
)
testset<-data.frame(testset)

feature.names <- names(testset)[-c(13,17)]

forecasts <- exp(predict(xgbmodel, data.matrix(testset[,feature.names]))) -1

forecasts<-data.table(forecasts)

submission <- data.frame(id=testset$id, Sales=forecasts)

submission<-data.table(submission)
testset<-data.table(testset)

setkey(submission,id)
setkey(testset,id)

final<-testset[submission]

final[,Err:=abs(forecasts-WeeklySales)/WeeklySales]
final2<-final[,list(sum(WeeklySales),sum(abs(forecasts-WeeklySales)),sum(forecasts)),.(Store)]

setnames(final2,"V1","TWS")
setnames(final2,"V2","TE")
setnames(final2,"V3","TP")

x<-final2[,sum(TE)/sum(TWS)]
y<-final[,mean(Err)]

bestseed<-data.table(s,x,y)
bestseed2<-rbind(bestseed,bestseed2)

}

seed <- head(bestseed2[order(x)], 1)$s
return(seed)
}

Searching for the best seed number between a and b

a=1
b=60
c = selectseed(a,b)
c
## [1] 43

Setting the seed number

set.seed(c)

Training the model

xgbmodel <- xgb.train(   params              = param,
                    data                = dtrain,
                    nrounds             = 30, #10, #20, #30, #100, # changed from 30
                    verbose             = 1,
                    early.stop.round    = 25,
                    watchlist           = watchlist,
                    maximize            = FALSE,
                    feval=RMPSE
)
## [1]  val-RMPSE:0.999942  trainset-RMPSE:0.999945 
## Multiple eval metrics are present. Will use trainset_RMPSE for early stopping.
## Will train until trainset_RMPSE hasn't improved in 25 rounds.
## 
## [2]  val-RMPSE:0.999333  trainset-RMPSE:0.999364 
## [3]  val-RMPSE:0.995876  trainset-RMPSE:0.996033 
## [4]  val-RMPSE:0.983601  trainset-RMPSE:0.984199 
## [5]  val-RMPSE:0.954389  trainset-RMPSE:0.955691 
## [6]  val-RMPSE:0.901670  trainset-RMPSE:0.903903 
## [7]  val-RMPSE:0.824705  trainset-RMPSE:0.827982 
## [8]  val-RMPSE:0.729323  trainset-RMPSE:0.733496 
## [9]  val-RMPSE:0.624876  trainset-RMPSE:0.629821 
## [10] val-RMPSE:0.521029  trainset-RMPSE:0.526172 
## [11] val-RMPSE:0.424459  trainset-RMPSE:0.429680 
## [12] val-RMPSE:0.339660  trainset-RMPSE:0.344853 
## [13] val-RMPSE:0.267867  trainset-RMPSE:0.272932 
## [14] val-RMPSE:0.209317  trainset-RMPSE:0.213797 
## [15] val-RMPSE:0.162793  trainset-RMPSE:0.167009 
## [16] val-RMPSE:0.126050  trainset-RMPSE:0.129664 
## [17] val-RMPSE:0.097588  trainset-RMPSE:0.100406 
## [18] val-RMPSE:0.075705  trainset-RMPSE:0.078255 
## [19] val-RMPSE:0.059313  trainset-RMPSE:0.061563 
## [20] val-RMPSE:0.046990  trainset-RMPSE:0.048967 
## [21] val-RMPSE:0.038121  trainset-RMPSE:0.039693 
## [22] val-RMPSE:0.032100  trainset-RMPSE:0.033313 
## [23] val-RMPSE:0.027401  trainset-RMPSE:0.028277 
## [24] val-RMPSE:0.023985  trainset-RMPSE:0.024712 
## [25] val-RMPSE:0.021600  trainset-RMPSE:0.022039 
## [26] val-RMPSE:0.020632  trainset-RMPSE:0.020770 
## [27] val-RMPSE:0.019103  trainset-RMPSE:0.019047 
## [28] val-RMPSE:0.018530  trainset-RMPSE:0.018302 
## [29] val-RMPSE:0.017533  trainset-RMPSE:0.017323 
## [30] val-RMPSE:0.017169  trainset-RMPSE:0.016820
testset<-data.frame(testset)

Setting the feature names in the test data set

feature.names <- names(testset)[-c(13,17)]  # drop WeeklySales (col 13) and id (col 17)
feature.names
##  [1] "Year"              "Month"             "Week"             
##  [4] "YearWeek"          "Store"             "Type"             
##  [7] "Size"              "Temperature"       "Fuel_Price"       
## [10] "CPI"               "Unemployment"      "IsHoliday"        
## [13] "Weekly_Sales_lag1" "Fuel_Price_lag1"   "Temperature_lag1"

Generating the forecasts with the predict function; exp converts the values back from the log scale

forecasts <- exp(predict(xgbmodel, data.matrix(testset[,feature.names]))) -1
forecasts<-data.table(forecasts)

Creating the final table, which contains the forecasts and the actual values

submission <- data.frame(id=testset$id, Sales=forecasts)

submission<-data.table(submission)
testset<-data.table(testset)

setkey(submission,id)
setkey(testset,id)

final<-testset[submission]

Measuring the performance

final[,Err:=abs(forecasts-WeeklySales)/WeeklySales]
final2<-final[,list(sum(WeeklySales),sum(abs(forecasts-WeeklySales)),sum(forecasts)),.(Store)]
final3<-final[,list(sum(WeeklySales),sum(abs(forecasts-WeeklySales)),sum(forecasts)),.(Store,YearWeek)]

setnames(final2,"V1","TWS") #Total Weekly Sales
setnames(final2,"V2","TE")  #Total Errors
setnames(final2,"V3","TF")  #Total Forecasts

setnames(final3,"V1","TWS") #Total Weekly Sales
setnames(final3,"V2","TE")  #Total Errors
setnames(final3,"V3","TF")  #Total Forecasts

Store and year-week percentage errors of the forecasts

final3[,PE:=sum(TE)/sum(TWS),.(Store,YearWeek,TWS,TE,TF)]
final3[,list(Store,PE)]
##       Store          PE
##    1:     1 0.047429468
##    2:     1 0.010801324
##    3:     1 0.131069442
##    4:     1 0.016546445
##    5:     1 0.001351193
##   ---                  
## 1256:    45 0.016677900
## 1257:    45 0.014999523
## 1258:    45 0.012793222
## 1259:    45 0.009667758
## 1260:    45 0.049755044
plot1 <- ggplot(final3, aes(YearWeek)) + 
  geom_line(aes(y = PE, colour = "PE")) + 
  facet_wrap(~Store) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplotly(plot1)
plot2<-ggplot(final3, 
       aes(x = YearWeek, y = PE)) + 
  geom_smooth(size = 0.01) + facet_wrap(~Store) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplotly(plot2)

Store percentage errors of forecasts

final2[,PE:=sum(TE)/sum(TWS),.(Store)]
final2[,list(Store,PE)]
##     Store         PE
##  1:     1 0.05190645
##  2:     2 0.04212371
##  3:     3 0.03745798
##  4:     4 0.03448086
##  5:     5 0.04854706
##  6:     6 0.05098826
##  7:     7 0.05915099
##  8:     8 0.03512490
##  9:     9 0.04346671
## 10:    10 0.03231090
## 11:    11 0.05870505
## 12:    12 0.05148974
## 13:    13 0.03731989
## 14:    14 0.06185494
## 15:    15 0.04894954
## 16:    16 0.06033702
## 17:    17 0.07136176
## 18:    18 0.05285955
## 19:    19 0.05191724
## 20:    20 0.04168502
## 21:    21 0.04170725
## 22:    22 0.04721929
## 23:    23 0.05292478
## 24:    24 0.06237871
## 25:    25 0.03578157
## 26:    26 0.04917069
## 27:    27 0.05101435
## 28:    28 0.11176581
## 29:    29 0.05832922
## 30:    30 0.02661010
## 31:    31 0.03638163
## 32:    32 0.03881713
## 33:    33 0.07282118
## 34:    34 0.03125023
## 35:    35 0.05379618
## 36:    36 0.04295618
## 37:    37 0.03514427
## 38:    38 0.06851168
## 39:    39 0.04595508
## 40:    40 0.07017110
## 41:    41 0.04371120
## 42:    42 0.08121408
## 43:    43 0.04285009
## 44:    44 0.03047717
## 45:    45 0.04182988
##     Store         PE

Total percentage error of the forecasts over the whole test data

final[,mean(Err)]
## [1] 0.04928789

Printing the first 10 lines of the model dump

model <- xgb.dump(xgbmodel, with.stats = TRUE)
model[1:10] 
##  [1] "booster[0]"                                                       
##  [2] "0:[f12<797364] yes=1,no=2,missing=1,gain=555.558,cover=2873"      
##  [3] "1:leaf=3.14284,cover=1103"                                        
##  [4] "2:leaf=3.40188,cover=1770"                                        
##  [5] "booster[1]"                                                       
##  [6] "0:[f12<1.01431e+006] yes=1,no=2,missing=1,gain=357.353,cover=2937"
##  [7] "1:[f12<460641] yes=3,no=4,missing=3,gain=37.5645,cover=1576"      
##  [8] "3:leaf=2.27091,cover=461"                                         
##  [9] "4:leaf=2.42762,cover=1115"                                        
## [10] "2:leaf=2.58059,cover=1361"
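In the dump, f0, f1, ... index the columns of the training matrix in order, so f12 is its 13th column. A small sketch to map the indices back to variable names (feat_map is a hypothetical helper):

# Map dump feature indices (f0, f1, ...) to training-matrix column names
cols <- colnames(data.matrix(trainset))
feat_map <- data.frame(index = paste0("f", seq_along(cols) - 1), name = cols)
feat_map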

Printing the importance matrix

names <- dimnames(data.matrix(testset[,-c(13,17)]))[[2]]
importance_matrix <- xgb.importance(names, model = xgbmodel)
xgb.plot.importance(importance_matrix[1:5,])

Plotting the first 10 trees

xgb.plot.tree(model = xgbmodel, n_first_tree = 10) 

Plotting the actual sales and forecasts

z <- ggplot(final, aes(YearWeek)) + 
  geom_line(aes(y = WeeklySales, colour = "WeeklySales")) + 
  geom_line(aes(y = forecasts, colour = "forecasts")) +
  facet_wrap(~Store) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplotly(z)

References

[1] Retail Data Analytics. (2017, August). Retrieved from https://www.kaggle.com/manjeetsingh/retaildataset
[2] XGBoost w/ Custom Error Function. Retrieved from https://www.kaggle.com/chipmonkey/obligatory-xgboost-example