The main purpose of this project is to predict store-wide sales for each store for the following year. In addition, we aim to find meaningful insights and to cluster the stores based on each store's percentage of total monthly sales.
We found a data set on Kaggle named Retail Data Analytics, which contains the sales of a retail company.
There are historical sales data for 45 stores located in different regions; each store contains a number of departments. The company also runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Within the Excel sheet, there are 3 tabs – Stores, Features and Sales.
Stores: anonymized information about the 45 stores, indicating the type and size of each store.
Features: contains additional data related to the store, department, and regional activity for the given dates.
+ Store - the store number
+ Date - the week
+ Temperature - average temperature in the region
+ Fuel_Price - cost of fuel in the region
+ MarkDown1-5 - anonymized data related to promotional markdowns. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA
+ CPI - the consumer price index
+ Unemployment - the unemployment rate
+ IsHoliday - whether the week is a special holiday week
Sales: historical sales data, covering 2010-02-05 to 2012-11-01. Within this tab there are the following fields:
+ Store - the store number
+ Dept - the department number
+ Date - the week
+ Weekly_Sales - sales for the given department in the given store
+ IsHoliday - whether the week is a special holiday week [1]
library(lubridate)
library(tidyverse)
library(data.table)
library(stringr)
library(ggplot2)
library(plotly)
library(corrplot)
library(xgboost)
library(DiagrammeR)
We loaded the data set from 3 CSV files.
features_data_set <- read.csv2("Features data set.csv", header = TRUE, sep = ",")
sales_data_set <- read.csv2("sales data-set.csv", header = TRUE, sep = ",")
stores_data_set <- read.csv2("stores data-set.csv", header = TRUE, sep = ",")
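Note that read.csv2 expects "," as the decimal separator, which is why the numeric columns below arrive as factors. As a minimal sketch (features_numeric is a hypothetical name), re-reading with read.csv parses them as numbers directly; we do exactly this in the forecasting section.
features_numeric <- read.csv("Features data set.csv") # read.csv assumes "." decimals
str(features_numeric$Temperature) # num, no factor conversion needed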
str(features_data_set)
## 'data.frame': 8190 obs. of 12 variables:
## $ Store : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Factor w/ 182 levels "01/02/2013","01/03/2013",..: 25 67 109 151 26 68 110 152 8 50 ...
## $ Temperature : Factor w/ 4178 levels "-2.06","-6.08",..: 1066 864 934 1308 1300 1998 1797 1601 2315 2579 ...
## $ Fuel_Price : Factor w/ 1011 levels "2.472","2.513",..: 16 10 3 12 43 58 88 95 87 121 ...
## $ MarkDown1 : Factor w/ 4023 levels "-16.93","-2781.45",..: NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown2 : Factor w/ 2715 levels "-0.01","-0.05",..: NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown3 : Factor w/ 2885 levels "-0.2","-0.73",..: NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown4 : Factor w/ 3405 levels "0.22","0.41",..: NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown5 : Factor w/ 4045 levels "-185.17","-37.02",..: NA NA NA NA NA NA NA NA NA NA ...
## $ CPI : Factor w/ 2505 levels "126.064","126.0766452",..: 1124 1143 1148 1150 1154 1157 1139 1115 1096 1084 ...
## $ Unemployment: Factor w/ 404 levels "10.064","10.115",..: 286 286 286 286 286 286 286 286 252 252 ...
## $ IsHoliday : logi FALSE TRUE FALSE FALSE FALSE FALSE ...
There are 8190 observations and 12 variables in the Features data set. This data set describes the internal and external conditions for each store and week.
str(sales_data_set)
## 'data.frame': 421570 obs. of 5 variables:
## $ Store : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Dept : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Factor w/ 143 levels "01/04/2011","01/06/2012",..: 20 53 86 119 21 54 87 120 6 39 ...
## $ Weekly_Sales: Factor w/ 359464 levels "-0.02","-0.04",..: 140272 237401 220814 101950 120302 114885 122532 148007 272542 225936 ...
## $ IsHoliday : logi FALSE TRUE FALSE FALSE FALSE FALSE ...
There are 421570 observations and 5 variables in the Sales data set. This data set gives detailed information about sales.
str(stores_data_set)
## 'data.frame': 45 obs. of 3 variables:
## $ Store: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Type : Factor w/ 3 levels "A","B","C": 1 1 2 1 2 1 2 1 2 2 ...
## $ Size : int 151315 202307 37392 205863 34875 202505 70713 155078 125833 126512 ...
There are 45 observations, one per store. The data shows the types and sizes of the stores. We merged the 3 data sets via the store numbers.
We needed the months and years in our data set. Therefore, we split the dates by character position.
features_data_set$Year <- substr(features_data_set$Date, 7, 10)
features_data_set$Month <- substr(features_data_set$Date, 4, 5)
features_data_set$Day <- substr(features_data_set$Date, 1, 2)
sales_data_set$Year <- substr(sales_data_set$Date, 7, 10)
sales_data_set$Month <- substr(sales_data_set$Date, 4, 5)
sales_data_set$Day <- substr(sales_data_set$Date, 1, 2)
sales_data_set$Weekly_Sales <- as.character(sales_data_set$Weekly_Sales)
sales_data_set$Weekly_Sales <- as.numeric(sales_data_set$Weekly_Sales)
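Since lubridate is already loaded, an alternative sketch is to parse the dates once and extract the parts from a proper Date object; DateParsed, Year2, and Month2 are hypothetical columns, and note that year() and month() return numbers rather than zero-padded strings, so any joins on Month would need the same convention.
sales_data_set$DateParsed <- dmy(sales_data_set$Date) # e.g. "05/02/2010" -> 2010-02-05
sales_data_set$Year2 <- year(sales_data_set$DateParsed) # numeric, e.g. 2010
sales_data_set$Month2 <- month(sales_data_set$DateParsed) # numeric 1-12, not "01"-"12"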
YearSales <- sales_data_set %>% group_by(Year,Dept) %>% summarise(YearSales = sum(Weekly_Sales)) %>% arrange(desc(YearSales))
ggplot(head(YearSales, 60), aes(Year, YearSales)) +
geom_col() + facet_wrap(~Dept)
We looked at the yearly sales of all the stores. The highest yearly sales occurred in 2011, which is also the only complete calendar year in the data (the records start in February 2010 and end in November 2012).
SalesStore <- left_join(sales_data_set, stores_data_set, by = "Store")
ggplot(SalesStore, aes(Type, Size)) +
geom_point()
We wanted to know whether there is a relationship between store types and sizes. As you can see, most of the type A stores are larger.
plot(SalesStore$Size,SalesStore$Weekly_Sales, main = "Size vs Sales", xlab = "Store Size", ylab = "Weekly Sales")
monthsales<-SalesStore %>% group_by(Month) %>% summarise(montlysales=sum(Weekly_Sales))
monthsales$montlysales <- as.numeric(monthsales$montlysales)
qplot(x =Month , y = montlysales,data = monthsales)
Then it was time to see the monthly sales trend of the stores. The highest sales occurred in April and July.
deptSalesdata <- sales_data_set %>% group_by(Dept) %>% summarise(deptSales = sum(Weekly_Sales)) %>% arrange(desc(deptSales))
deptSalesdata$Dept<-as.factor(deptSalesdata$Dept)
deptSalesdata<-data.frame(deptSalesdata)
ggplot(deptSalesdata, aes(x = Dept, y = deptSales, fill = Dept)) +
geom_bar(fill = "#56B4E6", stat = "identity") +
scale_x_discrete(name = "Departments") +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("Sales of the Departments")
There were 99 departments. We wanted to see the sales distribution across them and identify which departments sell the most.
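Because deptSalesdata is already sorted in descending order, the top sellers can be read off directly:
head(deptSalesdata, 10) # the ten departments with the highest total sales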
features_data_set$Temperature<-as.numeric(as.vector(features_data_set$Temperature))
features_data_set$Unemployment<-as.numeric(as.vector(features_data_set$Unemployment))
features_data_set$Fuel_Price<-as.numeric(as.vector(features_data_set$Fuel_Price))
sales_data_set$Weekly_Sales<-as.numeric(as.vector(sales_data_set$Weekly_Sales))
features_temp_m <- features_data_set %>% group_by(Month) %>% summarise(ort_temp=mean(Temperature))
sales_m <- sales_data_set %>%group_by(Month) %>% summarise(ort_sa=mean(Weekly_Sales))
temp_sales <- inner_join(sales_m,features_temp_m,by="Month")
ggplot(temp_sales, aes(x = Month, y = ort_temp, size = ort_sa)) +
geom_point(shape = 21,colour = "#000000", fill = "#40b8d0")
We wanted to see how external conditions affected sales. We used the temperature data from the Features set. The highest sales occurred when the temperature was at its lowest and at its highest.
features_Unem_m <- features_data_set %>% group_by(Month) %>% summarise(avg_une=mean(Unemployment))
Unem_sales <- inner_join(sales_m,features_Unem_m,by="Month")
ggplot(Unem_sales, aes(x = Month, y = avg_une, size = ort_sa)) +
geom_point(shape = 21,colour = "#000000", fill = "#40b8d0")
features_data_set$CPI<-as.numeric(as.vector(features_data_set$CPI))
features_CPI_ <- features_data_set %>% group_by(Month) %>% summarise(avg_cpi=mean(CPI))
cpi_sales <- inner_join(sales_m,features_CPI_,by="Month")
ggplot(cpi_sales, aes(x = Month, y = avg_cpi, size = ort_sa)) +
geom_point(shape = 21,colour = "#000000", fill = "#40b8d0")
SalesStorepca <- left_join(sales_data_set, stores_data_set, by = "Store")
alldatapca <- inner_join(SalesStorepca, features_data_set , by = c("Store", "Year", "Month", "Day"))
selectcolpca <- alldatapca %>% select( Size, CPI ,Unemployment ,Fuel_Price, Weekly_Sales, Temperature)
selectcolpca$CPI <- as.numeric(as.character(selectcolpca$CPI))
selectcolpca$Unemployment <- as.numeric(as.character(selectcolpca$Unemployment))
selectcolpca$Fuel_Price <- as.numeric(as.character(selectcolpca$Fuel_Price))
selectcolpca$Weekly_Sales <- as.numeric(as.character(selectcolpca$Weekly_Sales))
selectcolpca$Temperature <- as.numeric(as.character(selectcolpca$Temperature))
prout <- prcomp(selectcolpca, center = TRUE, scale. = TRUE)
prvar <- prout$sdev^2
pve <- prvar / sum(prvar)
plot(pve, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
# Plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
The first plot shows the proportion of variance explained by each principal component; the second shows the cumulative proportion.
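To see which of the six variables drives each component, the loadings and the variance summary can be inspected:
summary(prout) # standard deviations and variance proportions per component
round(prout$rotation, 2) # loadings: contribution of each variable to each PC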
alldata <- inner_join(SalesStore, features_data_set , by = c("Store", "Year", "Month", "Day"))
selectcol <- alldata %>% select( Size, CPI ,Unemployment ,Fuel_Price, Weekly_Sales, Temperature)
selectcol$CPI <- as.numeric(as.character(selectcol$CPI))
selectcol$Unemployment <- as.numeric(as.character(selectcol$Unemployment))
selectcol$Fuel_Price <- as.numeric(as.character(selectcol$Fuel_Price))
selectcol$Weekly_Sales <- as.numeric(as.character(selectcol$Weekly_Sales))
selectcol$Temperature <- as.numeric(as.character(selectcol$Temperature))
matrixdata <- as.matrix(selectcol)
corrplot(cor(selectcol) ,method = "circle")
We can see that the relationship between size and weekly sales is stronger than the relationships among the other features.
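The coefficient behind that visual impression can be checked directly:
cor(selectcol$Size, selectcol$Weekly_Sales) # Pearson correlation of store size and weekly sales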
Sales<-data.table(sales_data_set)
Features<-data.table(features_data_set)
Stores<-data.table(stores_data_set)
Sales<-Sales[,list(Store,Dept,Date,Weekly_Sales)]
setkey(Sales,Store,Date)
setkey(Features,Store,Date)
Sales<-Features[Sales]
setkey(Sales,Store)
setkey(Stores,Store)
Sales<-Stores[Sales]
#str(Sales)
#summary(Sales)
head(Sales)
## Store Type Size Date Temperature Fuel_Price MarkDown1 MarkDown2
## 1: 1 A 151315 01/04/2011 59.17 3.524 NA NA
## 2: 1 A 151315 01/04/2011 59.17 3.524 NA NA
## 3: 1 A 151315 01/04/2011 59.17 3.524 NA NA
## 4: 1 A 151315 01/04/2011 59.17 3.524 NA NA
## 5: 1 A 151315 01/04/2011 59.17 3.524 NA NA
## 6: 1 A 151315 01/04/2011 59.17 3.524 NA NA
## MarkDown3 MarkDown4 MarkDown5 CPI Unemployment IsHoliday Year
## 1: NA NA NA 214.8372 7.682 FALSE 2011
## 2: NA NA NA 214.8372 7.682 FALSE 2011
## 3: NA NA NA 214.8372 7.682 FALSE 2011
## 4: NA NA NA 214.8372 7.682 FALSE 2011
## 5: NA NA NA 214.8372 7.682 FALSE 2011
## 6: NA NA NA 214.8372 7.682 FALSE 2011
## Month Day Dept Weekly_Sales
## 1: 04 01 1 20398.09
## 2: 04 01 2 46991.58
## 3: 04 01 3 8734.19
## 4: 04 01 4 34451.90
## 5: 04 01 5 23598.55
## 6: 04 01 6 3249.27
We also prepared the data with the data.table package, which is helpful for manipulating large data sets faster than dplyr.
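For comparison, the keyed joins above (Features[Sales], then Stores[...]) correspond to ordinary left joins; a dplyr sketch with the hypothetical name Sales_dplyr:
Sales_dplyr <- sales_data_set %>%
select(Store, Dept, Date, Weekly_Sales) %>%
left_join(features_data_set, by = c("Store", "Date")) %>%
left_join(stores_data_set, by = "Store")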
ggplot(Sales, aes(x =Date , y =Weekly_Sales , fill = IsHoliday)) +
geom_point(size = 2, shape = 23, color = "#4ABEFF")
We wanted to see the sales on holidays.
ggplot(Sales, aes(Type, fill = Type )) +
geom_bar() +
xlab("Type of Store") + ylab("Count of Store")+ facet_grid(~ Year)
We had a look at the counts of the different store types. Type A stores are the most common.
YearlySales<-Sales[,sum(Weekly_Sales,na.rm = TRUE),.(Store,Type,Size)]
setnames(YearlySales,"V1","Yearly_Sales")
ggplot(YearlySales,aes(x=Size,y=Yearly_Sales)) +
geom_point()+
geom_smooth(method=lm,color="RED",se = FALSE)+
scale_x_continuous(waiver()) + scale_y_continuous(waiver())
We expected a strong relationship between sales and store sizes, and we checked it on this graph. As you can see, the relationship is positive and strong.
Sales[,Month:=as.numeric(substring(as.character(Date),4,5))]
Sales[,Year:=as.numeric(substring(as.character(Date),7,10))]
Sales$Date<-dmy(Sales$Date)
Sales[,Week:=week(Date)]
Sales[,YearWeek:=as.numeric(ifelse(Week<10,paste(Year,"0",Week,sep=""),paste(Year,Week,sep="")))]
Sales[,SizeC:=ifelse(Type=="A",1,ifelse(Type=="B",2,3))]
We created Year, Month, Week, YearWeek, and SizeC columns by using the data.table package.
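The YearWeek key could also be built arithmetically, which avoids the string padding; YearWeek2 is a hypothetical column:
Sales[,YearWeek2:=Year*100+Week] # e.g. year 2011, week 4 -> 201104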
MonthlySales<-Sales[,sum(Weekly_Sales,na.rm = TRUE),.(Store,Month)]
setnames(MonthlySales,"V1","Monthly_Sales")
MonthlySales[,TotalSales:=sum(Monthly_Sales,na.rm = TRUE),.(Month)]
MonthlySales[,SalesPercantage:=Monthly_Sales*1.0/TotalSales]
summary(MonthlySales[,SalesPercantage])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.004148 0.011775 0.020600 0.022222 0.030277 0.046391
We created the MonthlySales table to see the monthly sales over the three years and to calculate each store's percentage of total monthly sales.
We wanted to cluster the stores that share the same sales patterns.
First, we computed each store's percentage of the total sales in the given month; then we visualized the clustering.
Clusno<-5 # number of clusters
# store/month matrix created with the dcast function
CM=dcast.data.table(MonthlySales,Store~Month,value.var="SalesPercantage")
S<-colnames(CM)
CM<-data.frame(CM)
CM[is.na(CM)]=0 # NAs filled with 0
colnames(CM)<-S
CM<-data.table(CM)
head(CM)
## Store 1 2 3 4 5
## 1: 1 0.033685490 0.034296378 0.034381160 0.033427863 0.033215731
## 2: 2 0.041689678 0.042855724 0.041703736 0.040708568 0.040735639
## 3: 3 0.008887591 0.009148494 0.008936444 0.008553617 0.008462286
## 4: 4 0.046157386 0.046072825 0.044761484 0.043421522 0.043781330
## 5: 5 0.006880421 0.006764307 0.006863682 0.007001230 0.006866335
## 6: 6 0.031829294 0.032476623 0.034275101 0.033896024 0.032960172
## 6 7 8 9 10 11
## 1: 0.032603054 0.031855466 0.032815005 0.033401891 0.033291832 0.032324483
## 2: 0.040681090 0.039530651 0.040181861 0.039947257 0.040663416 0.040842843
## 3: 0.008370371 0.008119139 0.008188921 0.008501163 0.008614132 0.008645907
## 4: 0.043081218 0.043044542 0.044210832 0.045114239 0.045680079 0.045114709
## 5: 0.006695853 0.006456076 0.006519913 0.006863876 0.006715309 0.007007696
## 6: 0.034720069 0.034801612 0.032636899 0.031545752 0.031298558 0.032979546
## 12
## 1: 0.031047761
## 2: 0.041548796
## 3: 0.008398704
## 4: 0.044399801
## 5: 0.006519450
## 6: 0.034087565
set.seed(7)
CM[,clusno:=kmeans(CM[,c(2:ncol(CM)),with=F],Clusno)$cluster]
clusters<-CM[,list(Store,clusno)] # clusters created with k-means
setkey(clusters,Store)
setkey(MonthlySales,Store)
MonthlySales<-clusters[MonthlySales] #MonthlySales and Clusters tables were merged.
SalesP<-dcast.data.table(MonthlySales,Month~Store,value.var="SalesPercantage") # store-month matrix created and filled with sales percentages
MonthlySales$Month <- factor(MonthlySales$Month)
MonthlySales$Store <- factor(MonthlySales$Store)
MonthlySales$clusno <- factor(MonthlySales$clusno)
# Month, Store, and clusno variables were converted to factors
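We fixed the number of clusters at 5 up front. A quick elbow check over the same store-by-month matrix could support that choice; this sketch assumes the 12 month columns sit in positions 2 to 13 of CM:
wss <- sapply(1:10, function(k) {
set.seed(7)
kmeans(CM[, 2:13, with = FALSE], centers = k, nstart = 10)$tot.withinss
})
plot(1:10, wss, type = "b", xlab = "Number of clusters k",
ylab = "Total within-cluster sum of squares")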
# plotting reference lines across each facet:
referenceLines <- MonthlySales # copy of the data used as the underlayer
colnames(referenceLines)[2] <- "groupVar"
zp <- ggplot(MonthlySales,
aes(x = Month, y = SalesPercantage))
zp <- zp + geom_line(data = referenceLines, # Plotting the "underlayer"
aes(x = Month, y = SalesPercantage, group = groupVar),
colour = "BLUE", alpha = 1/2, size = 1/2)
zp <- zp + geom_line(size = 1) # Drawing the "overlayer"
zp <- zp + facet_wrap(~ Store)
zp <- zp + theme_bw()
ggplotly(zp)
plot<-ggplot(MonthlySales, aes(x=Month, y=SalesPercantage, color=clusno, group=Store)) +
geom_line()
ggplotly(plot)
In this graph, every line refers to a store and shows its monthly sales trend; lines with the same colour belong to the same cluster. For the forecasting part we re-read the raw CSV files with read.csv, so that the numeric columns are parsed as numbers rather than factors.
Features <- read.csv("Features data set.csv", header = TRUE, sep = ",")
Sales <- read.csv("sales data-set.csv", header = TRUE, sep = ",")
Stores <- read.csv("stores data-set.csv", header = TRUE, sep = ",")
Sales<-data.table(Sales)
Features<-data.table(Features)
Stores<-data.table(Stores)
Sales<-Sales[,list(Store,Dept,Date,Weekly_Sales)]
setkey(Sales,Store,Date)
setkey(Features,Store,Date)
Sales<-Features[Sales]
setkey(Sales,Store)
setkey(Stores,Store)
Sales<-Stores[Sales]
str(Sales)
## Classes 'data.table' and 'data.frame': 421570 obs. of 16 variables:
## $ Store : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Type : Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 1 ...
## $ Size : int 151315 151315 151315 151315 151315 151315 151315 151315 151315 151315 ...
## $ Date : Factor w/ 143 levels "01/04/2011","01/06/2012",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Temperature : num 59.2 59.2 59.2 59.2 59.2 ...
## $ Fuel_Price : num 3.52 3.52 3.52 3.52 3.52 ...
## $ MarkDown1 : num NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown2 : num NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown3 : num NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown4 : num NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown5 : num NA NA NA NA NA NA NA NA NA NA ...
## $ CPI : num 215 215 215 215 215 ...
## $ Unemployment: num 7.68 7.68 7.68 7.68 7.68 ...
## $ IsHoliday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Dept : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Weekly_Sales: num 20398 46992 8734 34452 23599 ...
## - attr(*, "sorted")= chr "Store"
## - attr(*, ".internal.selfref")=<externalptr>
summary(Sales)
## Store Type Size Date
## Min. : 1.0 A:215478 Min. : 34875 23/12/2011: 3027
## 1st Qu.:11.0 B:163495 1st Qu.: 93638 25/11/2011: 3021
## Median :22.0 C: 42597 Median :140167 16/12/2011: 3013
## Mean :22.2 Mean :136728 09/12/2011: 3010
## 3rd Qu.:33.0 3rd Qu.:202505 17/02/2012: 3007
## Max. :45.0 Max. :219622 30/12/2011: 3003
## (Other) :403489
## Temperature Fuel_Price MarkDown1 MarkDown2
## Min. : -2.06 Min. :2.472 Min. : 0.27 Min. : -265.8
## 1st Qu.: 46.68 1st Qu.:2.933 1st Qu.: 2240.27 1st Qu.: 41.6
## Median : 62.09 Median :3.452 Median : 5347.45 Median : 192.0
## Mean : 60.09 Mean :3.361 Mean : 7246.42 Mean : 3334.6
## 3rd Qu.: 74.28 3rd Qu.:3.738 3rd Qu.: 9210.90 3rd Qu.: 1926.9
## Max. :100.14 Max. :4.468 Max. :88646.76 Max. :104519.5
## NA's :270889 NA's :310322
## MarkDown3 MarkDown4 MarkDown5 CPI
## Min. : -29.10 Min. : 0.22 Min. : 135.2 Min. :126.1
## 1st Qu.: 5.08 1st Qu.: 504.22 1st Qu.: 1878.4 1st Qu.:132.0
## Median : 24.60 Median : 1481.31 Median : 3359.4 Median :182.3
## Mean : 1439.42 Mean : 3383.17 Mean : 4629.0 Mean :171.2
## 3rd Qu.: 103.99 3rd Qu.: 3595.04 3rd Qu.: 5563.8 3rd Qu.:212.4
## Max. :141630.61 Max. :67474.85 Max. :108519.3 Max. :227.2
## NA's :284479 NA's :286603 NA's :270138
## Unemployment IsHoliday Dept Weekly_Sales
## Min. : 3.879 Mode :logical Min. : 1.00 Min. : -4989
## 1st Qu.: 6.891 FALSE:391909 1st Qu.:18.00 1st Qu.: 2080
## Median : 7.866 TRUE :29661 Median :37.00 Median : 7612
## Mean : 7.960 Mean :44.26 Mean : 15981
## 3rd Qu.: 8.572 3rd Qu.:74.00 3rd Qu.: 20206
## Max. :14.313 Max. :99.00 Max. :693099
##
Creating new variables (Month, Year)
Sales[,Month:=as.numeric(substring(as.character(Date),4,5))]
Sales[,Year:=as.numeric(substring(as.character(Date),7,10))]
Converting the Date variable into Date format with the lubridate package
Sales$Date<-dmy(Sales$Date)
str(Sales$Date)
## Date[1:421570], format: "2011-04-01" "2011-04-01" "2011-04-01" "2011-04-01" "2011-04-01" ...
Creating new variables (Week, YearWeek, SizeC)
Sales[,Week:=week(Date)]
Sales[,YearWeek:=as.numeric(ifelse(Week<10,paste(Year,"0",Week,sep=""),paste(Year,Week,sep="")))]
Sales[,SizeC:=ifelse(Type=="A",1,ifelse(Type=="B",2,3))]
ggplot(Sales[Type=="A"& Store==4,],
aes(x = YearWeek, y = Weekly_Sales, color = factor(Type))) +
geom_smooth(size = 2) + facet_wrap(~Store) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplot(Sales[Type=="B" & Store==10,],
aes(x = YearWeek, y = Weekly_Sales, color = factor(Type))) +
geom_smooth(size = 5) + facet_wrap(~Store) +
theme(axis.text.x = element_text(angle = 90))
ggplot(Sales[Type=="C"& Store==30,],
aes(x = YearWeek, y = Weekly_Sales, color = factor(Type))) +
geom_smooth(size = 2) + facet_wrap(~Store) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Aggregating the sales to week and store level
SalesWeeklybyStores<-Sales[,sum(Weekly_Sales),.(Year,Month,Week,YearWeek,Store,Type,Size,Temperature,Fuel_Price,CPI,Unemployment,IsHoliday)]
Converting the IsHoliday variable from logical to binary
SalesWeeklybyStores[,IsHoliday:=ifelse(IsHoliday=="FALSE",0,1)]
setnames(SalesWeeklybyStores,"V1","WeeklySales")
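Because IsHoliday was logical, as.integer() would have produced the same 0/1 coding in one step; holiday_flag is a throwaway name for illustration:
holiday_flag <- as.integer(SalesWeeklybyStores$IsHoliday) # identical 0/1 vector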
str(SalesWeeklybyStores)
## Classes 'data.table' and 'data.frame': 6435 obs. of 13 variables:
## $ Year : num 2011 2012 2011 2010 2012 ...
## $ Month : num 4 6 7 10 3 4 7 9 12 2 ...
## $ Week : int 14 22 27 40 9 14 27 36 49 5 ...
## $ YearWeek : num 201114 201222 201127 201040 201209 ...
## $ Store : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Type : Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 1 ...
## $ Size : int 151315 151315 151315 151315 151315 151315 151315 151315 151315 151315 ...
## $ Temperature : num 59.2 78 85.5 71.9 61 ...
## $ Fuel_Price : num 3.52 3.5 3.52 2.6 3.63 ...
## $ CPI : num 215 222 215 212 221 ...
## $ Unemployment: num 7.68 7.14 7.96 7.84 7.35 ...
## $ IsHoliday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ WeeklySales : num 1495065 1624478 1488538 1453330 1688421 ...
## - attr(*, ".internal.selfref")=<externalptr>
Creating new features with one-week lags of sales, fuel price, and temperature
setkey(SalesWeeklybyStores,Store,YearWeek)
SalesWeeklybyStores[, Weekly_Sales_lag1:=dplyr::lag(WeeklySales,1),.(Store)]
SalesWeeklybyStores[, Fuel_Price_lag1:=dplyr::lag(Fuel_Price,1),.(Store)]
SalesWeeklybyStores[, Temperature_lag1:=dplyr::lag(Temperature,1),.(Store)]
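data.table's own shift() computes the same one-week lags without reaching into dplyr; this sketch assumes the table is still keyed by Store and YearWeek:
SalesWeeklybyStores[, `:=`(
Weekly_Sales_lag1 = shift(WeeklySales, 1L),
Fuel_Price_lag1 = shift(Fuel_Price, 1L),
Temperature_lag1 = shift(Temperature, 1L)
), by = Store]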
Setting the train and test data sets
trainset = SalesWeeklybyStores[YearWeek<=201215 &YearWeek>201014,]
testset = SalesWeeklybyStores[YearWeek>201215,]
testset[, id := .I]
summary(trainset)
## Year Month Week YearWeek
## Min. :2010 Min. : 1.000 Min. : 1.00 Min. :201015
## 1st Qu.:2010 1st Qu.: 4.000 1st Qu.:14.00 1st Qu.:201041
## Median :2011 Median : 7.000 Median :27.00 Median :201116
## Mean :2011 Mean : 6.566 Mean :27.13 Mean :201104
## 3rd Qu.:2011 3rd Qu.:10.000 3rd Qu.:40.00 3rd Qu.:201142
## Max. :2012 Max. :12.000 Max. :53.00 Max. :201215
## Store Type Size Temperature Fuel_Price
## Min. : 1 A:2332 Min. : 34875 Min. : -2.06 Min. :2.513
## 1st Qu.:12 B:1802 1st Qu.: 70713 1st Qu.: 45.68 1st Qu.:2.903
## Median :23 C: 636 Median :126512 Median : 60.52 Median :3.308
## Mean :23 Mean :130288 Mean : 59.19 Mean :3.308
## 3rd Qu.:34 3rd Qu.:202307 3rd Qu.: 73.17 3rd Qu.:3.675
## Max. :45 Max. :219622 Max. :100.14 Max. :4.294
## CPI Unemployment IsHoliday WeeklySales
## Min. :126.1 Min. : 4.125 Min. :0.00000 Min. : 209986
## 1st Qu.:132.1 1st Qu.: 7.082 1st Qu.:0.00000 1st Qu.: 554929
## Median :182.6 Median : 7.943 Median :0.00000 Median : 954629
## Mean :170.8 Mean : 8.141 Mean :0.07547 Mean :1049687
## 3rd Qu.:211.7 3rd Qu.: 8.622 3rd Qu.:0.00000 3rd Qu.:1417454
## Max. :225.3 Max. :14.313 Max. :1.00000 Max. :3818686
## Weekly_Sales_lag1 Fuel_Price_lag1 Temperature_lag1
## Min. : 209986 Min. :2.513 Min. : -2.06
## 1st Qu.: 554929 1st Qu.:2.886 1st Qu.: 45.64
## Median : 955465 Median :3.290 Median : 60.45
## Mean :1050483 Mean :3.297 Mean : 59.14
## 3rd Qu.:1418001 3rd Qu.:3.664 3rd Qu.: 73.16
## Max. :3818686 Max. :4.294 Max. :100.14
Creating the performance measurement function (RMPSE)
RMPSE<- function(preds, dtrain) {
labels <- getinfo(dtrain, "label") # labels are log(sales + 1)
elab<-exp(as.numeric(labels))-1 # back-transform the actuals
epreds<-exp(as.numeric(preds))-1 # back-transform the predictions
err <- sqrt(mean((epreds/elab-1)^2)) # root mean squared percentage error
return(list(metric = "RMPSE", value = err))
}
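As a sanity check of the metric on the back-transformed scale, a forecast that is 10% above the actual everywhere yields an RMPSE of exactly 0.1:
actual <- c(100, 200, 300)
preds <- actual * 1.1 # every forecast 10% above the actual
sqrt(mean((preds / actual - 1)^2)) # 0.1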
Sampling 600 rows to hold out as a validation set
nrow(SalesWeeklybyStores)
## [1] 6435
h<-sample(nrow(trainset),600)
Creating the xgboost DMatrix objects. We added 1 to the sales before taking the log because log(0) is infinite.
dval<-xgb.DMatrix(data=data.matrix(trainset[h,]),label=log(trainset$WeeklySales+1)[h])
dtrain<-xgb.DMatrix(data=data.matrix(trainset[-h,]),label=log(trainset$WeeklySales+1)[-h])
watchlist<-list(val=dval,trainset=dtrain)
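R's log1p() and expm1() are numerically stable shorthands for log(x + 1) and exp(x) - 1, and could replace the manual transforms used here and in RMPSE:
all.equal(log(trainset$WeeklySales + 1), log1p(trainset$WeeklySales)) # TRUE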
Setting the parameters
param <- list( objective = "reg:linear", # renamed "reg:squarederror" in later xgboost versions
booster = "gbtree",
eta = 0.25, # 0.05, #0.1,
max_depth = 5,
subsample = 0.7,
colsample_bytree = 0.7
)
Finding the best seed number for maximum accuracy
selectseed<-function(a,b){
bestseed2 = data.table(s = 201, x = 1.1, y = 1.1)
for (s in a:b){
set.seed(s)
xgbmodel <- xgb.train( params = param,
data = dtrain,
nrounds = 30, #10, #20, #30, #100, # changed from 30
verbose = 0,
early.stop.round = 25,
watchlist = watchlist,
maximize = FALSE,
feval=RMPSE
)
testset<-data.frame(testset)
feature.names <- names(testset)[-c(13,17)]
forecasts <- exp(predict(xgbmodel, data.matrix(testset[,feature.names]))) -1
forecasts<-data.table(forecasts)
submission <- data.frame(id=testset$id, Sales=forecasts)
submission<-data.table(submission)
testset<-data.table(testset)
setkey(submission,id)
setkey(testset,id)
final<-testset[submission]
final[,Err:=abs(forecasts-WeeklySales)/WeeklySales]
final2<-final[,list(sum(WeeklySales),sum(abs(forecasts-WeeklySales)),sum(forecasts)),.(Store)]
setnames(final2,"V1","TWS")
setnames(final2,"V2","TE")
setnames(final2,"V3","TP")
x<-final2[,sum(TE)/sum(TWS)]
y<-final[,mean(Err)]
bestseed<-data.table(s,x,y)
bestseed2<-rbind(bestseed,bestseed2)
}
seed<-head(bestseed2[order(x)],1)[]$s
return(seed)
}
Searching for the best seed number between a and b
a=1
b=60
c = selectseed(a,b)
c
## [1] 43
Setting the seed number
set.seed(c)
Training the model
xgbmodel <- xgb.train( params = param,
data = dtrain,
nrounds = 30, #10, #20, #30, #100, # changed from 30
verbose = 1,
early.stop.round = 25,
watchlist = watchlist,
maximize = FALSE,
feval=RMPSE
)
## [1] val-RMPSE:0.999942 trainset-RMPSE:0.999945
## Multiple eval metrics are present. Will use trainset_RMPSE for early stopping.
## Will train until trainset_RMPSE hasn't improved in 25 rounds.
##
## [2] val-RMPSE:0.999333 trainset-RMPSE:0.999364
## [3] val-RMPSE:0.995876 trainset-RMPSE:0.996033
## [4] val-RMPSE:0.983601 trainset-RMPSE:0.984199
## [5] val-RMPSE:0.954389 trainset-RMPSE:0.955691
## [6] val-RMPSE:0.901670 trainset-RMPSE:0.903903
## [7] val-RMPSE:0.824705 trainset-RMPSE:0.827982
## [8] val-RMPSE:0.729323 trainset-RMPSE:0.733496
## [9] val-RMPSE:0.624876 trainset-RMPSE:0.629821
## [10] val-RMPSE:0.521029 trainset-RMPSE:0.526172
## [11] val-RMPSE:0.424459 trainset-RMPSE:0.429680
## [12] val-RMPSE:0.339660 trainset-RMPSE:0.344853
## [13] val-RMPSE:0.267867 trainset-RMPSE:0.272932
## [14] val-RMPSE:0.209317 trainset-RMPSE:0.213797
## [15] val-RMPSE:0.162793 trainset-RMPSE:0.167009
## [16] val-RMPSE:0.126050 trainset-RMPSE:0.129664
## [17] val-RMPSE:0.097588 trainset-RMPSE:0.100406
## [18] val-RMPSE:0.075705 trainset-RMPSE:0.078255
## [19] val-RMPSE:0.059313 trainset-RMPSE:0.061563
## [20] val-RMPSE:0.046990 trainset-RMPSE:0.048967
## [21] val-RMPSE:0.038121 trainset-RMPSE:0.039693
## [22] val-RMPSE:0.032100 trainset-RMPSE:0.033313
## [23] val-RMPSE:0.027401 trainset-RMPSE:0.028277
## [24] val-RMPSE:0.023985 trainset-RMPSE:0.024712
## [25] val-RMPSE:0.021600 trainset-RMPSE:0.022039
## [26] val-RMPSE:0.020632 trainset-RMPSE:0.020770
## [27] val-RMPSE:0.019103 trainset-RMPSE:0.019047
## [28] val-RMPSE:0.018530 trainset-RMPSE:0.018302
## [29] val-RMPSE:0.017533 trainset-RMPSE:0.017323
## [30] val-RMPSE:0.017169 trainset-RMPSE:0.016820
testset<-data.frame(testset)
Setting the feature names for the test data set (dropping the WeeklySales and id columns)
feature.names <- names(testset)[-c(13,17)]
feature.names
## [1] "Year" "Month" "Week"
## [4] "YearWeek" "Store" "Type"
## [7] "Size" "Temperature" "Fuel_Price"
## [10] "CPI" "Unemployment" "IsHoliday"
## [13] "Weekly_Sales_lag1" "Fuel_Price_lag1" "Temperature_lag1"
Generating the forecasts with the predict function; the exp function converts the values back from the log scale
forecasts <- exp(predict(xgbmodel, data.matrix(testset[,feature.names]))) -1
forecasts<-data.table(forecasts)
Creating the final table, which contains the forecasts and the actual values
submission <- data.frame(id=testset$id, Sales=forecasts)
submission<-data.table(submission)
testset<-data.table(testset)
setkey(submission,id)
setkey(testset,id)
final<-testset[submission]
Measuring the performance
final[,Err:=abs(forecasts-WeeklySales)/WeeklySales]
final2<-final[,list(sum(WeeklySales),sum(abs(forecasts-WeeklySales)),sum(forecasts)),.(Store)]
final3<-final[,list(sum(WeeklySales),sum(abs(forecasts-WeeklySales)),sum(forecasts)),.(Store,YearWeek)]
setnames(final2,"V1","TWS") #Total Weekly Sales
setnames(final2,"V2","TE") #Total Errors
setnames(final2,"V3","TF") #Total Forecasts
setnames(final3,"V1","TWS") #Total Weekly Sales
setnames(final3,"V2","TE") #Total Errors
setnames(final3,"V3","TF") #Total Forecasts
Store- and week-level percentage errors of the forecasts
final3[,PE:=sum(TE)/sum(TWS),.(Store,YearWeek,TWS,TE,TF)]
final3[,list(Store,PE)]
## Store PE
## 1: 1 0.047429468
## 2: 1 0.010801324
## 3: 1 0.131069442
## 4: 1 0.016546445
## 5: 1 0.001351193
## ---
## 1256: 45 0.016677900
## 1257: 45 0.014999523
## 1258: 45 0.012793222
## 1259: 45 0.009667758
## 1260: 45 0.049755044
plot1<-ggplot(final3, aes(YearWeek)) +
geom_line(aes(y = PE, colour = "PE")) +
facet_wrap(~Store) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplotly(plot1)
plot2<-ggplot(final3,
aes(x = YearWeek, y = PE)) +
geom_smooth(size = 0.01) + facet_wrap(~Store) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplotly(plot2)
Store percentage errors of forecasts
final2[,PE:=sum(TE)/sum(TWS),.(Store)]
final2[,list(Store,PE)]
## Store PE
## 1: 1 0.05190645
## 2: 2 0.04212371
## 3: 3 0.03745798
## 4: 4 0.03448086
## 5: 5 0.04854706
## 6: 6 0.05098826
## 7: 7 0.05915099
## 8: 8 0.03512490
## 9: 9 0.04346671
## 10: 10 0.03231090
## 11: 11 0.05870505
## 12: 12 0.05148974
## 13: 13 0.03731989
## 14: 14 0.06185494
## 15: 15 0.04894954
## 16: 16 0.06033702
## 17: 17 0.07136176
## 18: 18 0.05285955
## 19: 19 0.05191724
## 20: 20 0.04168502
## 21: 21 0.04170725
## 22: 22 0.04721929
## 23: 23 0.05292478
## 24: 24 0.06237871
## 25: 25 0.03578157
## 26: 26 0.04917069
## 27: 27 0.05101435
## 28: 28 0.11176581
## 29: 29 0.05832922
## 30: 30 0.02661010
## 31: 31 0.03638163
## 32: 32 0.03881713
## 33: 33 0.07282118
## 34: 34 0.03125023
## 35: 35 0.05379618
## 36: 36 0.04295618
## 37: 37 0.03514427
## 38: 38 0.06851168
## 39: 39 0.04595508
## 40: 40 0.07017110
## 41: 41 0.04371120
## 42: 42 0.08121408
## 43: 43 0.04285009
## 44: 44 0.03047717
## 45: 45 0.04182988
## Store PE
Total percentage error of the forecasts over the whole test data
final[,mean(Err)]
## [1] 0.04928789
Printing the top 10 nodes of the model
model <- xgb.dump(xgbmodel, with.stats = T)
model[1:10]
## [1] "booster[0]"
## [2] "0:[f12<797364] yes=1,no=2,missing=1,gain=555.558,cover=2873"
## [3] "1:leaf=3.14284,cover=1103"
## [4] "2:leaf=3.40188,cover=1770"
## [5] "booster[1]"
## [6] "0:[f12<1.01431e+006] yes=1,no=2,missing=1,gain=357.353,cover=2937"
## [7] "1:[f12<460641] yes=3,no=4,missing=3,gain=37.5645,cover=1576"
## [8] "3:leaf=2.27091,cover=461"
## [9] "4:leaf=2.42762,cover=1115"
## [10] "2:leaf=2.58059,cover=1361"
Plotting the top 5 features of the importance matrix
names <- dimnames(data.matrix(testset[,-c(13,17)]))[[2]]
importance_matrix <- xgb.importance(names, model = xgbmodel)
xgb.plot.importance(importance_matrix[1:5,])
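The underlying importance table can also be printed; for tree boosters it reports gain, cover, and frequency per feature:
head(importance_matrix) # Feature, Gain, Cover, Frequency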
Plotting the first 10 trees
xgb.plot.tree(model = xgbmodel, n_first_tree = 10)
Plotting the actual sales and forecasts
z<-ggplot(final, aes(YearWeek)) +
geom_line(aes(y = WeeklySales, colour = "WeeklySales")) +
geom_line(aes(y = forecasts, colour = "forecasts")) +
facet_wrap(~Store) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplotly(z)
References
[1] Retail Data Analytics. (2017, August). Retrieved from https://www.kaggle.com/manjeetsingh/retaildataset
[2] XGBoost w/ Custom Error Function. Retrieved from https://www.kaggle.com/chipmonkey/obligatory-xgboost-example