Introduction & Manipulation of Dataset

First, I loaded the dataset and removed the rows that contain wrong information (rows whose brand name starts with "ODD" or "TOPLAM"). I also created a combined ‘Date’ variable from the year and month columns, set to the last day of each month.

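library(dplyr)    # %>%, filter(), arrange(), group_by(), summarise()
library(tidyr)    # spread()
library(zoo)      # as.yearmon()
library(ggplot2)  # theme_classic()
library(GGally)   # ggpairs()
rds_data <- readRDS("C:/Users/kerim.acar/Desktop/BDA/BDA503/car_data_aggregate.rds")
rds_data <- rds_data %>%
  filter(!(startsWith(brand_name, "ODD") | startsWith(brand_name, "TOPLAM")))

# frac = 1 sets each Date to the last day of its month
rds_data$Date <- as.Date(as.yearmon(paste(rds_data$year, rds_data$month), "%Y %m"), frac = 1)
rds_data <- rds_data %>%
  arrange(Date)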
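
To make the date step concrete, here is a minimal sketch of what the `as.yearmon()` and `as.Date(..., frac = 1)` combination does, assuming the zoo package is installed. The `year` and `month` vectors are made-up inputs for illustration, not rows from the dataset.

```r
library(zoo)  # provides as.yearmon() and its as.Date() method

# Hypothetical year/month pairs, for illustration only
year  <- c(2016, 2016, 2018)
month <- c(1, 2, 9)

# Parse "YYYY M" strings into year-month objects
ym <- as.yearmon(paste(year, month), "%Y %m")

# frac = 1 maps each month to its last day; frac = 0 (the default) to its first
end_of_month   <- as.Date(ym, frac = 1)
start_of_month <- as.Date(ym, frac = 0)

print(end_of_month)
print(start_of_month)
```

Using the last day of the month keeps each observation inside the month it summarises, which is why the dates in the tables further down end in 29, 30 or 31.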

Total Sales by Brands

I wanted to see the best and the worst sellers in the time period between January 2016 and September 2018.

rds_total_total <- rds_data %>% 
  group_by(brand_name) %>% 
  summarise(total_total = sum(total_total)) %>%
  arrange(desc(total_total))
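
Since the summary is sorted in descending order, the best and worst sellers sit at the two ends of the table. A small sketch with a made-up data frame (`toy_sales` and its figures are hypothetical, not taken from the dataset) shows one way to read them off:

```r
library(dplyr)

# Hypothetical monthly sales, for illustration only
toy_sales <- data.frame(
  brand_name  = c("A", "B", "C", "A", "B", "C"),
  total_total = c(100, 20, 5, 150, 30, 10),
  stringsAsFactors = FALSE
)

# Same aggregation as above: total sales per brand, largest first
toy_totals <- toy_sales %>%
  group_by(brand_name) %>%
  summarise(total_total = sum(total_total)) %>%
  arrange(desc(total_total))

best_sellers  <- head(toy_totals, 2)  # first rows = highest totals
worst_sellers <- tail(toy_totals, 1)  # last row   = lowest total

print(best_sellers)
print(worst_sellers)
```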

Top, Average and Bottom Sellers Selection

I wanted to analyze the sales performance of brands at different levels, so I included 3 top sellers (RENAULT, FIAT, FORD), 2 average sellers (AUDI, BMW) and 1 bottom seller (VOLVO).

selected_brands <- c("RENAULT", "FIAT", "FORD", "AUDI", "BMW", "VOLVO")
selected_data <- subset(rds_data, brand_name %in% selected_brands)
# Keep brand_name, total_total and Date (columns 1, 10 and 13), selected by name
selected_data <- selected_data[, c("brand_name", "total_total", "Date")]
head(selected_data)
## # A tibble: 6 x 3
##   brand_name total_total Date      
##   <chr>            <dbl> <date>    
## 1 AUDI               911 2016-01-31
## 2 BMW                496 2016-01-31
## 3 FIAT              3843 2016-01-31
## 4 FORD              3770 2016-01-31
## 5 RENAULT           4519 2016-01-31
## 6 VOLVO              187 2016-01-31

Correlation Between Brands

I reshaped the dataset into wide format, with the selected brands as columns, the dates as rows, and total sales as the cell values.

selected_data<- selected_data[,c("Date","brand_name","total_total")]
new<-selected_data %>%
  spread(brand_name,total_total)
head(new)
## # A tibble: 6 x 7
##   Date        AUDI   BMW  FIAT  FORD RENAULT VOLVO
##   <date>     <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>
## 1 2016-01-31   911   496  3843  3770    4519   187
## 2 2016-02-29  1200  2042  5502  6940    6162   194
## 3 2016-03-31  1455  2080  8400  9758   11293   311
## 4 2016-04-30  2148  2332  9521  9151   12075   335
## 5 2016-05-31  2352  3174  9636 10051   12741   512
## 6 2016-07-31   905  1856  5929  6370    6274   130
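
`spread()` still works, but the tidyr documentation marks it as superseded by `pivot_wider()`. As a hedged alternative, the same reshaping could be sketched as follows, assuming tidyr 1.0.0 or later. The `long` data frame is an illustrative stand-in that reuses the first two months and three of the brands from the output above.

```r
library(tidyr)

# Long format: one row per (Date, brand) pair, values copied from the table above
long <- data.frame(
  Date        = as.Date(rep(c("2016-01-31", "2016-02-29"), each = 3)),
  brand_name  = rep(c("AUDI", "BMW", "FIAT"), times = 2),
  total_total = c(911, 496, 3843, 1200, 2042, 5502),
  stringsAsFactors = FALSE
)

# Wide format: one row per Date, one column per brand
wide <- pivot_wider(long, names_from = brand_name, values_from = total_total)
print(wide)
```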

Then, I created a correlation matrix for the selected brands. As shown in the previous graph, the top sellers (FIAT, FORD, RENAULT) are highly correlated with each other.

ggpairs(new[,-1])+ theme_classic()
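
`ggpairs()` shows the correlations visually; to read the coefficients as plain numbers, base R's `cor()` can be applied to the same kind of wide table. The sketch below rebuilds only the six rows printed by `head(new)` above in a data frame called `first_rows` (a name introduced here for illustration), so its result covers those months alone and is not a substitute for the full analysis.

```r
# The six rows shown by head(new), without the Date column
first_rows <- data.frame(
  AUDI    = c(911, 1200, 1455, 2148, 2352, 905),
  BMW     = c(496, 2042, 2080, 2332, 3174, 1856),
  FIAT    = c(3843, 5502, 8400, 9521, 9636, 5929),
  FORD    = c(3770, 6940, 9758, 9151, 10051, 6370),
  RENAULT = c(4519, 6162, 11293, 12075, 12741, 6274),
  VOLVO   = c(187, 194, 311, 335, 512, 130)
)

# Pearson correlation between every pair of brands
cor_matrix <- cor(first_rows, method = "pearson")
print(round(cor_matrix, 2))
```

On the full table, `cor(new[, -1], use = "pairwise.complete.obs")` would be the analogous call; the `use` argument only matters if some brand has missing months.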