Car Sales Data

Discover the Data

#Before jumping to analysis part, let's have a look into data

data2 <- readRDS("car_data_aggregate.rds")
str(data2)

## Classes 'tbl_df', 'tbl' and 'data.frame':    1490 obs. of  12 variables:
##  $ brand_name : chr  "ALFA ROMEO" "ASTON MARTIN" "AUDI" "BENTLEY" ...
##  $ auto_dom   : num  0 0 0 0 0 0 0 0 0 632 ...
##  $ auto_imp   : num  13 2 350 0 158 ...
##  $ auto_total : num  13 2 350 0 158 ...
##  $ comm_dom   : num  0 0 0 0 0 0 0 0 0 789 ...
##  $ comm_imp   : num  0 0 0 0 0 197 319 0 0 199 ...
##  $ comm_total : num  0 0 0 0 0 197 319 0 0 988 ...
##  $ total_dom  : num  0 0 0 0 0 ...
##  $ total_imp  : num  13 2 350 0 158 331 1460 9 3 256 ...
##  $ total_total: num  13 2 350 0 158 ...
##  $ year       : num  2018 2018 2018 2018 2018 ...
##  $ month      : num  9 9 9 9 9 9 9 9 9 9 ...

Clean the Data

#It's always safe to check whether there is any duplicated/wrongly named entries when working with such a combined data set
distinct(data2[1])

## # A tibble: 51 x 1
##    brand_name  
##    <chr>       
##  1 ALFA ROMEO  
##  2 ASTON MARTIN
##  3 AUDI        
##  4 BENTLEY     
##  5 BMW         
##  6 CITROEN     
##  7 DACIA       
##  8 DS          
##  9 FERRARI     
## 10 FIAT        
## # ... with 41 more rows

#I've discovered that we should remove the row starts with "ODD..." & "TOPLAM" and also there are 2 different "ASTON MARTIN", we should give the same name to them
data2 <- data2[!grepl("ODD, ", data2$brand_name),]
data2 <- data2[!grepl("TOPLAM:", data2$brand_name),]
data2$brand_name <- str_replace(data2$brand_name, "ASTON MARTİN", "ASTON MARTIN")
distinct(data2[1])

## # A tibble: 48 x 1
##    brand_name  
##    <chr>       
##  1 ALFA ROMEO  
##  2 ASTON MARTIN
##  3 AUDI        
##  4 BENTLEY     
##  5 BMW         
##  6 CITROEN     
##  7 DACIA       
##  8 DS          
##  9 FERRARI     
## 10 FIAT        
## # ... with 38 more rows

Add One More Dimension

#I'm curios about how the brand shraes will change when we have 3 years ranged dataset

data2 %>%
  select(brand_name, total_total) %>%
  group_by(brand_name) %>%
  summarize(sales=sum(total_total)) %>%
  mutate(share = round(sales/sum(sales),2)) %>%
  arrange(desc(share))

## # A tibble: 48 x 3
##    brand_name     sales share
##    <chr>          <dbl> <dbl>
##  1 RENAULT       318500  0.14
##  2 FIAT          275900  0.12
##  3 FORD          271157  0.12
##  4 VOLKSWAGEN    262041  0.11
##  5 HYUNDAI       131666  0.06
##  6 DACIA         117978  0.05
##  7 OPEL          117122  0.05
##  8 TOYOTA        106950  0.05
##  9 MERCEDES-BENZ  93944  0.04
## 10 NISSAN         82756  0.04
## # ... with 38 more rows

Renault is the market leader in unit sales. Fiat, Ford and Volkswagen follows it and shares are so close to each other… Also, it’s worth to note that there is sharp decline in unit sales 4th and 5th player which are Volkswagen and Hyundai…

Plotting Part

#Then it would be useful to get insight on how unit sales changes during the yar. So plotting monthly sales will give us an understanding on seasonality of care sales
#But graph wouldn't look beautiful if we include all brands, i think including only top 4 brands is enought. We've already known that top 4 brands are: 
#Renault, Fiat, Ford and Volswagen. For 2016 and 2017 respectively

#2016
data3 <- data2 %>%
  filter(data2$year == 2016 & brand_name %in% c("RENAULT", "FIAT", "FORD", "VOLKSWAGEN"))
ggplot(data=data3, aes(x=month, y=total_total, color=brand_name)) + theme_bw() + 
    theme(legend.key = element_blank()) +
    geom_point(size=4, alpha=0.5) + geom_smooth(se=FALSE) +
    labs(title="Monthly Sales in 2016 for Top 4 Brands")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

#2017
data4 <- data2 %>%
  filter(data2$year == 2017 & brand_name %in% c("RENAULT", "FIAT", "FORD", "VOLKSWAGEN"))
ggplot(data=data4, aes(x=month, y=total_total, color=brand_name)) + theme_bw() + 
    theme(legend.key = element_blank()) +
    geom_point(size=4, alpha=0.5) + geom_smooth(se=FALSE) +
    labs(title="Monthly Sales in 2017 for Top 4 Brands")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

It’s clear that firs quarter is the low-season for car sales and sales peak at the end of the year. In December, brands sale more than tripled compared to January or February.

Car Sales Data_v2

Ugurcan Kalenderoglu

November, 25 2018

Discover the Data

Clean the Data

Add One More Dimension

Plotting Part