#Before jumping to analysis part, let's have a look into data
data2 <- readRDS("car_data_aggregate.rds")
str(data2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1490 obs. of 12 variables:
## $ brand_name : chr "ALFA ROMEO" "ASTON MARTIN" "AUDI" "BENTLEY" ...
## $ auto_dom : num 0 0 0 0 0 0 0 0 0 632 ...
## $ auto_imp : num 13 2 350 0 158 ...
## $ auto_total : num 13 2 350 0 158 ...
## $ comm_dom : num 0 0 0 0 0 0 0 0 0 789 ...
## $ comm_imp : num 0 0 0 0 0 197 319 0 0 199 ...
## $ comm_total : num 0 0 0 0 0 197 319 0 0 988 ...
## $ total_dom : num 0 0 0 0 0 ...
## $ total_imp : num 13 2 350 0 158 331 1460 9 3 256 ...
## $ total_total: num 13 2 350 0 158 ...
## $ year : num 2018 2018 2018 2018 2018 ...
## $ month : num 9 9 9 9 9 9 9 9 9 9 ...
#It's always safe to check whether there is any duplicated/wrongly named entries when working with such a combined data set
distinct(data2[1])
## # A tibble: 51 x 1
## brand_name
## <chr>
## 1 ALFA ROMEO
## 2 ASTON MARTIN
## 3 AUDI
## 4 BENTLEY
## 5 BMW
## 6 CITROEN
## 7 DACIA
## 8 DS
## 9 FERRARI
## 10 FIAT
## # ... with 41 more rows
#I've discovered that we should remove the row starts with "ODD..." & "TOPLAM" and also there are 2 different "ASTON MARTIN", we should give the same name to them
data2 <- data2[!grepl("ODD, ", data2$brand_name),]
data2 <- data2[!grepl("TOPLAM:", data2$brand_name),]
data2$brand_name <- str_replace(data2$brand_name, "ASTON MARTÄ°N", "ASTON MARTIN")
distinct(data2[1])
## # A tibble: 48 x 1
## brand_name
## <chr>
## 1 ALFA ROMEO
## 2 ASTON MARTIN
## 3 AUDI
## 4 BENTLEY
## 5 BMW
## 6 CITROEN
## 7 DACIA
## 8 DS
## 9 FERRARI
## 10 FIAT
## # ... with 38 more rows
#I'm curios about how the brand shraes will change when we have 3 years ranged dataset
data2 %>%
select(brand_name, total_total) %>%
group_by(brand_name) %>%
summarize(sales=sum(total_total)) %>%
mutate(share = round(sales/sum(sales),2)) %>%
arrange(desc(share))
## # A tibble: 48 x 3
## brand_name sales share
## <chr> <dbl> <dbl>
## 1 RENAULT 318500 0.14
## 2 FIAT 275900 0.12
## 3 FORD 271157 0.12
## 4 VOLKSWAGEN 262041 0.11
## 5 HYUNDAI 131666 0.06
## 6 DACIA 117978 0.05
## 7 OPEL 117122 0.05
## 8 TOYOTA 106950 0.05
## 9 MERCEDES-BENZ 93944 0.04
## 10 NISSAN 82756 0.04
## # ... with 38 more rows
Renault is the market leader in unit sales. Fiat, Ford and Volkswagen follows it and shares are so close to each other… Also, it’s worth to note that there is sharp decline in unit sales 4th and 5th player which are Volkswagen and Hyundai…
#Then it would be useful to get insight on how unit sales changes during the yar. So plotting monthly sales will give us an understanding on seasonality of care sales
#But graph wouldn't look beautiful if we include all brands, i think including only top 4 brands is enought. We've already known that top 4 brands are:
#Renault, Fiat, Ford and Volswagen. For 2016 and 2017 respectively
#2016
data3 <- data2 %>%
filter(data2$year == 2016 & brand_name %in% c("RENAULT", "FIAT", "FORD", "VOLKSWAGEN"))
ggplot(data=data3, aes(x=month, y=total_total, color=brand_name)) + theme_bw() +
theme(legend.key = element_blank()) +
geom_point(size=4, alpha=0.5) + geom_smooth(se=FALSE) +
labs(title="Monthly Sales in 2016 for Top 4 Brands")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
#2017
data4 <- data2 %>%
filter(data2$year == 2017 & brand_name %in% c("RENAULT", "FIAT", "FORD", "VOLKSWAGEN"))
ggplot(data=data4, aes(x=month, y=total_total, color=brand_name)) + theme_bw() +
theme(legend.key = element_blank()) +
geom_point(size=4, alpha=0.5) + geom_smooth(se=FALSE) +
labs(title="Monthly Sales in 2017 for Top 4 Brands")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
It’s clear that firs quarter is the low-season for car sales and sales peak at the end of the year. In December, brands sale more than tripled compared to January or February.