The dataset subject to our analysis includes retail sales of automobiles and commercial vehicles in Turkey based on brand in monthly basis. The dataset includes data from Jan’16 to Sep’19. The dataset incorporates 12 columns. Details of the column names & description is are represented below:
In order to perform a healthy analysis, the dataset has been adjusted in three aspects: 1. Delete the invalid row that contains a disclaimer 2. Delete the rows brand_name = “TOPLAM:” 3. Change of “ASTON MARTIN” to “ASTON MARTIN” in order to merge the two different brand names for the same brand.
library(tidyverse)
## -- Attaching packages ---------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## <U+221A> ggplot2 3.0.0 <U+221A> purrr 0.2.5
## <U+221A> tibble 1.4.2 <U+221A> dplyr 0.7.6
## <U+221A> tidyr 0.8.1 <U+221A> stringr 1.3.1
## <U+221A> readr 1.1.1 <U+221A> forcats 0.3.0
## -- Conflicts ------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
setwd('C:/Users/Turhal/Documents/Data Science/Big Data Essentials')
raw_data = read_rds("car_data_aggregate.rds")
raw_data = subset(raw_data,subset=brand_name != 'TOPLAM:' & !startsWith(brand_name, "ODD"))
raw_data$brand_name = str_replace(raw_data$brand_name, "MARTIN", "MARTIN")
#raw_data = raw_data %>% slice(-46)
raw_data[is.na(raw_data)] <- 0
head(raw_data)
## # A tibble: 6 x 12
## brand_name auto_dom auto_imp auto_total comm_dom comm_imp comm_total
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALFA ROMEO 0 13 13 0 0 0
## 2 ASTON MAR~ 0 2 2 0 0 0
## 3 AUDI 0 350 350 0 0 0
## 4 BENTLEY 0 0 0 0 0 0
## 5 BMW 0 158 158 0 0 0
## 6 CITROEN 0 134 134 0 197 197
## # ... with 5 more variables: total_dom <dbl>, total_imp <dbl>,
## # total_total <dbl>, year <dbl>, month <dbl>
Firstly, let’s see the top performers for the year 2016. In order to do that, we need to select the five most selling brands in total.
raw_data1 =
raw_data %>%
filter(auto_total > 0 & comm_total > 0) %>%
select(brand_name,auto_total,comm_total,total_total,year) %>%
arrange(desc(year),desc(total_total)) %>%
filter(brand_name != "TOPLAM:") %>%
filter(brand_name != "ODD*") %>%
filter(year == "2016") %>%
group_by(brand_name) %>%
summarize(year_total = sum(total_total)) %>%
arrange(desc(year_total))
raw_data1 %>% top_n(5) %>%
mutate(prop=percent(year_total / sum(year_total))) %>%
ggplot(data = ., aes(x="", y=prop, fill=brand_name))+ labs(x = "", y = "", title = "Top 5 Car Sales per Brand in 2016") +
geom_bar(width = 1, stat = "identity") + coord_polar("y", start=0) +
theme(plot.title=element_text(hjust=0.5,face='bold',size=16)) +
theme_classic() + theme(axis.line = element_blank(), axis.text = element_blank(), axis.ticks = element_blank()) +
geom_text(aes(label=prop),position=position_stack(vjust = 0.5)) + guides(fill=guide_legend(title="Brand Name"))
## Selecting by year_total
Secondly and lastly, I want to compare total annual automobile sales of FIAT and TOYOTA:
#Total Annual Automobile Sales of FIAT and TOYOTA
raw_data %>%
group_by(brand_name, year) %>%
filter(brand_name %in% c("FIAT", "TOYOTA")) %>%
summarize(yearly_auto_total = sum(auto_total))%>%
ggplot(data=., aes(x=year, y=yearly_auto_total, fill=brand_name)) +
geom_bar(stat="identity", position=position_dodge())+
labs(x="Years", y = "Number Of Sales", fill="Brands") +
theme_minimal()
We can see that FIAT was in front by a little margin in 2016 but that margin goes higher in 2017 as TOYOTA had a dramatical decrease in sales and adversely FIAT had a decent increase. And this year looks a lot like 2016 in comparison.