library(tidyverse)
## -- Attaching packages --------------------- tidyverse 1.2.1 --
## √ ggplot2 3.0.0     √ purrr   0.2.5
## √ tibble  1.4.2     √ dplyr   0.7.8
## √ tidyr   0.8.1     √ stringr 1.3.1
## √ readr   1.1.1     √ forcats 0.3.0
## -- Conflicts ------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(ggplot2)
data <- readRDS("car_data_aggregate.rds")
summary(data)
## brand_name auto_dom auto_imp auto_total
## Length:1259 Min. : 0 Min. : 0 Min. : 0
## Class :character 1st Qu.: 0 1st Qu.: 3 1st Qu.: 3
## Mode :character Median : 0 Median : 116 Median : 127
## Mean : 451 Mean : 1080 Mean : 1531
## 3rd Qu.: 0 3rd Qu.: 1344 3rd Qu.: 1928
## Max. :20665 Max. :48030 Max. :65799
## comm_dom comm_imp comm_total total_dom
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 0.0 Median : 0.0 Median : 0.0 Median : 0.0
## Mean : 248.6 Mean : 219.0 Mean : 467.6 Mean : 699.7
## 3rd Qu.: 0.0 3rd Qu.: 193.5 3rd Qu.: 220.0 3rd Qu.: 0.0
## Max. :11642.0 Max. :9848.0 Max. :19623.0 Max. :30440.0
## total_imp total_total year month
## Min. : 0 Min. : 0 Min. :2016 Min. : 1.000
## 1st Qu.: 8 1st Qu.: 13 1st Qu.:2016 1st Qu.: 4.000
## Median : 171 Median : 188 Median :2017 Median : 7.000
## Mean : 1299 Mean : 1998 Mean :2017 Mean : 6.441
## 3rd Qu.: 1566 3rd Qu.: 2182 3rd Qu.:2018 3rd Qu.: 9.000
## Max. :57355 Max. :85422 Max. :2018 Max. :12.000
The summary shows no missing values, so we can start analyzing the dataset right away. The dataset is a merge of monthly car data covering 2016 to 2018. I will first ask questions about the data and then look for the answers in it. Step by step, this investigation should bring some clarity to those questions and lead to a few conclusions.
Looking at the total values, which brands are the most valuable? To find out, I average auto_total per brand, sort the result with arrange(), and plot the brands whose average exceeds 3500.
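The no-missing-values observation can also be checked directly instead of reading it off the summary. The sketch below uses a small made-up data frame (`toy`, with one deliberate `NA`) as a stand-in for the real data; only the column names are borrowed from the summary above.

```r
# Toy stand-in for the real data frame; values are made up for illustration.
toy <- data.frame(
  brand_name = c("A", "B", "C"),
  auto_total = c(10, NA, 30),
  comm_total = c(5, 0, 2),
  stringsAsFactors = FALSE
)

# Count missing values per column; all zeros would mean nothing is missing.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Single TRUE/FALSE answer for the whole data frame.
has_missing <- any(na_counts > 0)
print(has_missing)
```

Running the same two lines on `data` should return zero for every column if the dataset really has no missing values.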
data %>%
  group_by(brand_name) %>%
  summarize(avgTotal = mean(auto_total)) %>%
  select(brand_name, avgTotal) %>%
  arrange(desc(avgTotal)) %>%
  filter(avgTotal > 3500) %>%
  ggplot(data = ., aes(x = brand_name, y = avgTotal,
                       fill = brand_name)) + geom_bar(stat = "identity")
By average auto_total, FIAT, HYUNDAI, RENAULT, OPEL and VOLKSWAGEN are the most valuable brands.
If we look at the commercial total:
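One detail about the bar charts: arrange() sorts the rows of the summary, but ggplot2 still draws a character x axis in alphabetical order. A minimal sketch of one way to sort the bars by value, using reorder() on made-up numbers for three hypothetical brands:

```r
library(dplyr)
library(ggplot2)

# Made-up monthly figures for three hypothetical brands.
toy <- data.frame(
  brand_name = rep(c("A", "B", "C"), each = 2),
  auto_total = c(100, 300, 900, 700, 400, 600),
  stringsAsFactors = FALSE
)

top_brands <- toy %>%
  group_by(brand_name) %>%
  summarize(avgTotal = mean(auto_total)) %>%
  arrange(desc(avgTotal))

# reorder() sorts the x axis by avgTotal (descending because of the minus sign);
# arrange() alone does not change the bar order in the plot.
p <- ggplot(top_brands, aes(x = reorder(brand_name, -avgTotal), y = avgTotal)) +
  geom_col() +
  labs(x = "brand_name", y = "avgTotal")
print(p)
```

geom_col() is shorthand for geom_bar(stat = "identity"), so the chart type stays the same.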
data %>%
  group_by(brand_name) %>%
  summarize(avgTotal = mean(comm_total)) %>%
  select(brand_name, avgTotal) %>%
  arrange(desc(avgTotal)) %>%
  filter(avgTotal > 1000) %>%
  ggplot(data = ., aes(x = brand_name, y = avgTotal,
                       fill = brand_name)) + geom_bar(stat = "identity")
Here we see that FIAT, FORD, RENAULT and VOLKSWAGEN have the highest average commercial totals.
Lastly, we should look at the overall totals of the car business (total_total):
data %>%
  group_by(brand_name) %>%
  summarize(avgTotal = mean(total_total)) %>%
  select(brand_name, avgTotal) %>%
  arrange(desc(avgTotal)) %>%
  filter(avgTotal > 4000) %>%
  ggplot(data = ., aes(x = brand_name, y = avgTotal,
                       fill = brand_name)) + geom_bar(stat = "identity")
This shows which brands have the highest overall totals. FIAT, FORD, HYUNDAI, RENAULT and VOLKSWAGEN have made the most valuable trade.
We have seen which companies made the most valuable trades, but we should also look at how the brands developed over time. For this, we compare the yearly averages and see which brands improved the most.
First, we look at how auto_total changes by year:
data %>%
  group_by(brand_name, year) %>%
  summarize(avgTotal = mean(auto_total)) %>%
  select(brand_name, avgTotal, year) %>%
  arrange(desc(avgTotal)) %>%
  filter(avgTotal > 3000) %>%
  ggplot(data = ., aes(x = year, y = avgTotal,
                       color = brand_name)) + geom_line()
Similarly, we can look at how the commercial totals change:
data %>%
  group_by(brand_name, year) %>%
  summarize(avgTotal = mean(comm_total)) %>%
  select(brand_name, avgTotal, year) %>%
  arrange(desc(avgTotal)) %>%
  filter(avgTotal > 2000) %>%
  ggplot(data = ., aes(x = year, y = avgTotal,
                       color = brand_name)) + geom_line()
Lastly, we can look at the overall totals. How have they changed? To investigate this, let's look at total_total by year:
data %>%
  group_by(brand_name, year) %>%
  summarize(avgTotal = mean(total_total)) %>%
  select(brand_name, avgTotal, year) %>%
  arrange(desc(avgTotal)) %>%
  filter(avgTotal > 3000) %>%
  ggplot(data = ., aes(x = year, y = avgTotal,
                       color = brand_name)) + geom_line()
The yearly averages of all three totals fall after 2017. This suggests an economic effect on the brands starting in 2017.
How do the monthly values of the most valuable brands change? We investigate this with the following chart:
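The size of the drop can be put into numbers by computing the year-over-year change per brand. The following is a minimal sketch on made-up yearly averages for two hypothetical brands; the column name `pct_change` is an illustrative choice, not something from the dataset.

```r
library(dplyr)

# Made-up yearly averages for two hypothetical brands.
toy <- data.frame(
  brand_name = rep(c("A", "B"), each = 3),
  year = rep(2016:2018, times = 2),
  avgTotal = c(100, 120, 90, 200, 210, 105),
  stringsAsFactors = FALSE
)

# Percentage change relative to the previous year, computed within each brand.
# The first year of every brand has no predecessor, so its change is NA.
yoy <- toy %>%
  arrange(brand_name, year) %>%
  group_by(brand_name) %>%
  mutate(pct_change = 100 * (avgTotal - dplyr::lag(avgTotal)) / dplyr::lag(avgTotal)) %>%
  ungroup()
print(yoy)
```

Applied to the real yearly summaries, negative values of `pct_change` for 2018 would correspond to the fall visible in the line charts.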
data %>%
  group_by(brand_name, year, month) %>%
  summarize(avgTotal = mean(total_total)) %>%
  select(brand_name, avgTotal, month) %>%
  arrange(desc(avgTotal)) %>%
  filter(avgTotal > 3000) %>%
  ggplot(data = ., aes(x = month, y = avgTotal,
                       color = brand_name)) + geom_line()
## Adding missing grouping variables: `year`
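Because the summary is grouped by year as well as month, each brand has several values per month, and a single line per brand connects points from different years. One possible way to separate them, shown here as a sketch on made-up data, is to keep `year` in the result and draw one panel per year with facet_wrap():

```r
library(dplyr)
library(ggplot2)

# Made-up monthly totals for two hypothetical brands over two years.
toy <- expand.grid(
  brand_name = c("A", "B"),
  year = 2016:2017,
  month = 1:3,
  stringsAsFactors = FALSE
)
toy$total_total <- c(50, 80, 55, 70, 60, 90, 65, 75, 70, 95, 60, 85)

monthly <- toy %>%
  group_by(brand_name, year, month) %>%
  summarize(avgTotal = mean(total_total)) %>%
  ungroup()

# One panel per year, so each line only connects months of the same year.
p <- ggplot(monthly, aes(x = month, y = avgTotal, color = brand_name)) +
  geom_line() +
  facet_wrap(~ year)
print(p)
```

An alternative with the same effect in a single panel would be to map `group = interaction(brand_name, year)` in aes().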
Here, the most valuable brands move together from month to month, and the lower-valued brands also follow a common pattern.
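The visual impression of brands moving together can be backed up with a correlation matrix. The sketch below reshapes made-up monthly values for three hypothetical brands into a month-by-brand matrix with base R's tapply() and passes it to cor(); the numbers are chosen only to illustrate the mechanics.

```r
# Made-up monthly averages for three hypothetical brands.
toy <- data.frame(
  brand_name = rep(c("A", "B", "C"), each = 4),
  month = rep(1:4, times = 3),
  avgTotal = c(10, 20, 30, 40, 15, 25, 35, 45, 40, 30, 20, 10),
  stringsAsFactors = FALSE
)

# Rows are months, columns are brands.
wide <- tapply(toy$avgTotal, list(toy$month, toy$brand_name), mean)

# Pairwise Pearson correlation between the brand columns.
brand_cor <- cor(wide)
print(round(brand_cor, 2))
```

Values close to 1 between two brands would support the claim that they rise and fall together, while values near 0 or below would contradict it.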