First, I loaded the dataset and removed the rows that contain wrong information, namely those whose brand_name starts with "ODD" or "TOPLAM". I also created a combined ‘Date’ variable from year and month.
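library(dplyr)  # %>%, filter(), arrange()
library(zoo)    # as.yearmon()

rds_data <- readRDS("C:/Users/kerim.acar/Desktop/BDA/BDA503/car_data_aggregate.rds")
# Drop rows whose brand_name starts with "ODD" or "TOPLAM"
rds_data <- rds_data %>%
  filter(!(startsWith(brand_name, "ODD") | startsWith(brand_name, "TOPLAM")))
# frac = 1 maps each year-month to the last day of that month
rds_data$Date <- as.Date(as.yearmon(paste(rds_data$year, rds_data$month), "%Y %m"), frac = 1)
rds_data <- rds_data %>%
  arrange(Date)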
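To illustrate how the Date variable is built, here is a minimal sketch on two made-up year and month values (the vectors `year` and `month` below are illustrative and are not read from the dataset). With `frac = 1`, `as.Date()` on a `yearmon` value returns the last day of the month, which is why every date in this analysis falls on a month end.

```r
library(zoo)

# Illustrative inputs, not taken from the dataset
year  <- c(2016, 2016)
month <- c(1, 2)

ym <- as.yearmon(paste(year, month), "%Y %m")
month_end   <- as.Date(ym, frac = 1)  # last day of each month
month_start <- as.Date(ym)            # default frac = 0: first day of each month

print(month_end)
print(month_start)
```

Using the month end or the month start only shifts the points on the time axis; either choice works as long as it is applied consistently.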
I wanted to see the best and the worst sellers in the time period between January 2016 and September 2018.
rds_total_total <- rds_data %>%
  group_by(brand_name) %>%
  summarise(total_total = sum(total_total)) %>%
  arrange(desc(total_total))
I wanted to analyze the sales performance of individual brands, so I selected three top sellers (RENAULT, FIAT, FORD), two average sellers (AUDI, BMW) and one bottom seller (VOLVO).
brands <- c("RENAULT", "FIAT", "FORD", "AUDI", "VOLVO", "BMW")
selected_data <- subset(rds_data, brand_name %in% brands)
# Keep only the brand, total sales and date columns (columns 1, 10 and 13)
selected_data <- selected_data[, c("brand_name", "total_total", "Date")]
head(selected_data)
## # A tibble: 6 x 3
##   brand_name total_total Date
##   <chr>            <dbl> <date>
## 1 AUDI               911 2016-01-31
## 2 BMW                496 2016-01-31
## 3 FIAT              3843 2016-01-31
## 4 FORD              3770 2016-01-31
## 5 RENAULT           4519 2016-01-31
## 6 VOLVO              187 2016-01-31
I plotted total sales over time for the selected brands, as points connected by lines.
library(ggplot2)
library(scales)  # date_format(), date_breaks()

ggplot(data = selected_data, aes(x = Date, y = total_total, color = brand_name)) +
  geom_point() +
  geom_line() +
  scale_x_date(labels = date_format("%m/%Y"), breaks = date_breaks(width = "2 months")) +
  scale_y_continuous(breaks = seq(0, 21000, 1000)) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 20)) +
  ylab("Total Sales") + ggtitle("Total Sales by Time of Selected Brands")
As the graph shows, the top sellers’ sales follow the same pattern: they rise sharply at the end of each year and fall by a similar amount at the beginning of the next.
I reshaped the dataset into wide format, with the selected brands as columns, the dates as rows and total sales as the values.
library(tidyr)  # spread()

selected_data <- selected_data[, c("Date", "brand_name", "total_total")]
new <- selected_data %>%
  spread(brand_name, total_total)
head(new)
## # A tibble: 6 x 7
##   Date        AUDI   BMW  FIAT  FORD RENAULT VOLVO
##   <date>     <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>
## 1 2016-01-31   911   496  3843  3770    4519   187
## 2 2016-02-29  1200  2042  5502  6940    6162   194
## 3 2016-03-31  1455  2080  8400  9758   11293   311
## 4 2016-04-30  2148  2332  9521  9151   12075   335
## 5 2016-05-31  2352  3174  9636 10051   12741   512
## 6 2016-07-31   905  1856  5929  6370    6274   130
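The tidyr documentation marks `spread()` as superseded by `pivot_wider()`. As a sketch of the same reshaping with the newer function, the example below uses a small hand-typed data frame (`long`, an illustrative name) holding the January and February 2016 values for AUDI and BMW from the output above, rather than the full dataset.

```r
library(tidyr)

# Hand-typed subset of the long-format data: two months, two brands
long <- data.frame(
  Date        = as.Date(c("2016-01-31", "2016-01-31", "2016-02-29", "2016-02-29")),
  brand_name  = c("AUDI", "BMW", "AUDI", "BMW"),
  total_total = c(911, 496, 1200, 2042)
)

# One column per brand, one row per date, filled with total sales
wide <- pivot_wider(long, names_from = brand_name, values_from = total_total)
print(wide)
```

On the full data, the equivalent call would presumably be `pivot_wider(selected_data, names_from = brand_name, values_from = total_total)`.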
Then, I created a correlation matrix for the selected brands. As the previous graph suggested, the top sellers (FIAT, FORD, RENAULT) are highly correlated with each other.
library(GGally)  # ggpairs()
ggpairs(new[, -1]) + theme_classic()
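To read the correlations as plain numbers, base R's `cor()` can be applied to the same columns. The sketch below is only an illustration: it uses the six rows printed by `head(new)` above, typed in by hand as a data frame named `sales` (a name chosen here), so its coefficients will differ from those computed on the full period.

```r
# The six rows printed by head(new), without the Date column
sales <- data.frame(
  AUDI    = c(911, 1200, 1455, 2148, 2352, 905),
  BMW     = c(496, 2042, 2080, 2332, 3174, 1856),
  FIAT    = c(3843, 5502, 8400, 9521, 9636, 5929),
  FORD    = c(3770, 6940, 9758, 9151, 10051, 6370),
  RENAULT = c(4519, 6162, 11293, 12075, 12741, 6274),
  VOLVO   = c(187, 194, 311, 335, 512, 130)
)

# Pearson correlation (the default) between every pair of brands
corr <- cor(sales)
print(round(corr, 2))
```

On the full data, `cor(new[, -1])` would give the corresponding matrix, assuming no missing values in the brand columns.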