First we find the data on Otomotiv Distibütörleri Derneği website. We are interested in August 2017 sales. We download the data change the name to odd_retail_sales_2017_08.xlsx. We will make a reproducible example of data analysis from the raw data located somewhere to the final analysis.
Download Raw Data Our raw excel file is in our repository. We can automatically download that file and put it in a temporary file. Then we can read that excel document into R and remove the temp file.
raw_data <- read_excel("C:\\Users\\Efehan\\Documents\\odd_retail_sales_2017_08.xlsx",skip=7,col_names=FALSE)
# Remove the last two rows because they are irrelevant (total and empty rows)
raw_data <- raw_data[-c(46,47),]
# Let's see our cleaned raw data
print(raw_data,n=45)
## # A tibble: 45 x 10
## X__1 X__2 X__3 X__4 X__5 X__6 X__7 X__8 X__9 X__10
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALFA ROMEO NA 19 19 NA NA 0 0 19 19
## 2 ASTON MARTIN NA 1 1 NA NA 0 0 1 1
## 3 AUDI NA 1055 1055 NA NA 0 0 1055 1055
## 4 BENTLEY NA 2 2 NA NA 0 0 2 2
## 5 BMW NA 1285 1285 NA NA 0 0 1285 1285
## 6 CHERY NA NA 0 NA NA 0 0 0 0
## 7 CITROEN NA 1321 1321 0 682 682 0 2003 2003
## 8 DACIA NA 3467 3467 NA 538 538 0 4005 4005
## 9 DS NA 1 1 NA NA 0 0 1 1
## 10 FERRARI NA 3 3 NA NA 0 0 3 3
## 11 FIAT 4881 184 5065 3885 677 4562 8766 861 9627
## 12 FORD 155 2826 2981 4184 423 4607 4339 3249 7588
## 13 GEELY NA NA 0 NA NA 0 0 0 0
## 14 HONDA 1155 510 1665 NA NA 0 1155 510 1665
## 15 HYUNDAI 1415 2299 3714 NA 248 248 1415 2547 3962
## 16 INFINITI NA 13 13 NA NA 0 0 13 13
## 17 ISUZU NA NA 0 159 16 175 159 16 175
## 18 IVECO NA NA 0 NA 221 221 0 221 221
## 19 JAGUAR NA 34 34 NA NA 0 0 34 34
## 20 JEEP NA 174 174 NA NA 0 0 174 174
## 21 KARSAN NA NA 0 112 NA 112 112 0 112
## 22 KIA NA 920 920 NA 209 209 0 1129 1129
## 23 LAMBORGHINI NA 0 0 NA NA 0 0 0 0
## 24 LAND ROVER NA 170 170 NA NA 0 0 170 170
## 25 LEXUS NA 2 2 NA NA 0 0 2 2
## 26 MASERATI NA 6 6 NA NA 0 0 6 6
## 27 MAZDA NA 120 120 NA NA 0 0 120 120
## 28 MERCEDES-BENZ NA 2324 2324 NA 845 845 0 3169 3169
## 29 MINI NA 133 133 NA NA 0 0 133 133
## 30 MITSUBISHI NA 25 25 24 228 252 24 253 277
## 31 NISSAN NA 2429 2429 NA 233 233 0 2662 2662
## 32 OPEL NA 3564 3564 NA NA 0 0 3564 3564
## 33 OTOKAR NA NA 0 NA NA 0 0 0 0
## 34 PEUGEOT NA 2412 2412 0 805 805 0 3217 3217
## 35 PORSCHE NA 37 37 NA NA 0 0 37 37
## 36 RENAULT 4817 3714 8531 NA 1375 1375 4817 5089 9906
## 37 SEAT NA 1030 1030 NA NA 0 0 1030 1030
## 38 SKODA NA 1937 1937 NA NA 0 0 1937 1937
## 39 SMART NA 6 6 NA NA 0 0 6 6
## 40 SSANGYONG NA 24 24 NA 49 49 0 73 73
## 41 SUBARU NA 131 131 NA NA 0 0 131 131
## 42 SUZUKI NA 458 458 NA NA 0 0 458 458
## 43 TOYOTA 2033 501 2534 NA 328 328 2033 829 2862
## 44 VOLKSWAGEN NA 6957 6957 NA 2405 2405 0 9362 9362
## 45 VOLVO NA 340 340 NA NA 0 0 340 340
In order to make the data standardized and workable we need to define column names and remove NA values for this example. Please use the same column names in your examples also. One of the best methods is to save your data to an RDS or RData file. The difference is RDS can hold only one object but RData can hold many. Since we have only one data frame here we will go with RDS.
# Use the same column names in your data.
colnames(raw_data) <- c("brand_name","auto_dom","auto_imp","auto_total","comm_dom","comm_imp","comm_total","total_dom","total_imp","total_total")
# Now we replace NA values with 0 and label the time period with year and month, so when we merge the data we won't be confused.
car_data_aug_18 <- raw_data %>% mutate_if(is.numeric,funs(ifelse(is.na(.),0,.))) %>% mutate(year=2017,month=8)
print(car_data_aug_18,width=Inf)
## # A tibble: 45 x 12
## brand_name auto_dom auto_imp auto_total comm_dom comm_imp comm_total
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALFA ROMEO 0 19 19 0 0 0
## 2 ASTON MARTIN 0 1 1 0 0 0
## 3 AUDI 0 1055 1055 0 0 0
## 4 BENTLEY 0 2 2 0 0 0
## 5 BMW 0 1285 1285 0 0 0
## 6 CHERY 0 0 0 0 0 0
## 7 CITROEN 0 1321 1321 0 682 682
## 8 DACIA 0 3467 3467 0 538 538
## 9 DS 0 1 1 0 0 0
## 10 FERRARI 0 3 3 0 0 0
## total_dom total_imp total_total year month
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 19 19 2017 8
## 2 0 1 1 2017 8
## 3 0 1055 1055 2017 8
## 4 0 2 2 2017 8
## 5 0 1285 1285 2017 8
## 6 0 0 0 2017 8
## 7 0 2003 2003 2017 8
## 8 0 4005 4005 2017 8
## 9 0 1 1 2017 8
## 10 0 3 3 2017 8
## # ... with 35 more rows
#Save as RDS file
saveRDS(car_data_aug_18,file="C:\\Users\\Efehan\\Documents\\odd_car_sales_data_aug_18.rds")
After cleaning 0’s and NaN’s we see that 9 out of 45 brands sold cars produced in Turkey while all others produced abroad. Further analysis can be made according to number of cars sold. Secondly I checked which brand with no domestic production sold at most in Turkey.
# Now we replace NA values with 0 and label the time period with year and month, so when we merge the data we won't be confused.
car_data_aug_18 <- raw_data %>% mutate_if(is.numeric,funs(ifelse(is.na(.),0,.))) %>% mutate(year=2017,month=8)
# Let's look for share of domestic production in car sales by brand.First I am adding a new column to see the percentage.
car_data_aug_18 <- car_data_aug_18 %>% mutate(domtotratio = round(total_dom / total_total * 100))
# Since at some months there is no sales at all for certain brands, we got NaN values (standing for not number) which can distort our plot. Below we get rid of those rows which are the brands below.
print(filter(car_data_aug_18,car_data_aug_18$domtotratio == "NaN"))
## # A tibble: 4 x 13
## brand_name auto_dom auto_imp auto_total comm_dom comm_imp comm_total
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 CHERY 0 0 0 0 0 0
## 2 GEELY 0 0 0 0 0 0
## 3 LAMBORGHI~ 0 0 0 0 0 0
## 4 OTOKAR 0 0 0 0 0 0
## # ... with 6 more variables: total_dom <dbl>, total_imp <dbl>,
## # total_total <dbl>, year <dbl>, month <dbl>, domtotratio <dbl>
car_data_aug_18_withoutnan <- car_data_aug_18[!is.nan(car_data_aug_18$domtotratio),]
#Now we plot our analysis. .
ggplot(car_data_aug_18_withoutnan,aes(x=reorder(brand_name,domtotratio),y=domtotratio / 100))+
geom_col()+
coord_flip()+
geom_text(aes(label=domtotratio),color="orange")+
scale_y_continuous(labels = percent_format(),limits=c(0,1))+
ggtitle("Share of Domestic Production In Car Sales At August 2017 By Brand")
# Lastly I would like to see which imported brand without domestic production sold at most in Turkey at August 2017.
car_data_salewithoutdomprod <- car_data_aug_18_withoutnan %>%
arrange(desc(total_total)) %>%
filter(domtotratio == 0) %>%
select(brand_name, total_imp)
print(car_data_salewithoutdomprod,n=32)
## # A tibble: 32 x 2
## brand_name total_imp
## <chr> <dbl>
## 1 VOLKSWAGEN 9362
## 2 DACIA 4005
## 3 OPEL 3564
## 4 PEUGEOT 3217
## 5 MERCEDES-BENZ 3169
## 6 NISSAN 2662
## 7 CITROEN 2003
## 8 SKODA 1937
## 9 BMW 1285
## 10 KIA 1129
## 11 AUDI 1055
## 12 SEAT 1030
## 13 SUZUKI 458
## 14 VOLVO 340
## 15 IVECO 221
## 16 JEEP 174
## 17 LAND ROVER 170
## 18 MINI 133
## 19 SUBARU 131
## 20 MAZDA 120
## 21 SSANGYONG 73
## 22 PORSCHE 37
## 23 JAGUAR 34
## 24 ALFA ROMEO 19
## 25 INFINITI 13
## 26 MASERATI 6
## 27 SMART 6
## 28 FERRARI 3
## 29 BENTLEY 2
## 30 LEXUS 2
## 31 ASTON MARTIN 1
## 32 DS 1