From Raw to Civilized Data

First we find the data on Otomotiv Distibütörleri Derneği website. We are interested in August 2017 sales. We download the data change the name to odd_retail_sales_2017_08.xlsx. We will make a reproducible example of data analysis from the raw data located somewhere to the final analysis.

Download Raw Data Our raw excel file is in our repository. We can automatically download that file and put it in a temporary file. Then we can read that excel document into R and remove the temp file.

raw_data <- read_excel("C:\\Users\\Efehan\\Documents\\odd_retail_sales_2017_08.xlsx",skip=7,col_names=FALSE)

# Remove the last two rows because they are irrelevant (total and empty rows)
raw_data <- raw_data[-c(46,47),]

# Let's see our cleaned raw data
print(raw_data,n=45)
## # A tibble: 45 x 10
##    X__1           X__2  X__3  X__4  X__5  X__6  X__7  X__8  X__9 X__10
##    <chr>         <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 ALFA ROMEO       NA    19    19    NA    NA     0     0    19    19
##  2 ASTON MARTIN     NA     1     1    NA    NA     0     0     1     1
##  3 AUDI             NA  1055  1055    NA    NA     0     0  1055  1055
##  4 BENTLEY          NA     2     2    NA    NA     0     0     2     2
##  5 BMW              NA  1285  1285    NA    NA     0     0  1285  1285
##  6 CHERY            NA    NA     0    NA    NA     0     0     0     0
##  7 CITROEN          NA  1321  1321     0   682   682     0  2003  2003
##  8 DACIA            NA  3467  3467    NA   538   538     0  4005  4005
##  9 DS               NA     1     1    NA    NA     0     0     1     1
## 10 FERRARI          NA     3     3    NA    NA     0     0     3     3
## 11 FIAT           4881   184  5065  3885   677  4562  8766   861  9627
## 12 FORD            155  2826  2981  4184   423  4607  4339  3249  7588
## 13 GEELY            NA    NA     0    NA    NA     0     0     0     0
## 14 HONDA          1155   510  1665    NA    NA     0  1155   510  1665
## 15 HYUNDAI        1415  2299  3714    NA   248   248  1415  2547  3962
## 16 INFINITI         NA    13    13    NA    NA     0     0    13    13
## 17 ISUZU            NA    NA     0   159    16   175   159    16   175
## 18 IVECO            NA    NA     0    NA   221   221     0   221   221
## 19 JAGUAR           NA    34    34    NA    NA     0     0    34    34
## 20 JEEP             NA   174   174    NA    NA     0     0   174   174
## 21 KARSAN           NA    NA     0   112    NA   112   112     0   112
## 22 KIA              NA   920   920    NA   209   209     0  1129  1129
## 23 LAMBORGHINI      NA     0     0    NA    NA     0     0     0     0
## 24 LAND ROVER       NA   170   170    NA    NA     0     0   170   170
## 25 LEXUS            NA     2     2    NA    NA     0     0     2     2
## 26 MASERATI         NA     6     6    NA    NA     0     0     6     6
## 27 MAZDA            NA   120   120    NA    NA     0     0   120   120
## 28 MERCEDES-BENZ    NA  2324  2324    NA   845   845     0  3169  3169
## 29 MINI             NA   133   133    NA    NA     0     0   133   133
## 30 MITSUBISHI       NA    25    25    24   228   252    24   253   277
## 31 NISSAN           NA  2429  2429    NA   233   233     0  2662  2662
## 32 OPEL             NA  3564  3564    NA    NA     0     0  3564  3564
## 33 OTOKAR           NA    NA     0    NA    NA     0     0     0     0
## 34 PEUGEOT          NA  2412  2412     0   805   805     0  3217  3217
## 35 PORSCHE          NA    37    37    NA    NA     0     0    37    37
## 36 RENAULT        4817  3714  8531    NA  1375  1375  4817  5089  9906
## 37 SEAT             NA  1030  1030    NA    NA     0     0  1030  1030
## 38 SKODA            NA  1937  1937    NA    NA     0     0  1937  1937
## 39 SMART            NA     6     6    NA    NA     0     0     6     6
## 40 SSANGYONG        NA    24    24    NA    49    49     0    73    73
## 41 SUBARU           NA   131   131    NA    NA     0     0   131   131
## 42 SUZUKI           NA   458   458    NA    NA     0     0   458   458
## 43 TOYOTA         2033   501  2534    NA   328   328  2033   829  2862
## 44 VOLKSWAGEN       NA  6957  6957    NA  2405  2405     0  9362  9362
## 45 VOLVO            NA   340   340    NA    NA     0     0   340   340

Some Formatting and Saving RDS File

In order to make the data standardized and workable we need to define column names and remove NA values for this example. Please use the same column names in your examples also. One of the best methods is to save your data to an RDS or RData file. The difference is RDS can hold only one object but RData can hold many. Since we have only one data frame here we will go with RDS.

# Use the same column names in your data.
colnames(raw_data) <- c("brand_name","auto_dom","auto_imp","auto_total","comm_dom","comm_imp","comm_total","total_dom","total_imp","total_total")
# Now we replace NA values with 0 and label the time period with year and month, so when we merge the data we won't be confused.
car_data_aug_18 <- raw_data %>% mutate_if(is.numeric,funs(ifelse(is.na(.),0,.))) %>% mutate(year=2017,month=8)

print(car_data_aug_18,width=Inf)
## # A tibble: 45 x 12
##    brand_name   auto_dom auto_imp auto_total comm_dom comm_imp comm_total
##    <chr>           <dbl>    <dbl>      <dbl>    <dbl>    <dbl>      <dbl>
##  1 ALFA ROMEO          0       19         19        0        0          0
##  2 ASTON MARTIN        0        1          1        0        0          0
##  3 AUDI                0     1055       1055        0        0          0
##  4 BENTLEY             0        2          2        0        0          0
##  5 BMW                 0     1285       1285        0        0          0
##  6 CHERY               0        0          0        0        0          0
##  7 CITROEN             0     1321       1321        0      682        682
##  8 DACIA               0     3467       3467        0      538        538
##  9 DS                  0        1          1        0        0          0
## 10 FERRARI             0        3          3        0        0          0
##    total_dom total_imp total_total  year month
##        <dbl>     <dbl>       <dbl> <dbl> <dbl>
##  1         0        19          19  2017     8
##  2         0         1           1  2017     8
##  3         0      1055        1055  2017     8
##  4         0         2           2  2017     8
##  5         0      1285        1285  2017     8
##  6         0         0           0  2017     8
##  7         0      2003        2003  2017     8
##  8         0      4005        4005  2017     8
##  9         0         1           1  2017     8
## 10         0         3           3  2017     8
## # ... with 35 more rows
#Save as RDS file
saveRDS(car_data_aug_18,file="C:\\Users\\Efehan\\Documents\\odd_car_sales_data_aug_18.rds")

Some Analysis

After cleaning 0’s and NaN’s we see that 9 out of 45 brands sold cars produced in Turkey while all others produced abroad. Further analysis can be made according to number of cars sold. Secondly I checked which brand with no domestic production sold at most in Turkey.

# Now we replace NA values with 0 and label the time period with year and month, so when we merge the data we won't be confused.
car_data_aug_18 <- raw_data %>% mutate_if(is.numeric,funs(ifelse(is.na(.),0,.))) %>% mutate(year=2017,month=8)
# Let's look for share of domestic production in car sales by brand.First I am adding a new column to see the percentage.
car_data_aug_18 <- car_data_aug_18 %>% mutate(domtotratio = round(total_dom / total_total * 100))

# Since at some months there is no sales at all for certain brands, we got NaN values (standing for not number) which can distort our plot. Below we get rid of those rows which are the brands below.

print(filter(car_data_aug_18,car_data_aug_18$domtotratio == "NaN"))
## # A tibble: 4 x 13
##   brand_name auto_dom auto_imp auto_total comm_dom comm_imp comm_total
##   <chr>         <dbl>    <dbl>      <dbl>    <dbl>    <dbl>      <dbl>
## 1 CHERY             0        0          0        0        0          0
## 2 GEELY             0        0          0        0        0          0
## 3 LAMBORGHI~        0        0          0        0        0          0
## 4 OTOKAR            0        0          0        0        0          0
## # ... with 6 more variables: total_dom <dbl>, total_imp <dbl>,
## #   total_total <dbl>, year <dbl>, month <dbl>, domtotratio <dbl>
car_data_aug_18_withoutnan <- car_data_aug_18[!is.nan(car_data_aug_18$domtotratio),]

#Now we plot our analysis. .
ggplot(car_data_aug_18_withoutnan,aes(x=reorder(brand_name,domtotratio),y=domtotratio / 100))+
  geom_col()+
  coord_flip()+
 geom_text(aes(label=domtotratio),color="orange")+
  scale_y_continuous(labels = percent_format(),limits=c(0,1))+
  ggtitle("Share of Domestic Production In Car Sales At August 2017 By Brand")

# Lastly I would like to see which imported brand without domestic production sold at most in Turkey at August 2017.
car_data_salewithoutdomprod <- car_data_aug_18_withoutnan %>%
arrange(desc(total_total)) %>%
filter(domtotratio == 0) %>%
select(brand_name, total_imp) 

print(car_data_salewithoutdomprod,n=32)
## # A tibble: 32 x 2
##    brand_name    total_imp
##    <chr>             <dbl>
##  1 VOLKSWAGEN         9362
##  2 DACIA              4005
##  3 OPEL               3564
##  4 PEUGEOT            3217
##  5 MERCEDES-BENZ      3169
##  6 NISSAN             2662
##  7 CITROEN            2003
##  8 SKODA              1937
##  9 BMW                1285
## 10 KIA                1129
## 11 AUDI               1055
## 12 SEAT               1030
## 13 SUZUKI              458
## 14 VOLVO               340
## 15 IVECO               221
## 16 JEEP                174
## 17 LAND ROVER          170
## 18 MINI                133
## 19 SUBARU              131
## 20 MAZDA               120
## 21 SSANGYONG            73
## 22 PORSCHE              37
## 23 JAGUAR               34
## 24 ALFA ROMEO           19
## 25 INFINITI             13
## 26 MASERATI              6
## 27 SMART                 6
## 28 FERRARI               3
## 29 BENTLEY               2
## 30 LEXUS                 2
## 31 ASTON MARTIN          1
## 32 DS                    1