From Raw to Civilized Data

Car sales data downloaded from “Otomotiv Distributorleri Dernegi” website. My part was to obtain January 2016 sales and name of the excel file changed to “odd_retail_sales_2016_01.xlsx”.

Download Raw Data

My raw excel file is in my repository. I downloaded that file and put it in a temporary file. Then I read that excel document into R and removed the temp file. Last two rows also deleted from the dataset since it consists of total sales info which is unnecessary. Then I showed first 10 rows of the dataset.

tmp<-tempfile(fileext=".xlsx")
download.file("https://github.com/MEF-BDA503/pj18-mkerimacar/blob/master/odd_retail_sales_2016_01.xlsx?raw=true",destfile=tmp)
raw_data<-readxl::read_excel(tmp,skip=7,col_names=FALSE)
file.remove(tmp)
## [1] TRUE
raw_data <- raw_data %>% slice(-c(49,50))
head(raw_data,10)
## # A tibble: 10 x 10
##    X__1               X__2  X__3  X__4  X__5  X__6  X__7  X__8  X__9 X__10
##    <chr>             <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 ALFA ROMEO           NA    12    12    NA    NA     0     0    12    12
##  2 "ASTON MART\u013~    NA     2     2    NA    NA     0     0     2     2
##  3 AUDI                 NA   911   911    NA    NA     0     0   911   911
##  4 BENTLEY              NA    NA     0    NA    NA     0     0     0     0
##  5 BMW                  NA   496   496    NA    NA     0     0   496   496
##  6 CHERY                NA    30    30    NA    NA     0     0    30    30
##  7 CITROEN              NA   394   394    41   207   248    41   601   642
##  8 DACIA                NA  1235  1235    NA   221   221     0  1456  1456
##  9 DS                   NA     8     8    NA    NA     0     0     8     8
## 10 FERRARI              NA     3     3    NA    NA     0     0     3     3

Make Data Civilized

In order to make the data standardized and workable I renamed column names and removed NA values from the dataset.

colnames(raw_data) <- c("brand_name","auto_dom","auto_imp","auto_total","comm_dom","comm_imp","comm_total","total_dom","total_imp","total_total")
car_data_jan_16 <- raw_data %>% mutate_if(is.numeric,funs(ifelse(is.na(.),0,.))) %>% mutate(year=2016,month=1)
print(car_data_jan_16,width=Inf)
## # A tibble: 48 x 12
##    brand_name          auto_dom auto_imp auto_total comm_dom comm_imp
##    <chr>                  <dbl>    <dbl>      <dbl>    <dbl>    <dbl>
##  1 ALFA ROMEO                 0       12         12        0        0
##  2 "ASTON MART\u0130N"        0        2          2        0        0
##  3 AUDI                       0      911        911        0        0
##  4 BENTLEY                    0        0          0        0        0
##  5 BMW                        0      496        496        0        0
##  6 CHERY                      0       30         30        0        0
##  7 CITROEN                    0      394        394       41      207
##  8 DACIA                      0     1235       1235        0      221
##  9 DS                         0        8          8        0        0
## 10 FERRARI                    0        3          3        0        0
##    comm_total total_dom total_imp total_total  year month
##         <dbl>     <dbl>     <dbl>       <dbl> <dbl> <dbl>
##  1          0         0        12          12  2016     1
##  2          0         0         2           2  2016     1
##  3          0         0       911         911  2016     1
##  4          0         0         0           0  2016     1
##  5          0         0       496         496  2016     1
##  6          0         0        30          30  2016     1
##  7        248        41       601         642  2016     1
##  8        221         0      1456        1456  2016     1
##  9          0         0         8           8  2016     1
## 10          0         0         3           3  2016     1
## # ... with 38 more rows

Relative Frequencies

I wanted to see relative frequencies of total sales and I created a column named as total_relfreqs next to total sales (rounded to 3 decimals).

car_data_jan_16 %>% 
  filter(auto_total > 0 & comm_total > 0) %>%
  select(brand_name,total_total) %>%
  arrange(desc(total_total)) %>%
  mutate(total_relfreqs = (total_total/sum(total_total))) %>%
  mutate(total_relfreqs = sprintf("%0.3f", total_relfreqs))
## # A tibble: 13 x 3
##    brand_name    total_total total_relfreqs
##    <chr>               <dbl> <chr>         
##  1 VOLKSWAGEN           4528 0.175         
##  2 RENAULT              4519 0.174         
##  3 FIAT                 3843 0.148         
##  4 FORD                 3770 0.145         
##  5 HYUNDAI              2301 0.089         
##  6 TOYOTA               1481 0.057         
##  7 DACIA                1456 0.056         
##  8 MERCEDES-BENZ        1451 0.056         
##  9 PEUGEOT              1111 0.043         
## 10 CITROEN               642 0.025         
## 11 KIA                   418 0.016         
## 12 MITSUBISHI            388 0.015         
## 13 SSANGYONG              22 0.001