Serhan SÜER

From Raw to Civilized Data

First we find the data on Otomotiv Distibütörleri Derneği website. We are interested in January 2018 sales. We download the data change the name to odd_retail_sales_2018_01.xlsx.

Download Raw Data

Our raw excel file is in our repository. We can automatically download that file and put it in a temporary file. Then we can read that excel document into R and remove the temp file.

# Create a temporary file
tmp<-tempfile(fileext=".xlsx")
# Download file from repository to the temp file
download.file("https://github.com/MEF-BDA503/pj18-SerhanSuer/blob/master/odd_retail_sales_2018_01.xlsx?raw=true",destfile=tmp)
# Read that excel file using readxl package's read_excel function. You might need to adjust the parameters (skip, col_names) according to your raw file's format.
raw_data<-readxl::read_excel(tmp,skip=7,col_names=FALSE)
# Remove the temp file
file.remove(tmp)
## [1] TRUE
# Remove the last two rows because they are irrelevant (total and empty rows)
raw_data <- raw_data %>% slice(-c(43,44))
# Let's see our raw data
head(raw_data)
## # A tibble: 6 x 10
##   X__1          X__2  X__3  X__4  X__5  X__6  X__7  X__8  X__9 X__10
##   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALFA ROMEO      NA     5     5    NA    NA     0     0     5     5
## 2 ASTON MARTIN    NA     1     1    NA    NA     0     0     1     1
## 3 AUDI            NA  1013  1013    NA    NA     0     0  1013  1013
## 4 BENTLEY         NA     0     0    NA    NA     0     0     0     0
## 5 BMW             NA   626   626    NA    NA     0     0   626   626
## 6 CHERY           NA    NA     0    NA    NA     0     0     0     0

Make Data Civilized

In order to make the data standardized and workable we need to define column names and remove NA values for this example.

# Use the same column names in your data.
colnames(raw_data) <- c("brand_name","auto_dom","auto_imp","auto_total","comm_dom","comm_imp","comm_total","total_dom","total_imp","total_total")
# Now we replace NA values with 0 and label the time period with year and month, so when we merge the data we won't be confused.
car_data_jan_18 <- raw_data %>% 
    mutate_if(is.numeric,funs(ifelse(is.na(.),0,.))) %>% 
    mutate(year=2018,month=1)

print(car_data_jan_18,width=Inf)
## # A tibble: 43 x 12
##    brand_name   auto_dom auto_imp auto_total comm_dom comm_imp comm_total
##    <chr>           <dbl>    <dbl>      <dbl>    <dbl>    <dbl>      <dbl>
##  1 ALFA ROMEO          0        5          5        0        0          0
##  2 ASTON MARTIN        0        1          1        0        0          0
##  3 AUDI                0     1013       1013        0        0          0
##  4 BENTLEY             0        0          0        0        0          0
##  5 BMW                 0      626        626        0        0          0
##  6 CHERY               0        0          0        0        0          0
##  7 CITROEN             0      407        407        0       94         94
##  8 DACIA               0      937        937        0      254        254
##  9 DS                  0        0          0        0        0          0
## 10 FERRARI             0        1          1        0        0          0
##    total_dom total_imp total_total  year month
##        <dbl>     <dbl>       <dbl> <dbl> <dbl>
##  1         0         5           5  2018     1
##  2         0         1           1  2018     1
##  3         0      1013        1013  2018     1
##  4         0         0           0  2018     1
##  5         0       626         626  2018     1
##  6         0         0           0  2018     1
##  7         0       501         501  2018     1
##  8         0      1191        1191  2018     1
##  9         0         0           0  2018     1
## 10         0         1           1  2018     1
## # ... with 33 more rows

Finish With Some Analysis

I wanted to make four analysis related to this dataset.

  1. Firstly, I wanted to see a list of total sales of brands with both automobile and commercial vehicle sales ordered only in domain sector in decreasing total sales.
car_data_jan_18 %>% 
    filter(total_imp == 0 & total_total != 0) %>%
    select(brand_name,total_total) %>%
    arrange(desc(total_total))
## # A tibble: 1 x 2
##   brand_name total_total
##   <chr>            <dbl>
## 1 KARSAN              49
  1. Then I wanted to see a list of total sales of brands with both automobile and commercial vehicle sales ordered only in import sector in decreasing total sales.
car_data_jan_18 %>% 
    filter(total_dom == 0 & total_total != 0) %>%
    select(brand_name,total_total) %>%
    arrange(desc(total_total))
## # A tibble: 28 x 2
##    brand_name    total_total
##    <chr>               <dbl>
##  1 VOLKSWAGEN           4127
##  2 MERCEDES-BENZ        1670
##  3 SKODA                1564
##  4 OPEL                 1327
##  5 DACIA                1191
##  6 AUDI                 1013
##  7 PEUGEOT               987
##  8 NISSAN                862
##  9 SEAT                  663
## 10 BMW                   626
## # ... with 18 more rows
  1. Then I wanted to see a list of total sales of brands with only automobile vehicle sales ordered in decreasing total sales.
car_data_jan_18 %>% 
    filter(comm_total == 0 & total_total != 0) %>%
    select(brand_name,total_total) %>%
    arrange(desc(total_total))
## # A tibble: 20 x 2
##    brand_name   total_total
##    <chr>              <dbl>
##  1 SKODA               1564
##  2 HONDA               1398
##  3 OPEL                1327
##  4 AUDI                1013
##  5 SEAT                 663
##  6 BMW                  626
##  7 JEEP                 101
##  8 SUBARU                86
##  9 LAND ROVER            84
## 10 SUZUKI                67
## 11 MAZDA                 62
## 12 MINI                  55
## 13 PORSCHE               43
## 14 JAGUAR                12
## 15 ALFA ROMEO             5
## 16 MASERATI               4
## 17 INFINITI               3
## 18 LEXUS                  2
## 19 ASTON MARTIN           1
## 20 FERRARI                1
  1. Then I wanted to see a list of total sales of brands with only commercial vehicle sales ordered in decreasing total sales.
car_data_jan_18 %>% 
    filter(auto_total == 0 & total_total != 0) %>%
    select(brand_name,total_total) %>%
    arrange(desc(total_total))
## # A tibble: 3 x 2
##   brand_name total_total
##   <chr>            <dbl>
## 1 IVECO              122
## 2 ISUZU              121
## 3 KARSAN              49