From Raw to Civilized Data

First we find the data on Otomotiv Distibütörleri Derneği website. We are interested in September 2018 sales. We download the data change the name to odd_retail_sales_2016_03.xlsx (change yours accordingly). We will make a reproducible example of data analysis from the raw data located somewhere to the final analysis.

Download Raw Data

Our raw excel file is in our repository. We can automatically download that file and put it in a temporary file. Then we can read that excel document into R and remove the temp file.

# Create a temporary file
tmp<-tempfile(fileext=".xlsx")
# Download file from repository to the temp file
download.file("https://github.com/MEF-BDA503/pj18-mrtgocer/blob/master/files/odd_retail_sales_2016_03.xlsx?raw=true",mode = 'wb',destfile=tmp)
# Read that excel file using readxl package's read_excel function. You might need to adjust the parameters (skip, col_names) according to your raw file's format.
raw_data<-readxl::read_excel(tmp,skip=7,col_names=FALSE)
# Remove the temp file
file.remove(tmp)
## [1] TRUE
# Remove the last two rows because they are irrelevant (total and empty rows)
raw_data <- raw_data %>% slice(-c(43,44))

# Let's see our raw data
head(raw_data)
## # A tibble: 6 x 10
##   X__1          X__2  X__3  X__4  X__5  X__6  X__7  X__8  X__9 X__10
##   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALFA ROMEO      NA   149   149    NA    NA     0     0   149   149
## 2 ASTON MARTIN    NA     2     2    NA    NA     0     0     2     2
## 3 AUDI            NA  1455  1455    NA    NA     0     0  1455  1455
## 4 BENTLEY         NA    NA     0    NA    NA     0     0     0     0
## 5 BMW             NA  2080  2080    NA    NA     0     0  2080  2080
## 6 CHERY           NA    12    12    NA    NA     0     0    12    12

It’s ok but needs some work.

Make Data Civilized

In order to make the data standardized and workable we need to define column names and remove NA values for this example. Please use the same column names in your examples also.

# Use the same column names in your data.
colnames(raw_data) <- c("brand_name","auto_dom","auto_imp","auto_total","comm_dom","comm_imp","comm_total","total_dom","total_imp","total_total")
# Now we replace NA values with 0 and label the time period with year and month, so when we merge the data we won't be confused.
car_data_march_16 <- raw_data %>% mutate_if(is.numeric,funs(ifelse(is.na(.),0,.))) %>% mutate(year=2016,month=3)

print(car_data_march_16,width=Inf)
## # A tibble: 48 x 12
##    brand_name   auto_dom auto_imp auto_total comm_dom comm_imp comm_total
##    <chr>           <dbl>    <dbl>      <dbl>    <dbl>    <dbl>      <dbl>
##  1 ALFA ROMEO          0      149        149        0        0          0
##  2 ASTON MARTIN        0        2          2        0        0          0
##  3 AUDI                0     1455       1455        0        0          0
##  4 BENTLEY             0        0          0        0        0          0
##  5 BMW                 0     2080       2080        0        0          0
##  6 CHERY               0       12         12        0        0          0
##  7 CITROEN             0     1562       1562      127      752        879
##  8 DACIA               0     3402       3402        0      380        380
##  9 DS                  0       42         42        0        0          0
## 10 FERRARI             0        0          0        0        0          0
##    total_dom total_imp total_total  year month
##        <dbl>     <dbl>       <dbl> <dbl> <dbl>
##  1         0       149         149  2016     3
##  2         0         2           2  2016     3
##  3         0      1455        1455  2016     3
##  4         0         0           0  2016     3
##  5         0      2080        2080  2016     3
##  6         0        12          12  2016     3
##  7       127      2314        2441  2016     3
##  8         0      3782        3782  2016     3
##  9         0        42          42  2016     3
## 10         0         0           0  2016     3
## # ... with 38 more rows

Save Your Civilized Data

One of the best methods is to save your data to an RDS or RData file. The difference is RDS can hold only one object but RData can hold many. Since we have only one data frame here we will go with RDS.

saveRDS(car_data_march_16,file="/home/murat/Masaüstü/pj18-mrtgocer/files/odd_car_sales_data_march_16.rds")
# You can read that file by readRDS and assigning to an object 
# e.g 
# rds_data <- readRDS("~/YOUR_OWN_PATH/odd_car_sales_data_march_16.rds")

Finish With Some Analysis

You are free to make any analysis here. I wanted to see a list of total sales of brands with both automobile and commercial vehicle sales ordered in decreasing total sales.

car_data_march_16 %>% 
  filter(auto_total > 0 & comm_total > 0) %>%
  select(brand_name,total_total) %>%
  arrange(desc(total_total))
## # A tibble: 15 x 2
##    brand_name    total_total
##    <chr>               <dbl>
##  1 TOPLAM:             82948
##  2 RENAULT             11293
##  3 VOLKSWAGEN          11293
##  4 FORD                 9758
##  5 FIAT                 8400
##  6 HYUNDAI              4567
##  7 DACIA                3782
##  8 TOYOTA               3630
##  9 MERCEDES-BENZ        3405
## 10 NISSAN               3164
## 11 PEUGEOT              3159
## 12 CITROEN              2441
## 13 KIA                  1467
## 14 MITSUBISHI            436
## 15 SSANGYONG              53