From Raw to Civilized Data

First we find the data on Otomotiv Distibütörleri Derneği website. We are interested in September 2018 sales. We download the data change the name to odd_retail_sales_2018_09.xlsx (change yours accordingly). We will make a reproducible example of data analysis from the raw data located somewhere to the final analysis.

Download Raw Data

Our raw excel file is in our repository. We can automatically download that file and put it in a temporary file. Then we can read that excel document into R and remove the temp file.

# Create a temporary file
tmp<-tempfile(fileext=".xlsx")
# Download file from repository to the temp file
download.file("https://github.com/MEF-BDA503/mef-bda503.github.io/blob/master/files/odd_retail_sales_2018_09.xlsx?raw=true",destfile=tmp)
# Read that excel file using readxl package's read_excel function. You might need to adjust the parameters (skip, col_names) according to your raw file's format.
raw_data<-readxl::read_excel(tmp,skip=7,col_names=FALSE)
# Remove the temp file
file.remove(tmp)

## [1] TRUE

# Remove the last two rows because they are irrelevant (total and empty rows)
raw_data <- raw_data %>% slice(-c(43,44))

# Let's see our raw data
head(raw_data)

## # A tibble: 6 x 10
##   X__1          X__2  X__3  X__4  X__5  X__6  X__7  X__8  X__9 X__10
##   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALFA ROMEO      NA    13    13    NA    NA     0     0    13    13
## 2 ASTON MARTIN    NA     2     2    NA    NA     0     0     2     2
## 3 AUDI            NA   350   350    NA    NA     0     0   350   350
## 4 BENTLEY         NA     0     0    NA    NA     0     0     0     0
## 5 BMW             NA   158   158    NA    NA     0     0   158   158
## 6 CITROEN         NA   134   134    NA   197   197     0   331   331

It’s ok but needs some work.

Make Data Civilized

In order to make the data standardized and workable we need to define column names and remove NA values for this example. Please use the same column names in your examples also.

# Use the same column names in your data.
colnames(raw_data) <- c("brand_name","auto_dom","auto_imp","auto_total","comm_dom","comm_imp","comm_total","total_dom","total_imp","total_total")
# Now we replace NA values with 0 and label the time period with year and month, so when we merge the data we won't be confused.
car_data_sep_18 <- raw_data %>% mutate_if(is.numeric,funs(ifelse(is.na(.),0,.))) %>% mutate(year=2018,month=9)

print(car_data_sep_18,width=Inf)

## # A tibble: 42 x 12
##    brand_name   auto_dom auto_imp auto_total comm_dom comm_imp comm_total
##    <chr>           <dbl>    <dbl>      <dbl>    <dbl>    <dbl>      <dbl>
##  1 ALFA ROMEO          0       13         13        0        0          0
##  2 ASTON MARTIN        0        2          2        0        0          0
##  3 AUDI                0      350        350        0        0          0
##  4 BENTLEY             0        0          0        0        0          0
##  5 BMW                 0      158        158        0        0          0
##  6 CITROEN             0      134        134        0      197        197
##  7 DACIA               0     1141       1141        0      319        319
##  8 DS                  0        9          9        0        0          0
##  9 FERRARI             0        3          3        0        0          0
## 10 FIAT              632       57        689      789      199        988
##    total_dom total_imp total_total  year month
##        <dbl>     <dbl>       <dbl> <dbl> <dbl>
##  1         0        13          13  2018     9
##  2         0         2           2  2018     9
##  3         0       350         350  2018     9
##  4         0         0           0  2018     9
##  5         0       158         158  2018     9
##  6         0       331         331  2018     9
##  7         0      1460        1460  2018     9
##  8         0         9           9  2018     9
##  9         0         3           3  2018     9
## 10      1421       256        1677  2018     9
## # ... with 32 more rows

Save Your Civilized Data

One of the best methods is to save your data to an RDS or RData file. The difference is RDS can hold only one object but RData can hold many. Since we have only one data frame here we will go with RDS.

saveRDS(car_data_sep_18,file="~/YOUR_OWN_PATH/odd_car_sales_data_sep_18.rds")
# You can read that file by readRDS and assigning to an object 
# e.g 
# rds_data <- readRDS("~/YOUR_OWN_PATH/odd_car_sales_data_sep_18.rds")

Finish With Some Analysis

You are free to make any analysis here. I wanted to see a list of total sales of brands with both automobile and commercial vehicle sales ordered in decreasing total sales.

car_data_sep_18 %>% 
  filter(auto_total > 0 & comm_total > 0) %>%
  select(brand_name,total_total) %>%
  arrange(desc(total_total))

## # A tibble: 13 x 2
##    brand_name    total_total
##    <chr>               <dbl>
##  1 RENAULT              3186
##  2 FORD                 2356
##  3 VOLKSWAGEN           2239
##  4 FIAT                 1677
##  5 HYUNDAI              1535
##  6 DACIA                1460
##  7 NISSAN               1217
##  8 MERCEDES-BENZ        1163
##  9 TOYOTA               1127
## 10 PEUGEOT               717
## 11 CITROEN               331
## 12 KIA                   311
## 13 MITSUBISHI            213

Remarks

When you are doing your example take this document as a canvas.
Use the exact same column names and don’t forget to add year and month of your data.
Create rds files and email them to the aggregator.
Your final analysis can be anything you are curious about.
If you are not sure about the functions always check the help with ?functionname command.
You can collaborate but your work should be your own. Do not attempt to other students’ months until the deadline.
You can eliminate the Remarks and Saving Civilized Data sections.

Car Sales Data Import - Instructor Example

Instructor

10/17/2018