From Raw to Civilized Data

First, we find the data on the Otomotiv Distribütörleri Derneği (Automotive Distributors' Association) website. We are interested in April 2016 sales. We download the data and rename the file to odd_retail_sales_2016_04.xlsx. We will build a reproducible example of data analysis, from the raw data to the final analysis.

Download Raw Data

Our raw Excel file is in our repository. We can automatically download that file into a temporary file, read the Excel document into R, and then remove the temporary file.
tmp<-tempfile(fileext=".xlsx")
# Download file from repository to the temp file
download.file("https://github.com/MEF-BDA503/pj18-yildizmust/blob/master/odd_retail_sales_2016_04.xlsx?raw=true",destfile=tmp,mode='wb')
# Read that excel file using readxl package's read_excel function. You might need to adjust the parameters (skip, col_names) according to your raw file's format.
raw_data<-readxl::read_excel(tmp,skip=7,col_names=FALSE)
# Remove the temp file
file.remove(tmp)
## [1] TRUE
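The download, read, and clean-up steps above can also be wrapped in one function so that the temporary file is removed even when reading fails. This is only a sketch: `read_remote_file` and its `reader` argument are hypothetical names that do not appear in the original post, and the commented usage assumes the repository URL is still reachable.

```r
# Hypothetical helper (not part of the original post): download a file to a
# temporary location, read it with the supplied reader function, and always
# delete the temporary copy afterwards.
read_remote_file <- function(url, reader, fileext = "") {
  tmp <- tempfile(fileext = fileext)
  # on.exit() runs even if the download or the reader raises an error
  on.exit(unlink(tmp), add = TRUE)
  download.file(url, destfile = tmp, mode = "wb", quiet = TRUE)
  reader(tmp)
}

# Usage with the file from this post (commented out because it needs network access):
# raw_data <- read_remote_file(
#   "https://github.com/MEF-BDA503/pj18-yildizmust/blob/master/odd_retail_sales_2016_04.xlsx?raw=true",
#   reader  = function(path) readxl::read_excel(path, skip = 7, col_names = FALSE),
#   fileext = ".xlsx"
# )
```

Passing the reader as a function keeps the helper independent of readxl, so the same pattern could serve CSV or other formats.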
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------ tidyverse 1.2.1 --
## √ ggplot2 3.1.0 √ purrr 0.2.5
## √ tibble 1.4.2 √ dplyr 0.7.6
## √ tidyr 0.8.1 √ stringr 1.3.1
## √ readr 1.1.1 √ forcats 0.3.0
## -- Conflicts --------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Remove the last two rows because they are irrelevant (total and empty rows)
raw_data <- raw_data %>% slice(-c(49,50))
# Let's see our raw data
head(raw_data)
## # A tibble: 6 x 10
## X__1 X__2 X__3 X__4 X__5 X__6 X__7 X__8 X__9 X__10
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALFA ROMEO NA 58 58 NA NA 0 0 58 58
## 2 ASTON MARTIN NA 4 4 NA NA 0 0 4 4
## 3 AUDI NA 2148 2148 NA NA 0 0 2148 2148
## 4 BENTLEY NA 0 0 NA NA 0 0 0 0
## 5 BMW NA 2332 2332 NA NA 0 0 2332 2332
## 6 CHERY NA 9 9 NA NA 0 0 9 9
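The `slice(-c(49,50))` call above works only while the file has exactly 50 rows. If the two irrelevant rows are always the last two, they can be dropped relative to the row count instead. The sketch below uses a toy tibble: the three brand rows come from the output above, while the "TOTAL" label and the shape of the trailing rows are assumptions, since the original does not show them.

```r
library(dplyr)

# Toy stand-in for raw_data: three brand rows from the output above, followed by
# an assumed total row and an empty row (the real trailing rows are not shown).
toy_raw <- tibble(
  X__1 = c("ALFA ROMEO", "ASTON MARTIN", "AUDI", "TOTAL", NA),
  X__3 = c(58, 4, 2148, 2210, NA)
)

# Keep everything except the last two rows, whatever the total row count is
trimmed <- toy_raw %>% slice(seq_len(n() - 2))

trimmed
```

If other monthly files place their total and empty rows differently, this would need adjusting, so it is worth inspecting `tail(raw_data)` before relying on it.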
Make Data Civilized

To make the data standardized and workable, we need to define column names and replace NA values with 0 for this example. Please use the same column names in your own examples as well.
# Use the same column names in your data.
colnames(raw_data) <- c("brand_name","auto_dom","auto_imp","auto_total","comm_dom","comm_imp","comm_total","total_dom","total_imp","total_total")
# Now we replace NA values with 0 and label the time period with year and month, so when we merge the data we won't be confused.
car_data_apr_16 <- raw_data %>% mutate_if(is.numeric,funs(ifelse(is.na(.),0,.))) %>% mutate(year=2016,month=4)
print(car_data_apr_16,width=Inf)
## # A tibble: 48 x 12
## brand_name auto_dom auto_imp auto_total comm_dom comm_imp comm_total
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALFA ROMEO 0 58 58 0 0 0
## 2 ASTON MARTIN 0 4 4 0 0 0
## 3 AUDI 0 2148 2148 0 0 0
## 4 BENTLEY 0 0 0 0 0 0
## 5 BMW 0 2332 2332 0 0 0
## 6 CHERY 0 9 9 0 0 0
## 7 CITROEN 0 1684 1684 95 670 765
## 8 DACIA 0 3555 3555 0 512 512
## 9 DS 0 35 35 0 0 0
## 10 FERRARI 0 4 4 0 0 0
## total_dom total_imp total_total year month
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 58 58 2016 4
## 2 0 4 4 2016 4
## 3 0 2148 2148 2016 4
## 4 0 0 0 2016 4
## 5 0 2332 2332 2016 4
## 6 0 9 9 2016 4
## 7 95 2354 2449 2016 4
## 8 0 4067 4067 2016 4
## 9 0 35 35 2016 4
## 10 0 4 4 2016 4
## # ... with 38 more rows
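The NA replacement above uses `mutate_if()` with `funs()`, which matches the dplyr 0.7.6 shown in the attach messages; `funs()` was deprecated in dplyr 0.8.0. A possible equivalent for newer versions is sketched below on a toy tibble whose values are copied from the raw output earlier. It assumes dplyr 1.0.0 or later for `across()`, and the final consistency check is an addition, not something the original post performs.

```r
library(dplyr)

# Toy rows shaped like raw_data after renaming; values copied from the raw
# output above, with NA where the raw file has empty cells.
toy <- tibble(
  brand_name = c("ALFA ROMEO", "AUDI", "BMW"),
  auto_dom   = c(NA_real_, NA_real_, NA_real_),
  auto_imp   = c(58, 2148, 2332),
  auto_total = c(58, 2148, 2332)
)

# across() needs dplyr >= 1.0.0; coalesce() swaps each NA for 0
clean <- toy %>%
  mutate(across(where(is.numeric), ~ coalesce(.x, 0)))

# Sanity check: domestic plus imported should add up to the reported total
stopifnot(all(clean$auto_dom + clean$auto_imp == clean$auto_total))
```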
Save Your Civilized Data

One of the best ways to keep your data is to save it to an RDS or RData file. The difference is that an RDS file holds a single object, whereas an RData file can hold many. Since we have only one data frame here, we will go with RDS.
saveRDS(car_data_apr_16,file="C:/Users/LENOVO/Desktop/odd_car_sales_data_apr_16.rds")
# You can read that file back with readRDS and assign the result to an object
# e.g.
# rds_data <- readRDS("C:/Users/LENOVO/Desktop/odd_car_sales_data_apr_16.rds")
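The difference between the two formats can be seen in a small round trip. The sketch below writes to temporary files rather than the Desktop path used above; the data frame reuses two rows shown earlier, and the `note` object is a hypothetical second object added only to show that an RData file can hold more than one.

```r
# A small data frame built from rows shown earlier in this post
df <- data.frame(brand_name = c("DACIA", "CITROEN"), total_total = c(4067, 2449))
note <- "April 2016"  # hypothetical second object, for illustration only

# RDS: one object per file; the object name is chosen when reading
rds_path <- tempfile(fileext = ".rds")
saveRDS(df, rds_path)
df_again <- readRDS(rds_path)

# RData: several objects per file; load() restores them under their saved names
rdata_path <- tempfile(fileext = ".RData")
save(df, note, file = rdata_path)
restored <- new.env()
load(rdata_path, envir = restored)

identical(df, df_again)  # the RDS round trip returns the same data frame
ls(restored)             # names of the objects stored in the RData file

# Clean up the temporary files
unlink(c(rds_path, rdata_path))
```

Loading into a separate environment avoids silently overwriting objects of the same name in the workspace, which is a common surprise with `load()`.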
Finish With Some Analysis

You are free to do any analysis here. I wanted to see the total sales of brands that sell both automobiles and commercial vehicles, ordered by decreasing total sales.
car_data_apr_16 %>%
  filter(auto_total > 0 & comm_total > 0) %>%
  select(brand_name, total_total) %>%
  arrange(desc(total_total))
## # A tibble: 14 x 2
## brand_name total_total
## <chr> <dbl>
## 1 RENAULT 12075
## 2 VOLKSWAGEN 10485
## 3 FIAT 9521
## 4 FORD 9151
## 5 HYUNDAI 4608
## 6 TOYOTA 4474
## 7 DACIA 4067
## 8 MERCEDES-BENZ 2976
## 9 NISSAN 2810
## 10 PEUGEOT 2571
## 11 CITROEN 2449
## 12 KIA 1785
## 13 MITSUBISHI 414
## 14 SSANGYONG 53
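As one more example of the kind of analysis that is possible, the share of imported vehicles in each brand's total can be computed from the same columns. The sketch below runs on a toy tibble with two rows copied from the civilized data printed earlier; the `imp_share` column name is an assumption introduced here.

```r
library(dplyr)

# Two rows copied from the civilized data printed earlier
toy <- tibble(
  brand_name  = c("CITROEN", "DACIA"),
  total_dom   = c(95, 0),
  total_imp   = c(2354, 4067),
  total_total = c(2449, 4067)
)

# Share of imported vehicles in each brand's total sales, lowest share first
import_share <- toy %>%
  mutate(imp_share = total_imp / total_total) %>%
  arrange(imp_share) %>%
  select(brand_name, imp_share)

import_share
```

On the full data set, brands with zero total sales (BENTLEY in April 2016) would produce a division by zero, so a `filter(total_total > 0)` step before `mutate()` is probably needed.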