In week-2 we studied with tidyverse package and some functions in this package. This homework is about it. After we downloaded data from ODD official website, we changed the name to ODD_Retail_Sales_201701.xlsx. We will make some example from raw data to final analysis below.
Our raw excel file is in our repository. We can automatically download that file and put it in a temporary file. Then we can read that excel document into R and remove the temp file.
# Download file from repository to the temp file
# Remove the temp file
tmp<-tempfile(fileext=".xlsx")
download.file("https://github.com/MEF-BDA503/pj18-ucarsal/blob/master/week2/odd_retail_sales_2017_01.xlsx?raw=true",destfile=tmp,mode = 'wb')
raw_data<-readxl::read_excel(tmp,skip=7,col_names=FALSE)
file.remove(tmp)
## [1] TRUE
head(raw_data)
## # A tibble: 6 x 10
## X__1 X__2 X__3 X__4 X__5 X__6 X__7 X__8 X__9 X__10
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALFA ROMEO NA 17 17 NA NA 0 0 17 17
## 2 ASTON MARTIN NA 1 1 NA NA 0 0 1 1
## 3 AUDI NA 625 625 NA NA 0 0 625 625
## 4 BENTLEY NA 1 1 NA NA 0 0 1 1
## 5 BMW NA 570 570 NA NA 0 0 570 570
## 6 CHERY NA 0 0 NA NA 0 0 0 0
It’s ok but needs some work.
In order to make the data standardized and workable we need to define column names and remove NA values for this example. Please use the same column names in your examples also.
#Firstly we should specify the library to use.
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------------------------------------------------ tidyverse 1.2.1 --
## <U+221A> ggplot2 3.0.0 <U+221A> purrr 0.2.5
## <U+221A> tibble 1.4.2 <U+221A> dplyr 0.7.6
## <U+221A> tidyr 0.8.1 <U+221A> stringr 1.3.1
## <U+221A> readr 1.1.1 <U+221A> forcats 0.3.0
## -- Conflicts --------------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Use the same column names in your data.
colnames(raw_data) <- c("brand_name","auto_dom","auto_imp","auto_total","comm_dom","comm_imp","comm_total","total_dom","total_imp","total_total")
# Now we replace NA values with 0 and label the time period with year and month, so when we merge the data we won't be confused.
car_data_jan_17 <- raw_data %>% mutate_if(is.numeric,funs(ifelse(is.na(.),0,.))) %>% mutate(year=2017,month=1)
print(car_data_jan_17,width=Inf)
## # A tibble: 45 x 12
## brand_name auto_dom auto_imp auto_total comm_dom comm_imp comm_total
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALFA ROMEO 0 17 17 0 0 0
## 2 ASTON MARTIN 0 1 1 0 0 0
## 3 AUDI 0 625 625 0 0 0
## 4 BENTLEY 0 1 1 0 0 0
## 5 BMW 0 570 570 0 0 0
## 6 CHERY 0 0 0 0 0 0
## 7 CITROEN 0 515 515 3 438 441
## 8 DACIA 0 1471 1471 0 235 235
## 9 DS 0 9 9 0 0 0
## 10 FERRARI 0 0 0 0 0 0
## total_dom total_imp total_total year month
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 17 17 2017 1
## 2 0 1 1 2017 1
## 3 0 625 625 2017 1
## 4 0 1 1 2017 1
## 5 0 570 570 2017 1
## 6 0 0 0 2017 1
## 7 3 953 956 2017 1
## 8 0 1706 1706 2017 1
## 9 0 9 9 2017 1
## 10 0 0 0 2017 1
## # ... with 35 more rows
One of the best methods is to save your data to an RDS or RData file. The difference is RDS can hold only one object but RData can hold many. Since we have only one data frame here we will go with RDS.
saveRDS(car_data_jan_17,file="odd_car_sales_data_jan_17.rds")
# You can read that file by readRDS and assigning to an object
I just have a look the percentage of commercial vehicle sales over total sales. And I ordered the dataframe according to total commercial sales brand by brand.
# A new column is added named as perc_comm. That is the percentage of commercial sales to total sales.
car_data_jan_17<-car_data_jan_17%>%mutate(perc_comm=(comm_total/total_total)*100)
#How to select columns.
car_data_jan_17%>%
filter(comm_total>0)%>%
select(year,month,brand_name,comm_total,total_total,perc_comm)%>%
arrange(desc(comm_total))
## # A tibble: 17 x 6
## year month brand_name comm_total total_total perc_comm
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 2017 1 FORD 2978 4511 66.0
## 2 2017 1 FIAT 2245 3866 58.1
## 3 2017 1 VOLKSWAGEN 1255 4314 29.1
## 4 2017 1 RENAULT 519 4874 10.6
## 5 2017 1 CITROEN 441 956 46.1
## 6 2017 1 TOYOTA 356 2283 15.6
## 7 2017 1 MERCEDES-BENZ 343 842 40.7
## 8 2017 1 DACIA 235 1706 13.8
## 9 2017 1 KIA 193 619 31.2
## 10 2017 1 MITSUBISHI 190 218 87.2
## 11 2017 1 ISUZU 173 173 100
## 12 2017 1 PEUGEOT 168 805 20.9
## 13 2017 1 NISSAN 167 1321 12.6
## 14 2017 1 IVECO 146 146 100
## 15 2017 1 HYUNDAI 134 2357 5.69
## 16 2017 1 KARSAN 82 82 100
## 17 2017 1 SSANGYONG 9 19 47.4