First I need to open xlsx file. For this, I need xlsx library. In order to install the this library, I simply use “library(xlsx)” library function code for after usage of this.
library(xlsx)
“read.xlsx()” functions provides to upload data to R platform. By this, I can easily manuplate,change and play with data. With specific second code “1”, I manage the spreadsheet number which is needed for this function requirement.
data <- read.xlsx("automative.xlsx", 1 )
After that cleaning data is a must. For this reason, I must select and delete some columns which have NA values in it. with -c() structure, I include the some rows in data[] body and delete the rows which I don’t want.
data <- data[-c(1:5,51,52),]
head(data)
## NA. NA..1 NA..2 NA..3 NA..4 NA..5 NA..6 NA..7 NA..8 NA..9
## 6 ALFA ROMEO <NA> 13 13 <NA> <NA> 0 0 13 13
## 7 ASTON MARTIN <NA> 1 1 <NA> <NA> 0 0 1 1
## 8 AUDI <NA> 1552 1552 <NA> <NA> 0 0 1552 1552
## 9 BENTLEY <NA> 0 0 <NA> <NA> 0 0 0 0
## 10 BMW <NA> 1586 1586 <NA> <NA> 0 0 1586 1586
## 11 CHERY <NA> <NA> 0 <NA> <NA> 0 0 0 0
Column names are mess. For that, I need to change the column names with reasonable types of names. with a “colnames()” function of R, I can manage and put as variable the names of columns. After that, with a vector, I simply put the required names to this data type.
colnames(data) <- c("brand_name","auto_dom","auto_imp","auto_total","comm_dom","comm_imp","comm_total","total_dom","total_imp","total_total")
Here comes the tidyverse library. Tidyverse library is so intense and powerful in R. So that, I can create chains of functions with %>% pipe operator to mutate some columns as well as I can filter some data in data set.
library(tidyverse)
## -- Attaching packages --------------------- tidyverse 1.2.1 --
## <U+221A> ggplot2 3.0.0 <U+221A> purrr 0.2.5
## <U+221A> tibble 1.4.2 <U+221A> dplyr 0.7.6
## <U+221A> tidyr 0.8.1 <U+221A> stringr 1.3.1
## <U+221A> readr 1.1.1 <U+221A> forcats 0.3.0
## -- Conflicts ------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
car_data_march_18 <- data %>% mutate_if(is.numeric,funs(ifelse(is.na(.),0,.))) %>% mutate(year=2018,month=3)
print(car_data_march_18,width=Inf)
## brand_name auto_dom auto_imp auto_total comm_dom comm_imp comm_total
## 1 ALFA ROMEO <NA> 13 13 <NA> <NA> 0
## 2 ASTON MARTIN <NA> 1 1 <NA> <NA> 0
## 3 AUDI <NA> 1552 1552 <NA> <NA> 0
## 4 BENTLEY <NA> 0 0 <NA> <NA> 0
## 5 BMW <NA> 1586 1586 <NA> <NA> 0
## 6 CHERY <NA> <NA> 0 <NA> <NA> 0
## 7 CITROEN <NA> 1086 1086 <NA> 706 706
## 8 DACIA <NA> 3600 3600 <NA> 335 335
## 9 DS <NA> 81 81 <NA> <NA> 0
## 10 FERRARI <NA> 2 2 <NA> <NA> 0
## 11 FIAT 4496 142 4638 3392 401 3793
## 12 FORD 412 2638 3050 5217 251 5468
## 13 HONDA 2710 469 3179 <NA> <NA> 0
## 14 HYUNDAI 2013 2376 4389 <NA> 28 28
## 15 INFINITI <NA> 2 2 <NA> <NA> 0
## 16 ISUZU <NA> <NA> 0 123 19 142
## 17 IVECO <NA> <NA> 0 <NA> 161 161
## 18 JAGUAR <NA> 24 24 <NA> <NA> 0
## 19 JEEP <NA> 255 255 <NA> <NA> 0
## 20 KARSAN <NA> <NA> 0 100 <NA> 100
## 21 KIA <NA> 872 872 <NA> 207 207
## 22 LAMBORGHINI <NA> 0 0 <NA> <NA> 0
## 23 LAND ROVER <NA> 162 162 <NA> <NA> 0
## 24 LEXUS <NA> 8 8 <NA> <NA> 0
## 25 MASERATI <NA> 6 6 <NA> <NA> 0
## 26 MAZDA <NA> 135 135 <NA> <NA> 0
## 27 MERCEDES-BENZ <NA> 2031 2031 <NA> 1121 1121
## 28 MINI <NA> 116 116 <NA> <NA> 0
## 29 MITSUBISHI <NA> 21 21 24 403 427
## 30 NISSAN <NA> 2908 2908 <NA> 197 197
## 31 OPEL <NA> 2989 2989 <NA> <NA> 0
## 32 PEUGEOT <NA> 3038 3038 <NA> 485 485
## 33 PORSCHE <NA> 61 61 <NA> <NA> 0
## 34 RENAULT 7127 1821 8948 <NA> 1087 1087
## 35 SEAT <NA> 1226 1226 <NA> <NA> 0
## 36 SKODA <NA> 2911 2911 <NA> <NA> 0
## 37 SMART <NA> 4 4 <NA> <NA> 0
## 38 SSANGYONG <NA> 38 38 <NA> 6 6
## 39 SUBARU <NA> 224 224 <NA> <NA> 0
## 40 SUZUKI <NA> 323 323 <NA> <NA> 0
## 41 TOYOTA 2961 294 3255 <NA> 310 310
## 42 VOLKSWAGEN <NA> 6561 6561 <NA> 1974 1974
## 43 VOLVO <NA> 503 503 <NA> <NA> 0
## 44 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 45 TOPLAM: 19719 40079 59798 8856 7691 16547
## total_dom total_imp total_total year month
## 1 0 13 13 2018 3
## 2 0 1 1 2018 3
## 3 0 1552 1552 2018 3
## 4 0 0 0 2018 3
## 5 0 1586 1586 2018 3
## 6 0 0 0 2018 3
## 7 0 1792 1792 2018 3
## 8 0 3935 3935 2018 3
## 9 0 81 81 2018 3
## 10 0 2 2 2018 3
## 11 7888 543 8431 2018 3
## 12 5629 2889 8518 2018 3
## 13 2710 469 3179 2018 3
## 14 2013 2404 4417 2018 3
## 15 0 2 2 2018 3
## 16 123 19 142 2018 3
## 17 0 161 161 2018 3
## 18 0 24 24 2018 3
## 19 0 255 255 2018 3
## 20 100 0 100 2018 3
## 21 0 1079 1079 2018 3
## 22 0 0 0 2018 3
## 23 0 162 162 2018 3
## 24 0 8 8 2018 3
## 25 0 6 6 2018 3
## 26 0 135 135 2018 3
## 27 0 3152 3152 2018 3
## 28 0 116 116 2018 3
## 29 24 424 448 2018 3
## 30 0 3105 3105 2018 3
## 31 0 2989 2989 2018 3
## 32 0 3523 3523 2018 3
## 33 0 61 61 2018 3
## 34 7127 2908 10035 2018 3
## 35 0 1226 1226 2018 3
## 36 0 2911 2911 2018 3
## 37 0 4 4 2018 3
## 38 0 44 44 2018 3
## 39 0 224 224 2018 3
## 40 0 323 323 2018 3
## 41 2961 604 3565 2018 3
## 42 0 8535 8535 2018 3
## 43 0 503 503 2018 3
## 44 <NA> <NA> <NA> 2018 3
## 45 28575 47770 76345 2018 3
I must save the data in a safe folder to continue my progress. In order to do this, “saveRDS()” function helps me.
saveRDS(car_data_march_18,file="C:\\Users\\Baris\\Desktop\\Big Data Analytics MEF 2018-2019\\TERM1\\Data Anaytics Essential\\odd_car_sales_data_sep_18.rds")
Again with tidyverse I filter, select and arrange the columns of data set. Filter works with boolean operatos and select() just selects the required columns which I described. arrange() simply arranges the data key to described column in it.
car_data_march_18 %>%
filter(as.numeric(auto_total) > 0 & as.numeric(comm_total) > 0) %>%
select(brand_name,total_total) %>%
arrange(desc(total_total))
## brand_name total_total
## 1 VOLKSWAGEN 8535
## 2 FORD 8518
## 3 FIAT 8431
## 4 DS 81
## 5 LEXUS 8
## 6 TOPLAM: 76345
## 7 PORSCHE 61
## 8 MASERATI 6
## 9 VOLVO 503
## 10 MITSUBISHI 448
## 11 HYUNDAI 4417
## 12 SSANGYONG 44
## 13 SMART 4
## 14 DACIA 3935
## 15 TOYOTA 3565
## 16 PEUGEOT 3523
## 17 SUZUKI 323
## 18 HONDA 3179
## 19 MERCEDES-BENZ 3152
## 20 NISSAN 3105
## 21 OPEL 2989
## 22 SKODA 2911
## 23 JEEP 255
## 24 JAGUAR 24
## 25 SUBARU 224
## 26 FERRARI 2
## 27 INFINITI 2
## 28 CITROEN 1792
## 29 LAND ROVER 162
## 30 IVECO 161
## 31 BMW 1586
## 32 AUDI 1552
## 33 ISUZU 142
## 34 MAZDA 135
## 35 ALFA ROMEO 13
## 36 SEAT 1226
## 37 MINI 116
## 38 KIA 1079
## 39 RENAULT 10035
## 40 KARSAN 100
## 41 ASTON MARTIN 1
## 42 BENTLEY 0
## 43 CHERY 0
## 44 LAMBORGHINI 0
An lastly, I filter and original table with filter, select, mutate and arrange. First I filter numeric data if these variables which I described are positive. After that, I calculate differences of comm_total and auto_total and values which comm_total bigger than auto_total. After I select the mutated values and arrange them from biggest to lowest.
car_data_march_18 %>%
filter(as.numeric(auto_total) > 0 & as.numeric(comm_total) > 0) %>%
mutate(CommAutoDif = as.numeric(comm_total) >as.numeric(auto_total),CommAutoDif2 = as.numeric(auto_total) - as.numeric(comm_total)) %>%
select(brand_name,total_total,CommAutoDif,CommAutoDif2) %>%
arrange(desc(CommAutoDif2))
## brand_name total_total CommAutoDif CommAutoDif2
## 1 DS 81 FALSE 35
## 2 RENAULT 10035 FALSE 35
## 3 LEXUS 8 FALSE 34
## 4 PORSCHE 61 FALSE 32
## 5 MASERATI 6 FALSE 31
## 6 VOLVO 503 FALSE 29
## 7 KIA 1079 FALSE 27
## 8 SMART 4 FALSE 26
## 9 VOLKSWAGEN 8535 FALSE 25
## 10 TOPLAM: 76345 FALSE 24
## 11 SUZUKI 323 FALSE 22
## 12 HONDA 3179 FALSE 21
## 13 OPEL 2989 FALSE 18
## 14 HYUNDAI 4417 FALSE 17
## 15 SKODA 2911 FALSE 17
## 16 FIAT 8431 FALSE 15
## 17 JEEP 255 FALSE 15
## 18 JAGUAR 24 FALSE 14
## 19 SUBARU 224 FALSE 13
## 20 DACIA 3935 FALSE 12
## 21 TOYOTA 3565 FALSE 12
## 22 FERRARI 2 FALSE 10
## 23 INFINITI 2 FALSE 10
## 24 LAND ROVER 162 FALSE 9
## 25 NISSAN 3105 FALSE 9
## 26 BMW 1586 FALSE 8
## 27 MERCEDES-BENZ 3152 FALSE 8
## 28 SSANGYONG 44 FALSE 8
## 29 AUDI 1552 FALSE 7
## 30 MAZDA 135 FALSE 6
## 31 ALFA ROMEO 13 FALSE 5
## 32 FORD 8518 FALSE 4
## 33 PEUGEOT 3523 FALSE 4
## 34 SEAT 1226 FALSE 4
## 35 MINI 116 FALSE 3
## 36 ASTON MARTIN 1 FALSE 1
## 37 BENTLEY 0 FALSE 0
## 38 CHERY 0 FALSE 0
## 39 LAMBORGHINI 0 FALSE 0
## 40 KARSAN 100 TRUE -1
## 41 MITSUBISHI 448 TRUE -2
## 42 ISUZU 142 TRUE -4
## 43 IVECO 161 TRUE -5
## 44 CITROEN 1792 TRUE -16