According to Forbes, the industry has grown by a massive 68% since hitting a trough during the 2009 global financial crises according to a report published by car auction company Manheim earlier this year.
Q3 2016 closed with 9.8M vehicles sold in the used market -an increase of 3.3% over the previous year. Also the average retail used vehicle sold for $19.232 in Q3 2016, an increase of 4.3% over last year. This increase in overall values is fueled by the younger age (4.0 years on average) of vehicles sold at franchise dealers.
Changes in new car buying behavior are beginning to alter the landscape of franchised used vehicles. full article
So both franchised used car firms and other giant online marketplaces like E-Bay are leveraging the growth rate of used car industy. As a result, we tried to understand this market and its dynamics with the help of ‘Used Car Database’.
Our report explores Kaggle’s ‘Used Car Database’ scraped from Ebay-Kleinanzeigen (in German) containing prices and attributes for approximately 370.000 second-hand cars of 40 unique brands. Data contains ads that created between March 2015 and April 2016. Also the data crawled between March 5, 2016 and April 7, 2016.
Our aim is to understand this growing used car market. When we look at used cars, price is the most important factor that influences opinions, but there are also few other facts that affect the purchase decision. So we tried to find out which variables affect the price most and how they do it.
The used car database not only provides vast amount of observations but also presents several variables for exploratory data analysis. We removed some variables, generated others, cleaned the data, formed upper and lower limits, justified some of our initial guesses while failing in others. In addition to several univariate, bivariate and multivariate graphs, we ran a linear regression analysis and worked on maps to make sense of the zipcode.
First of all, we reduced the number of variables from 20 to 15 to explore the dataset. The variables ‘Name’, ‘Offer Type’, ‘Seller’, ‘Number of Pictures’ and ‘AB Test’ are eliminated from the analysis based on sum() and table() queries. According to our analysis, these variables have no significant impact on explaining the data. For example, vast majority of the second-hand car ads have pictures (no distinction) while names of the ads were difficult to analyze (user generated content).
Apart from the elimination, we refined and standardized the variables which include date and time) ‘Date Crawled’, ‘Date Created’, ‘Last Seen’) by “lubridate” package.
That makes the First part of data cleaning is complete.
library(data.table)
library(ggplot2)
library(lubridate)
library(zipcode)
library(dplyr)
setwd("~/bda503/project")
auto <- fread("autos.csv", stringsAsFactors = T)
str(auto)
## Classes 'data.table' and 'data.frame': 371824 obs. of 20 variables:
## $ dateCrawled : Factor w/ 280652 levels "2016-03-05 14:06:22",..: 164643 164374 76603 106256 231281 271106 243057 141578 275651 103170 ...
## $ name : Factor w/ 233701 levels "!!!!!!!!!!!!!!!!!!!!!!!!__GOLF_3_CABRIO___!!!!!!!!!!!!!!!!!!!!!!!",..: 79317 4177 90686 76703 170641 27425 145443 188508 64303 190556 ...
## $ seller : Factor w/ 2 levels "gewerblich","privat": 2 2 2 2 2 2 2 2 2 2 ...
## $ offerType : Factor w/ 2 levels "Angebot","Gesuch": 1 1 1 1 1 1 1 1 1 1 ...
## $ price : int 480 18300 9800 1500 3600 650 2200 0 14500 999 ...
## $ abtest : Factor w/ 2 levels "control","test": 2 2 2 2 2 2 2 2 1 2 ...
## $ vehicleType : Factor w/ 9 levels "","andere","bus",..: 1 5 9 6 6 8 4 8 3 6 ...
## $ yearOfRegistration : int 1993 2011 2004 2001 2008 1995 2004 1980 2014 1998 ...
## $ gearbox : Factor w/ 3 levels "","automatik",..: 3 3 2 3 3 3 3 3 3 3 ...
## $ powerPS : int 0 190 163 75 69 102 109 50 125 101 ...
## $ model : Factor w/ 252 levels "","100","145",..: 120 1 121 120 105 13 10 42 63 120 ...
## $ kilometer : int 150000 125000 125000 150000 90000 150000 150000 40000 30000 150000 ...
## $ monthOfRegistration: int 0 5 8 6 7 10 8 7 8 0 ...
## $ fuelType : Factor w/ 8 levels "","andere","benzin",..: 3 5 5 3 5 3 3 3 3 1 ...
## $ brand : Factor w/ 40 levels "alfa_romeo","audi",..: 39 2 15 39 32 3 26 39 11 39 ...
## $ notRepairedDamage : Factor w/ 3 levels "","ja","nein": 1 2 1 3 3 2 3 3 1 1 ...
## $ dateCreated : Factor w/ 114 levels "2014-03-10 00:00:00",..: 100 100 90 93 107 111 108 97 111 93 ...
## $ nrOfPictures : int 0 0 0 0 0 0 0 0 0 0 ...
## $ postalCode : int 70435 66954 90480 91074 60437 33775 67112 19348 94505 27472 ...
## $ lastSeen : Factor w/ 182904 levels "2016-03-05 14:15:08",..: 178938 178454 163054 45304 172012 175601 165721 89623 157785 126444 ...
## - attr(*, ".internal.selfref")=<externalptr>
auto$offerType <- NULL
auto$name <- NULL
auto$seller <- NULL
auto$nrOfPictures <- NULL
auto$abtest <- NULL
auto$dateCrawled <- ymd_hms(auto$dateCrawled)
auto$dateCreated <- ymd_hms(auto$dateCreated)
auto$lastSeen <- ymd_hms(auto$lastSeen)
We transformed the price by correcting outliers with quantile(). After experimenting with the limits, we decided that “quantile(autoprice, 0.95)” is the best of the both worlds and the data cleaned accordingly. Then, we started to explore what might affect the second-hand cars’ prices.
summary(auto$price)
quantile(auto$price, 0.90)
quantile(auto$price, 0.99)
quantile(auto$price, 0.01)
quantile(auto$price, 0.05)
quantile(auto$price, 0.10)
ggplot(aes(x=vehicleType, y=price), data = auto) +
geom_boxplot() +
ylim(quantile(auto$price, 0.05), quantile(auto$price, 0.95))
p1 <- ggplot(aes(x="price", y=price), data = auto) +
geom_boxplot()
p2 <- ggplot(aes(x="price", y=price), data = auto) +
geom_boxplot() +
ylim(0, quantile(auto$price, 0.99))
p3 <- ggplot(aes(x="price", y=price), data = auto) +
geom_boxplot() +
ylim(0, quantile(auto$price, 0.95))
p4 <- ggplot(aes(x="price", y=price), data = auto) +
geom_boxplot() +
ylim(0, quantile(auto$price, 0.90))
library(gridExtra)
grid.arrange(p1, p2, p3, p4, ncol = 2)
auto <- auto[(price > quantile(auto$price, 0.05)) & (price < quantile(auto$price, 0.95))]
ggplot(aes(x=vehicleType, y=price), data = auto) +
geom_boxplot()
We observed that the engine power (PowerPS) column included extreme outliers, therefore after our first look we corrected them with quantile(). The sample value of engine power is set as NA when the value is lower than 40.
summary(auto$powerPS)
p1 <- ggplot(aes(x=vehicleType, y=powerPS), data = auto) +
geom_boxplot()
p2 <- ggplot(aes(x=vehicleType, y=powerPS), data = auto) +
geom_boxplot() +
ylim(quantile(auto$powerPS, 0.05), quantile(auto$powerPS, 0.95))
grid.arrange(p1, p2, ncol = 1)
auto[(powerPS < quantile(powerPS, 0.05)) | (powerPS > quantile(powerPS, 0.95)), powerPS := NA]
auto[powerPS < 40, powerPS := NA]
ggplot(aes(x=vehicleType, y=powerPS), data = auto) +
geom_boxplot()
Vehicle Type is one of the fundamental categorical variables that has an impact on price (e.g. SUV vs. kleinwagen). Therefore, all the blank rows (vehicleType==“”) are cleaned.
summary(auto$vehicleType)
auto <- auto[vehicleType != ""]
summary(auto$vehicleType)
After visualizing and exploring the raw version of ‘Year of Registration’, we decided the limits to be 1975 and 2016.
summary(auto$yearOfRegistration)
ggplot(aes(x="yearOfRegistration", y=yearOfRegistration), data = auto) +
geom_boxplot()
ggplot(aes(x=vehicleType, y=yearOfRegistration), data = auto) +
geom_boxplot() +
ylim(1975, 2016)
ggplot(aes(x=yearOfRegistration, y=price, alpha = 1/100), data = auto) +
geom_point() +
facet_wrap(~vehicleType) +
xlim(1975, 2016)+
geom_smooth()
auto <- auto[(yearOfRegistration >= 1975) & (yearOfRegistration < 2016)]
We did not clean our dataset according to every variable, rather used data=subset() whenever necessary. The month of registration (month in which the second-hand ad was put on the website) is not analyzed in depth as well. Did not possess strong relationship with any of the fundamental variables, most importantly price.
summary(auto$monthOfRegistration)
table(auto$monthOfRegistration, auto$yearOfRegistration)
table(auto$monthOfRegistration)
auto[monthOfRegistration == 0, monthOfRegistration := NA]
table(auto$monthOfRegistration)
str(auto$monthOfRegistration)
The blank fuel type rows are set as NA, after all we expected fuel type to have considerable impact on second-hand car prices.
summary(auto$fuelType)
table(auto$vehicleType, auto$fuelType)
auto[fuelType == "", fuelType := NA]
table(auto$vehicleType, auto$fuelType)
When we first observed the ‘brand’ variable, we saw that almost 40 different brands exist. This abundance motivated us to focus on vehicle type and a cluster of most popular brands. Later on, we developed auto_subbrand dataset for analysis that focuses on top ten brands. We will come to this shortly.
summary(auto$brand)
table(auto$vehicleType, auto$fuelType)
summary(auto$gearbox)
auto[gearbox == "", gearbox := NA]
summary(auto$gearbox)
The Model variable is not used much because of the extreme number of combination involving both brand and vehicle type.
summary(auto$model)
auto[model == "", model := NA]
summary(auto$model)
The concentration on 150.000 kilometer attracted our attention but we kept it all at first.
summary(auto$kilometer)
ggplot(aes(x=vehicleType, y=kilometer), data = auto) +
geom_boxplot()
NotRepairedDamaged variable is not utilized at all due to its limited structure (Yes/No). It was not possible to measure the degree of the damage on the price.
summary(auto$notRepairedDamage)
table(auto$notRepairedDamage)
auto[notRepairedDamage == "", notRepairedDamage := NA]
table(auto$notRepairedDamage)
str(auto$notRepairedDamage)
We added a new variable as ‘Selling Time’ into the dataset. ‘Selling Time’ is derived from ‘Date Created’ and ‘Last Seen’ by using as.factor().
auto$sellingTime <- as.integer(as.Date(auto$lastSeen) - as.Date(auto$dateCreated))
str(auto$sellingTime)
summary(auto$sellingTime)
table(auto$sellingTime)
max(auto$sellingTime)
min(auto$sellingTime)
auto[sellingTime > 35, sellingTime := NA]
table(max(auto$sellingTime) - auto$sellingTime)
We added a new variable as ‘Age’ into the dataset. ‘Age’ is derived from ‘Year of Registration’ by using as.factor().
auto$age <- as.factor(year(today()) - auto$yearOfRegistration)
The postal code dataset is standardized by using clean.zipcodes.
str(auto$postalCode)
auto$postalCode <- as.factor(auto$postalCode)
str(auto$postalCode)
summary(auto$postalCode)
auto$postalCode = clean.zipcodes(auto$postalCode)
str(auto$postalCode)
summary(auto$postalCode)
library(data.table)
library(ggplot2)
library(lubridate)
library(dplyr)
library(gridExtra)
Limousine, kombi and kleinwagen are the most popular vehicle types in the second-hand market according to our dataset.
ggplot(auto, aes(x=vehicleType)) +
geom_bar(fill='darkgreen', color='black') +
scale_fill_brewer(type= 'div') +
labs(x= 'Vehicle Type', y= 'number of cars') +
ggtitle('Vehicle Type Frequency Diagram')
The mean and median of the PowerPS is around 105.
ggplot(auto, aes(auto$powerPS)) +
geom_histogram(fill= I('#F79420'), color='black', binwidth=15) +
labs(x= 'engine power', y= 'number of cars') +
ggtitle('Histogram of Engine Power (PowerPS)')
## Warning: Removed 38011 rows containing non-finite values (stat_bin).
We noticed that majority of the second-hand cars are sold only within 35 days. The ratio of the first 10 days (day 0 stands for same day sale) is quite high. This shows us that either Ebay-Kleinanzeigen is very successful at targeting customers or the second-hand market is more fluid that we actually thought.
#Selling Time Histogram
ggplot(data=auto, aes(auto$sellingTime)) +
geom_histogram(breaks=seq(0, 35, by = 5),
col="red",
fill="green",
alpha = .2) +
labs(title="Histogram for Selling Time") +
labs(x="Selling Time", y="Count")
## Warning: Removed 294 rows containing non-finite values (stat_bin).
The number of cars with manual gearbox is higher than the automatic ones. This is not surprising considering the ages and kilometers of the second-hand cars.
#Gearbox
ggplot(aes(x= gearbox), data=subset(auto, !is.na(gearbox))) +
geom_bar(color='black', fill='orange') +
labs(x= 'Gearbox', y='Number of Cars', title='Gearbox')
The concentration on the 100.000+ km, particularly 150.000 km, is interesting, even that scale_y_log10() becomes necessary to observe the distribution.
ggplot(aes(auto$kilometer), data=auto) +
geom_bar(color='black', fill='orange') +
scale_y_log10()
labs(x= 'Kilometer', y='Number of Cars', title='Kilometer Histogram')
## $x
## [1] "Kilometer"
##
## $y
## [1] "Number of Cars"
##
## $title
## [1] "Kilometer Histogram"
##
## attr(,"class")
## [1] "labels"
Age variable shows normal distribution with mean and median of approximately 14 years.
ggplot(aes(as.integer(auto$age)), data=auto) +
geom_histogram(color='black', fill='brown') +
scale_x_continuous(limit=c(0, 35), breaks=seq(0, 35, 2)) +
labs(x= 'Car Age', y= 'Number of Cars', title= 'Car Age Histogram')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1214 rows containing non-finite values (stat_bin).
Majority of the second hand cars’ fuel type are benzin or diesel (as expected) but hybrid, elektro, cng stands for emergence of new trends in the second-hand car market.
ggplot(aes(x=fuelType), data=subset(auto, !is.na(fuelType))) +
geom_bar(aes(fill=fuelType), color='black') +
labs(x= 'Fuel Type', y='Number of Cars', title= 'Fuel Type Frequency Diagram')
We observe that SUV is the most expensive vehicle type while kleinwagen is the cheapest. However, kleinwagen have many outliers which may signify either user error or specific higher end brand and model combination.
ggplot(auto,aes(x=vehicleType, y=price)) +
geom_boxplot(aes(fill = vehicleType)) +
stat_summary(fun.y = mean, geom="point", size=2) +
xlab('Vehicle Type') +
ylab('Price') +
ggtitle('Price vs. Vehicle Type')
There is linear correlation between engine power (PowerPS) and price in each vehicle type but after 150 powerPS it is possible to observe non-linarites.
ggplot(data = subset(auto, !is.na(powerPS)), aes(x = powerPS, y = price)) +
geom_point(alpha = 1/50, color = I("#990000"), position = 'jitter') +
geom_smooth() +
facet_wrap(~vehicleType) +
xlab('Engine Power') +
ylab('Price') +
ggtitle('Engine Power vs. Price')
## `geom_smooth()` using method = 'gam'
This graph shows similar trend with the correlation between price and vehicle type. Kleinwagen has the lowest average engine power yet the outliers may stand for specific higher end brand-model combination or user error.
ggplot(data= subset(auto, !is.na(powerPS)), aes(x= vehicleType, y= powerPS)) +
geom_boxplot(alpha = 1/50, color = I("#990000")) +
stat_summary(fun.y = mean, geom="point", size=2) +
xlab('Vehicle Type') +
ylab('Engine Power') +
ggtitle('Engine Power vs. Vehicle Type')
There does not seem to be correlation between selling time and vehicle type. We thought certain vehicle types, such as bus, may require longer selling period which does not seem to be the case.
ggplot(aes(x= vehicleType, y= sellingTime), data=auto) +
geom_boxplot(fill="lightblue", color='black') +
geom_boxplot(aes(fill = vehicleType)) +
stat_summary(fun.y = mean, geom="point", size=2) +
labs(x= 'Vehicle Type', y= 'Selling Time') +
ggtitle('Selling Time vs. Vehicle Type')
## Warning: Removed 294 rows containing non-finite values (stat_boxplot).
## Warning: Removed 294 rows containing non-finite values (stat_boxplot).
## Warning: Removed 294 rows containing non-finite values (stat_summary).
Similarly, we thought we would see a meaningful relationship between selling time and price. There does not seem to be strong correlation.
cor.test(auto$sellingTime, auto$price)
##
## Pearson's product-moment correlation
##
## data: auto$sellingTime and auto$price
## t = 82.327, df = 300660, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1449817 0.1519730
## sample estimates:
## cor
## 0.1484792
ggplot(data = auto , aes(x = sellingTime, y = price)) +
geom_point(alpha = 1/50, color = I("#990000"), position = 'jitter') +
geom_smooth() +
facet_wrap(~vehicleType) +
xlab('Selling Time') +
ylab('Price') +
ggtitle('Selling Time vs. Price')
## `geom_smooth()` using method = 'gam'
## Warning: Removed 294 rows containing non-finite values (stat_smooth).
## Warning: Removed 294 rows containing missing values (geom_point).
Like vehicle type and price, a significant trend between selling time and kilometer did not emerge.
cor.test(auto$sellingTime, auto$kilometer)
##
## Pearson's product-moment correlation
##
## data: auto$sellingTime and auto$kilometer
## t = -38.776, df = 300660, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07409592 -0.06698259
## sample estimates:
## cor
## -0.07054015
ggplot(data = auto, aes(x = sellingTime, y = kilometer)) +
geom_point(alpha = 1/50, color = I("#990000"), position = 'jitter') +
geom_smooth() +
facet_wrap(~vehicleType) +
xlab('Selling Time') +
ylab('Kilometer') +
ggtitle('Kilometer vs. Selling Time')
## `geom_smooth()` using method = 'gam'
## Warning: Removed 294 rows containing non-finite values (stat_smooth).
## Warning: Removed 294 rows containing missing values (geom_point).
Like vehicle type, price and kilometer, a significant trend between selling time and engine power did not emerge.
cor.test(auto$sellingTime, auto$powerPS)
##
## Pearson's product-moment correlation
##
## data: auto$sellingTime and auto$powerPS
## t = 28.575, df = 262690, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.05185358 0.05947806
## sample estimates:
## cor
## 0.05566663
ggplot(data = auto, aes(x = sellingTime, y = powerPS)) +
geom_point(alpha = 1/50, color = I("#990000"), position = 'jitter') +
geom_smooth() +
facet_wrap(~vehicleType) +
xlab('Selling Time') +
ylab('Engine Power') +
ggtitle('Engine Power vs. Selling Time')
## `geom_smooth()` using method = 'gam'
## Warning: Removed 38268 rows containing non-finite values (stat_smooth).
## Warning: Removed 38268 rows containing missing values (geom_point).
As expected, the second-hand cars with automatic gearbox are more expensive than manual ones.
ggplot(aes(x= gearbox, y=price), data=subset(auto, !is.na(gearbox))) +
geom_boxplot(aes(fill=gearbox), color='black') +
stat_summary(fun.y = mean, geom="point", size=2) +
labs(x= 'Gearbox', y= 'Price', title= 'Price vs. Gearbox')
In all vehicle types, the gearbox trend (automatic ones are more expensive) holds yet the manual ones have several outliers.
ggplot(aes(x= vehicleType, y=price), data=subset(auto, !is.na(gearbox))) +
geom_boxplot(aes(fill=gearbox), color='black') +
labs(x= 'Vehiche Type', y= 'Price', title= "Price vs. Vehicle Type by Gearbox")
This is one of the most interesting outcomes that we had along with selling time trends. In all vehicle types, the price continues to decrease between 0-20 years (20 years is the lowest point) but starts increase after between 20-30 years. Maybe, 20+ year old second-hand cars can be considered as ‘antique’ and users’ emotional attachment may cause abnormalities.
ggplot(aes(x=as.integer(age), y=price, alpha = 1/100), data = auto) +
geom_point() +
facet_wrap(~vehicleType) +
geom_smooth() +
labs(x= "Age", y= "Price",
title= "Price vs. Age by Vehicle Type")
## `geom_smooth()` using method = 'gam'
Fuel type does not seem to have significant impact on the correlation between price and gearbox.
ggplot(aes(x= gearbox, y=price), data=subset(auto, !is.na(gearbox))) +
geom_boxplot(aes(fill=gearbox), color='black') +
facet_wrap(~fuelType) +
stat_summary(fun.y = mean, geom="point", size=2) +
labs(x= 'Gearbox', y= 'Price', title= "Price vs. Gearbox by Fuel Type")
Electric cars with manual gearbox have superior engine power performance. This is a divergence from all correlations and one of the most interesting things we have found.
ggplot(aes(x= gearbox, y=powerPS), data=subset(auto, !is.na(gearbox))) +
geom_boxplot(aes(fill=gearbox), color='black') +
facet_wrap(~fuelType) +
stat_summary(fun.y = mean, geom="point", size=2) +
labs(x= 'Gearbox', y= 'Engine Power ', title= "Engine Power vs. Gearbox by Fuel Type")
## Warning: Removed 33230 rows containing non-finite values (stat_boxplot).
## Warning: Removed 33230 rows containing non-finite values (stat_summary).
Selling time does not seem to be affected by the combinations of gearbox and fuel type as well, with the exception of CNG cars (which may indicate either low numbers of observation or emerging market trend).
ggplot(aes(x= gearbox, y=sellingTime), data=subset(auto, !is.na(gearbox))) +
geom_boxplot(aes(fill=gearbox), color='black') +
facet_wrap(~fuelType) +
stat_summary(fun.y = mean, geom="point", size=2) +
labs(x= 'Gearbox', y= 'Selling Time ', title= "Selling Time vs. Gearbox by Fuel Type")
## Warning: Removed 288 rows containing non-finite values (stat_boxplot).
## Warning: Removed 288 rows containing non-finite values (stat_summary).
The electric and CNG cars show longer selling time trend which may indicate that second-hand car market for hybrid cars have not matured yet.
ggplot(data=subset(auto, !is.na(gearbox) & !is.na(fuelType)), aes(x=fuelType, y=sellingTime)) +
geom_boxplot(aes(fill=fuelType), color='black') +
facet_wrap(~gearbox) +
stat_summary(fun.y = mean, geom="point", size=2) +
labs(x= 'Fuel Type', y= 'Selling Time', title= "Selling Time vs. Fuel Type by Gearbox")
## Warning: Removed 276 rows containing non-finite values (stat_boxplot).
## Warning: Removed 276 rows containing non-finite values (stat_summary).
ggplot(aes(x= fuelType, y= price), data= subset(auto, !is.na(fuelType))) +
geom_boxplot(aes(fill= fuelType), color= 'black') +
stat_summary(fun.y = mean, geom="point", size=2) +
labs(x= 'Fuel Type', y='Price', title= 'Price vs. Fuel Type by Vehichle Type') +
facet_wrap(~vehicleType)
As shown previously, significant chunk of cars is sold within the first 10 days. However, the price of the cars is slightly related with the selling time between 10-20 days (the higher the price, the longer the selling time). This trend is especially observable in SUV vehicles.
sellingTimeGroup <- group_by(sellingTime, vehicleType, .data = auto)
auto.seltime_by_price <- summarise(sellingTimeGroup,
price_mean = mean(price),
price_median = median(price),
count = n())
summary(auto.seltime_by_price)
## sellingTime vehicleType price_mean price_median
## Min. : 0.00 bus :37 Min. : 1918 Min. : 1200
## 1st Qu.: 8.00 cabrio :37 1st Qu.: 4552 1st Qu.: 3100
## Median :17.00 kleinwagen:37 Median : 5910 Median : 4492
## Mean :17.33 kombi :37 Mean : 5981 Mean : 4830
## 3rd Qu.:26.00 limousine :37 3rd Qu.: 7027 3rd Qu.: 5850
## Max. :35.00 andere :36 Max. :14750 Max. :14750
## NA's :8 (Other) :72
## count
## Min. : 1
## 1st Qu.: 162
## Median : 466
## Mean : 1027
## 3rd Qu.: 1122
## Max. :12616
##
head(auto.seltime_by_price)
## Source: local data frame [6 x 5]
## Groups: sellingTime [1]
##
## sellingTime vehicleType price_mean price_median count
## <int> <fctr> <dbl> <dbl> <int>
## 1 0 andere 2814.313 1900 428
## 2 0 bus 4308.251 3200 4073
## 3 0 cabrio 5020.976 3490 1812
## 4 0 coupe 4801.940 2950 1544
## 5 0 kleinwagen 1918.010 1200 10279
## 6 0 kombi 3601.546 2499 9129
p1 <- ggplot(auto.seltime_by_price) + geom_smooth(aes(x=sellingTime, y=price_mean, color=vehicleType)) + xlim(0,30)
p2 <- ggplot(auto.seltime_by_price) + geom_smooth(aes(x=sellingTime, y=price_median, color=vehicleType)) + xlim(0,30)
p3 <- ggplot(auto.seltime_by_price) + geom_smooth(aes(x=sellingTime, y=count, color=vehicleType)) + xlim(0,30)
grid.arrange(p1, p2, p3, ncol = 1)
## `geom_smooth()` using method = 'loess'
## Warning: Removed 45 rows containing non-finite values (stat_smooth).
## `geom_smooth()` using method = 'loess'
## Warning: Removed 45 rows containing non-finite values (stat_smooth).
## `geom_smooth()` using method = 'loess'
## Warning: Removed 45 rows containing non-finite values (stat_smooth).
As mentioned earlier, we generated subbrand$brand that signifies ‘Top 10 Brands’ corresponding to nearly 80% of the second-hand cars. ### Top 10 BrandsFrequency Table
d2 <- auto %>%
count(brand) %>%
top_n(10) %>%
arrange(n, brand) %>%
mutate(brand = factor(brand, levels = unique(brand)))
## Selecting by n
d2
## # A tibble: 10 × 2
## brand n
## <fctr> <int>
## 1 seat 5801
## 2 fiat 7744
## 3 peugeot 9680
## 4 renault 14614
## 5 ford 20826
## 6 audi 25977
## 7 mercedes_benz 28464
## 8 opel 32615
## 9 bmw 32820
## 10 volkswagen 64038
auto_subbrand <- subset (auto, brand %in% c("seat", "fiat","peugeot" ,"renault", "ford", "audi", "mercedes_benz" , "opel", "bmw" , "volkswagen"))
ggplot(auto_subbrand, aes(x = reorder(brand, -table(brand)[brand]))) +
geom_bar(color='black', fill= 'darkred') +
labs(x= 'Top 10 Brands', y= 'Number of Cars', ggtitle= 'Top 10 Brands Frequency Diagram')
ggplot(auto_subbrand, aes(x=brand, y= price)) +
geom_boxplot(aes(fill= brand), color= 'black') +
stat_summary(fun.y = mean, geom="point", size=1) +
facet_wrap(~vehicleType) +
labs(x= 'Top 10 Brands', y= 'Price', ggtitle= 'Price vs. Top 10 Brands by Vehicle Type')
library(corrplot)
auto_corr <- subset(auto, !is.na(powerPS))[,.(price,powerPS,kilometer,as.numeric(age))]
colnames(auto_corr) <- c("Price","Engine Power\n(PowerPS)","Kilometer","Age")
corrplot.mixed(cor(auto_corr))
Firstly we splited data into 2 parts. 1-Training Data 2-Validation Data We run regression model in training data than using this model coefficients to predict validation data. We also check variance inflation factor scores and exclude the variables which is higher then 10.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
#Create dummy variebles
auto$bus <- ifelse(auto$vehicleType=="bus",1,0)
auto$cabrio <- ifelse(auto$vehicleType=="cabrio",1,0)
auto$coupe <- ifelse(auto$vehicleType=="coupe",1,0)
auto$kleinwagen <- ifelse(auto$vehicleType=="kleinwagen",1,0)
auto$kombi <- ifelse(auto$vehicleType=="kombi",1,0)
auto$limousine <- ifelse(auto$vehicleType=="limousine",1,0)
auto$suv <- ifelse(auto$vehicleType=="suv",1,0)
auto$automatik <- ifelse(auto$gearbox=="automatik",1,0)
auto$manuell <- ifelse(auto$gearbox=="manuell",1,0)
auto$benzin <- ifelse(auto$fuelType=="benzin",1,0)
auto$diesel <- ifelse(auto$fuelType=="diesel",1,0)
auto$lpg <- ifelse(auto$fuelType=="lpg",1,0)
names(auto)
## [1] "dateCrawled" "price" "vehicleType"
## [4] "yearOfRegistration" "gearbox" "powerPS"
## [7] "model" "kilometer" "monthOfRegistration"
## [10] "fuelType" "brand" "notRepairedDamage"
## [13] "dateCreated" "postalCode" "lastSeen"
## [16] "sellingTime" "age" "bus"
## [19] "cabrio" "coupe" "kleinwagen"
## [22] "kombi" "limousine" "suv"
## [25] "automatik" "manuell" "benzin"
## [28] "diesel" "lpg"
# Split Data
set.seed(100)
subauto <- sample(nrow(auto), floor(nrow(auto)*0.7))
training_data <- auto[subauto,]
validation_data <- auto[-subauto,]
training_data$age_num<-as.numeric(training_data$age)
validation_data$age_num<-as.numeric(validation_data$age)
# multiple regression model
# built model with all variables
str(training_data$age_num)
## num [1:210669] 6 7 5 14 16 16 14 12 9 10 ...
fit <- lm(price ~ kilometer + age_num + powerPS+bus+cabrio+coupe+
kleinwagen+kombi+limousine+suv+
automatik+manuell+benzin+diesel+lpg, data=training_data)
summary(fit) # show results
##
## Call:
## lm(formula = price ~ kilometer + age_num + powerPS + bus + cabrio +
## coupe + kleinwagen + kombi + limousine + suv + automatik +
## manuell + benzin + diesel + lpg, data = training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17537.2 -1590.1 -252.5 1147.3 23118.2
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.888e+03 1.360e+02 72.688 < 2e-16 ***
## kilometer -3.762e-02 2.003e-04 -187.817 < 2e-16 ***
## age_num -3.108e+02 1.337e+00 -232.481 < 2e-16 ***
## powerPS 3.596e+01 2.065e-01 174.151 < 2e-16 ***
## bus 1.324e+02 7.213e+01 1.836 0.0663 .
## cabrio 1.442e+03 7.405e+01 19.472 < 2e-16 ***
## coupe 5.809e+02 7.584e+01 7.659 1.88e-14 ***
## kleinwagen -4.476e+02 7.104e+01 -6.301 2.97e-10 ***
## kombi -6.407e+02 7.088e+01 -9.038 < 2e-16 ***
## limousine -8.112e+01 7.050e+01 -1.151 0.2499
## suv 1.584e+03 7.743e+01 20.454 < 2e-16 ***
## automatik 2.987e+02 1.741e+01 17.157 < 2e-16 ***
## manuell NA NA NA NA
## benzin -7.943e+02 1.150e+02 -6.904 5.07e-12 ***
## diesel 5.271e+02 1.151e+02 4.581 4.64e-06 ***
## lpg -1.408e+03 1.255e+02 -11.219 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2595 on 175102 degrees of freedom
## (35552 observations deleted due to missingness)
## Multiple R-squared: 0.6372, Adjusted R-squared: 0.6372
## F-statistic: 2.197e+04 on 14 and 175102 DF, p-value: < 2.2e-16
#using stepwise, reduce variables if posible
fit.step <- stepAIC(fit)
## Start: AIC=2753302
## price ~ kilometer + age_num + powerPS + bus + cabrio + coupe +
## kleinwagen + kombi + limousine + suv + automatik + manuell +
## benzin + diesel + lpg
##
##
## Step: AIC=2753302
## price ~ kilometer + age_num + powerPS + bus + cabrio + coupe +
## kleinwagen + kombi + limousine + suv + automatik + benzin +
## diesel + lpg
##
## Df Sum of Sq RSS AIC
## - limousine 1 8.9131e+06 1.1790e+12 2753301
## <none> 1.1790e+12 2753302
## - bus 1 2.2699e+07 1.1790e+12 2753303
## - diesel 1 1.4129e+08 1.1791e+12 2753321
## - kleinwagen 1 2.6732e+08 1.1793e+12 2753339
## - benzin 1 3.2095e+08 1.1793e+12 2753347
## - coupe 1 3.9501e+08 1.1794e+12 2753358
## - kombi 1 5.5003e+08 1.1795e+12 2753381
## - lpg 1 8.4752e+08 1.1798e+12 2753425
## - automatik 1 1.9820e+09 1.1810e+12 2753594
## - cabrio 1 2.5529e+09 1.1815e+12 2753678
## - suv 1 2.8170e+09 1.1818e+12 2753717
## - powerPS 1 2.0421e+11 1.3832e+12 2781273
## - kilometer 1 2.3751e+11 1.4165e+12 2785440
## - age_num 1 3.6391e+11 1.5429e+12 2800407
##
## Step: AIC=2753301
## price ~ kilometer + age_num + powerPS + bus + cabrio + coupe +
## kleinwagen + kombi + suv + automatik + benzin + diesel +
## lpg
##
## Df Sum of Sq RSS AIC
## <none> 1.1790e+12 2753301
## - diesel 1 1.4103e+08 1.1791e+12 2753320
## - benzin 1 3.2239e+08 1.1793e+12 2753347
## - bus 1 5.4247e+08 1.1795e+12 2753379
## - lpg 1 8.4930e+08 1.1798e+12 2753425
## - automatik 1 1.9779e+09 1.1810e+12 2753592
## - kleinwagen 1 2.3343e+09 1.1813e+12 2753645
## - coupe 1 2.9411e+09 1.1819e+12 2753735
## - kombi 1 6.4160e+09 1.1854e+12 2754249
## - suv 1 1.4467e+10 1.1935e+12 2755435
## - cabrio 1 2.0938e+10 1.1999e+12 2756382
## - powerPS 1 2.0467e+11 1.3837e+12 2781330
## - kilometer 1 2.3757e+11 1.4166e+12 2785446
## - age_num 1 3.6402e+11 1.5430e+12 2800419
summary(fit.step)
##
## Call:
## lm(formula = price ~ kilometer + age_num + powerPS + bus + cabrio +
## coupe + kleinwagen + kombi + suv + automatik + benzin + diesel +
## lpg, data = training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17536.6 -1590.3 -252.6 1147.3 23117.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.812e+03 1.189e+02 82.539 < 2e-16 ***
## kilometer -3.762e-02 2.003e-04 -187.840 < 2e-16 ***
## age_num -3.107e+02 1.336e+00 -232.516 < 2e-16 ***
## powerPS 3.595e+01 2.062e-01 174.346 < 2e-16 ***
## bus 2.109e+02 2.350e+01 8.976 < 2e-16 ***
## cabrio 1.521e+03 2.728e+01 55.765 < 2e-16 ***
## coupe 6.602e+02 3.159e+01 20.900 < 2e-16 ***
## kleinwagen -3.691e+02 1.982e+01 -18.620 < 2e-16 ***
## kombi -5.618e+02 1.820e+01 -30.869 < 2e-16 ***
## suv 1.663e+03 3.587e+01 46.353 < 2e-16 ***
## automatik 2.983e+02 1.740e+01 17.139 < 2e-16 ***
## benzin -7.960e+02 1.150e+02 -6.920 4.55e-12 ***
## diesel 5.267e+02 1.151e+02 4.577 4.73e-06 ***
## lpg -1.410e+03 1.255e+02 -11.231 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2595 on 175103 degrees of freedom
## (35552 observations deleted due to missingness)
## Multiple R-squared: 0.6372, Adjusted R-squared: 0.6372
## F-statistic: 2.366e+04 on 13 and 175103 DF, p-value: < 2.2e-16
# multicollinearity
library(fmsb)
library(car)
#variance inflation factor
vif(fit)
fitnew <- lm(price ~ kilometer + age_num + powerPS+bus+cabrio+coupe+
+kombi+suv+
automatik+manuell+benzin+diesel+lpg, data=training_data)
summary(fitnew) # show results
##
## Call:
## lm(formula = price ~ kilometer + age_num + powerPS + bus + cabrio +
## coupe + +kombi + suv + automatik + manuell + benzin + diesel +
## lpg, data = training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17597.0 -1589.9 -251.9 1139.9 23136.3
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.457e+03 1.175e+02 80.516 < 2e-16 ***
## kilometer -3.764e-02 2.005e-04 -187.754 < 2e-16 ***
## age_num -3.063e+02 1.316e+00 -232.727 < 2e-16 ***
## powerPS 3.780e+01 1.808e-01 209.031 < 2e-16 ***
## bus 3.411e+02 2.245e+01 15.192 < 2e-16 ***
## cabrio 1.643e+03 2.652e+01 61.947 < 2e-16 ***
## coupe 7.457e+02 3.128e+01 23.837 < 2e-16 ***
## kombi -4.587e+02 1.736e+01 -26.433 < 2e-16 ***
## suv 1.751e+03 3.559e+01 49.183 < 2e-16 ***
## automatik 2.773e+02 1.739e+01 15.948 < 2e-16 ***
## manuell NA NA NA NA
## benzin -8.596e+02 1.151e+02 -7.468 8.17e-14 ***
## diesel 5.013e+02 1.152e+02 4.352 1.35e-05 ***
## lpg -1.461e+03 1.256e+02 -11.631 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2597 on 175104 degrees of freedom
## (35552 observations deleted due to missingness)
## Multiple R-squared: 0.6365, Adjusted R-squared: 0.6365
## F-statistic: 2.555e+04 on 12 and 175104 DF, p-value: < 2.2e-16
library(relaimpo)
calc.relimp(fitnew, type = c("lmg"), rela = TRUE)
vif(fitnew)
# predict values on training set
training_data$predict.price <- predict(fitnew)
## Warning in `[<-.data.table`(x, j = name, value = value): Supplied 175117
## items to be assigned to 210669 items of column 'predict.price' (recycled
## leaving remainder of 35552 items).
training_data$error <- residuals((fitnew))
## Warning in `[<-.data.table`(x, j = name, value = value): Supplied 175117
## items to be assigned to 210669 items of column 'error' (recycled leaving
## remainder of 35552 items).
# predict values on validation set
validation_data$predict.price <- predict(fitnew, newdata = validation_data)
## Warning in predict.lm(fitnew, newdata = validation_data): prediction from a
## rank-deficient fit may be misleading
validation_data$error <- validation_data$predict.price-validation_data$price
# check residuals plots
hist(training_data$error)
hist(validation_data$error)
ggplot(data =validation_data, aes(x=price, y=predict.price) )+
geom_point(alpha=1/100)+
geom_smooth()+
facet_wrap(~vehicleType)
## `geom_smooth()` using method = 'gam'
## Warning: Removed 15387 rows containing non-finite values (stat_smooth).
## Warning: Removed 15387 rows containing missing values (geom_point).
library(zipcode)
library(ggmap)
mapGermany = ggmap(get_map(location = "Germany", zoom = 6), extent="normal")
# Get postalcode, latitude and longitude of Germany
# The zipcode data url is downloaded from "https://www.aggdata.com/free/germany-postal-codes"
library(data.table)
postalcodeGermany <- read.csv("http://timucin.anuslu.com/wp-content/uploads/sites/2/2017/01/de_postal_codes.txt")
postalcodeGermany$placeName <- NULL
postalcodeGermany$state <- NULL
postalcodeGermany$stateAbbreviation <- NULL
postalcodeGermany$city <- NULL
postalcodeGermany$V8 <- NULL
postalcodeGermany$postalCode <- clean.zipcodes(postalcodeGermany$postalCode)
auto_map <- merge(auto, postalcodeGermany, by = "postalCode")
The map shows the vehicle density by vehicle type on USA map.
mapGermany+geom_point(data=auto_map, aes(x=longitude, y=latitude, colour=vehicleType, alpha=1/100))+
labs(x= 'Longitude', y='Latitude', title = 'Vehicle Type on Map')
The map shows the vehicle density by vehicle type on USA map.
auto_subbrand_map <- merge(auto_subbrand, postalcodeGermany, by = "postalCode")
p1 <- mapGermany+geom_point(data=subset(auto_subbrand_map, price < 5000), aes(x=longitude, y=latitude, color=price))+
facet_wrap(~vehicleType) +
scale_color_continuous(low = "yellow", high = "black", space = "Lab", na.value = "grey50",
guide = "colourbar") +
labs(x= 'Longitude', y= 'Latitude', title= "Price by Vehicle Type")
p2 <- mapGermany+geom_point(data=subset(auto_subbrand_map, 5000 <= price & price < 10000), aes(x=longitude, y=latitude, color=price))+
facet_wrap(~vehicleType) +
scale_color_continuous(low = "yellow", high = "black", space = "Lab", na.value = "grey50",
guide = "colourbar") +
labs(x= 'Longitude', y= 'Latitude', title= "Price by Vehicle Type")
p3 <- mapGermany+geom_point(data=subset(auto_subbrand_map, 10000 <= price), aes(x=longitude, y=latitude, color=price))+
facet_wrap(~vehicleType) +
scale_color_continuous(low = "yellow", high = "black", space = "Lab", na.value = "grey50",
guide = "colourbar") +
labs(x= 'Longitude', y= 'Latitude', title= "Price by Vehicle Type")
grid.arrange(p1, p2, p3, nrow = 3)
Any suggestions / comments are appreciated.
Thanks for checking out :)