Team Members of Data Crunchers:

Background

According to Forbes, the industry has grown by a massive 68% since hitting a trough during the 2009 global financial crises according to a report published by car auction company Manheim earlier this year.

Q3 2016 closed with 9.8M vehicles sold in the used market -an increase of 3.3% over the previous year. Also the average retail used vehicle sold for $19.232 in Q3 2016, an increase of 4.3% over last year. This increase in overall values is fueled by the younger age (4.0 years on average) of vehicles sold at franchise dealers.

Changes in new car buying behavior are beginning to alter the landscape of franchised used vehicles. full article

So both franchised used car firms and other giant online marketplaces like E-Bay are leveraging the growth rate of used car industy. As a result, we tried to understand this market and its dynamics with the help of ‘Used Car Database’.

Data Analysis

Our report explores Kaggle’s ‘Used Car Database’ scraped from Ebay-Kleinanzeigen (in German) containing prices and attributes for approximately 370.000 second-hand cars of 40 unique brands. Data contains ads that created between March 2015 and April 2016. Also the data crawled between March 5, 2016 and April 7, 2016.

Our aim is to understand this growing used car market. When we look at used cars, price is the most important factor that influences opinions, but there are also few other facts that affect the purchase decision. So we tried to find out which variables affect the price most and how they do it.

The used car database not only provides vast amount of observations but also presents several variables for exploratory data analysis. We removed some variables, generated others, cleaned the data, formed upper and lower limits, justified some of our initial guesses while failing in others. In addition to several univariate, bivariate and multivariate graphs, we ran a linear regression analysis and worked on maps to make sense of the zipcode.

First of all, we reduced the number of variables from 20 to 15 to explore the dataset. The variables ‘Name’, ‘Offer Type’, ‘Seller’, ‘Number of Pictures’ and ‘AB Test’ are eliminated from the analysis based on sum() and table() queries. According to our analysis, these variables have no significant impact on explaining the data. For example, vast majority of the second-hand car ads have pictures (no distinction) while names of the ads were difficult to analyze (user generated content).

Apart from the elimination, we refined and standardized the variables which include date and time) ‘Date Crawled’, ‘Date Created’, ‘Last Seen’) by “lubridate” package.

That makes the First part of data cleaning is complete.

Data Cleaning

Using Libraries

library(data.table)
library(ggplot2)
library(lubridate)
library(zipcode)
library(dplyr)

Data Getting

setwd("~/bda503/project")
auto <- fread("autos.csv", stringsAsFactors = T)
str(auto)
## Classes 'data.table' and 'data.frame':   371824 obs. of  20 variables:
##  $ dateCrawled        : Factor w/ 280652 levels "2016-03-05 14:06:22",..: 164643 164374 76603 106256 231281 271106 243057 141578 275651 103170 ...
##  $ name               : Factor w/ 233701 levels "!!!!!!!!!!!!!!!!!!!!!!!!__GOLF_3_CABRIO___!!!!!!!!!!!!!!!!!!!!!!!",..: 79317 4177 90686 76703 170641 27425 145443 188508 64303 190556 ...
##  $ seller             : Factor w/ 2 levels "gewerblich","privat": 2 2 2 2 2 2 2 2 2 2 ...
##  $ offerType          : Factor w/ 2 levels "Angebot","Gesuch": 1 1 1 1 1 1 1 1 1 1 ...
##  $ price              : int  480 18300 9800 1500 3600 650 2200 0 14500 999 ...
##  $ abtest             : Factor w/ 2 levels "control","test": 2 2 2 2 2 2 2 2 1 2 ...
##  $ vehicleType        : Factor w/ 9 levels "","andere","bus",..: 1 5 9 6 6 8 4 8 3 6 ...
##  $ yearOfRegistration : int  1993 2011 2004 2001 2008 1995 2004 1980 2014 1998 ...
##  $ gearbox            : Factor w/ 3 levels "","automatik",..: 3 3 2 3 3 3 3 3 3 3 ...
##  $ powerPS            : int  0 190 163 75 69 102 109 50 125 101 ...
##  $ model              : Factor w/ 252 levels "","100","145",..: 120 1 121 120 105 13 10 42 63 120 ...
##  $ kilometer          : int  150000 125000 125000 150000 90000 150000 150000 40000 30000 150000 ...
##  $ monthOfRegistration: int  0 5 8 6 7 10 8 7 8 0 ...
##  $ fuelType           : Factor w/ 8 levels "","andere","benzin",..: 3 5 5 3 5 3 3 3 3 1 ...
##  $ brand              : Factor w/ 40 levels "alfa_romeo","audi",..: 39 2 15 39 32 3 26 39 11 39 ...
##  $ notRepairedDamage  : Factor w/ 3 levels "","ja","nein": 1 2 1 3 3 2 3 3 1 1 ...
##  $ dateCreated        : Factor w/ 114 levels "2014-03-10 00:00:00",..: 100 100 90 93 107 111 108 97 111 93 ...
##  $ nrOfPictures       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ postalCode         : int  70435 66954 90480 91074 60437 33775 67112 19348 94505 27472 ...
##  $ lastSeen           : Factor w/ 182904 levels "2016-03-05 14:15:08",..: 178938 178454 163054 45304 172012 175601 165721 89623 157785 126444 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Cleaning Empty Data and Convert Date

auto$offerType <- NULL
auto$name <- NULL
auto$seller <- NULL
auto$nrOfPictures <- NULL
auto$abtest <- NULL

auto$dateCrawled <- ymd_hms(auto$dateCrawled) 
auto$dateCreated <- ymd_hms(auto$dateCreated)
auto$lastSeen <- ymd_hms(auto$lastSeen)

Price Cleaning

We transformed the price by correcting outliers with quantile(). After experimenting with the limits, we decided that “quantile(autoprice,0.05),quantile(autoprice, 0.95)” is the best of the both worlds and the data cleaned accordingly. Then, we started to explore what might affect the second-hand cars’ prices.

summary(auto$price)
quantile(auto$price, 0.90)
quantile(auto$price, 0.99)
quantile(auto$price, 0.01)
quantile(auto$price, 0.05)
quantile(auto$price, 0.10)

ggplot(aes(x=vehicleType, y=price), data = auto) + 
  geom_boxplot() +
  ylim(quantile(auto$price, 0.05), quantile(auto$price, 0.95))


p1 <- ggplot(aes(x="price", y=price), data = auto) + 
  geom_boxplot()

p2 <- ggplot(aes(x="price", y=price), data = auto) + 
  geom_boxplot() +
  ylim(0, quantile(auto$price, 0.99))

p3 <- ggplot(aes(x="price", y=price), data = auto) + 
  geom_boxplot() +
  ylim(0, quantile(auto$price, 0.95))

p4 <- ggplot(aes(x="price", y=price), data = auto) + 
  geom_boxplot() +
  ylim(0, quantile(auto$price, 0.90))

library(gridExtra)
grid.arrange(p1, p2, p3, p4, ncol = 2)
auto <- auto[(price > quantile(auto$price, 0.05)) & (price < quantile(auto$price, 0.95))]
ggplot(aes(x=vehicleType, y=price), data = auto) + 
  geom_boxplot()

Engine Power Cleaning

We observed that the engine power (PowerPS) column included extreme outliers, therefore after our first look we corrected them with quantile(). The sample value of engine power is set as NA when the value is lower than 40.

summary(auto$powerPS)

p1 <- ggplot(aes(x=vehicleType, y=powerPS), data = auto) + 
  geom_boxplot()
p2 <- ggplot(aes(x=vehicleType, y=powerPS), data = auto) + 
  geom_boxplot() +
  ylim(quantile(auto$powerPS, 0.05), quantile(auto$powerPS, 0.95))

grid.arrange(p1, p2, ncol = 1)
auto[(powerPS < quantile(powerPS, 0.05)) | (powerPS > quantile(powerPS, 0.95)), powerPS := NA]
auto[powerPS  < 40, powerPS := NA]
ggplot(aes(x=vehicleType, y=powerPS), data = auto) + 
  geom_boxplot()

Vehicle Type Cleaning by Removing Empty Cells

Vehicle Type is one of the fundamental categorical variables that has an impact on price (e.g. SUV vs. kleinwagen). Therefore, all the blank rows (vehicleType==“”) are cleaned.

summary(auto$vehicleType)
auto <- auto[vehicleType != ""]
summary(auto$vehicleType)

Year of Registration Cleaning

After visualizing and exploring the raw version of ‘Year of Registration’, we decided the limits to be 1975 and 2016.

summary(auto$yearOfRegistration)
ggplot(aes(x="yearOfRegistration", y=yearOfRegistration), data = auto) + 
  geom_boxplot()

ggplot(aes(x=vehicleType, y=yearOfRegistration), data = auto) + 
  geom_boxplot() +
  ylim(1975, 2016)

ggplot(aes(x=yearOfRegistration, y=price, alpha = 1/100), data = auto) + 
  geom_point() +
  facet_wrap(~vehicleType) +
  xlim(1975, 2016)+
  geom_smooth()
auto <- auto[(yearOfRegistration >= 1975) & (yearOfRegistration < 2016)]

Month of Registration Cleaning

We did not clean our dataset according to every variable, rather used data=subset() whenever necessary. The month of registration (month in which the second-hand ad was put on the website) is not analyzed in depth as well. Did not possess strong relationship with any of the fundamental variables, most importantly price.

summary(auto$monthOfRegistration)
table(auto$monthOfRegistration, auto$yearOfRegistration)
table(auto$monthOfRegistration)
auto[monthOfRegistration == 0, monthOfRegistration := NA]
table(auto$monthOfRegistration)
str(auto$monthOfRegistration)

Fuel Type Cleaning

The blank fuel type rows are set as NA, after all we expected fuel type to have considerable impact on second-hand car prices.

summary(auto$fuelType)
table(auto$vehicleType, auto$fuelType)
auto[fuelType == "", fuelType := NA]
table(auto$vehicleType, auto$fuelType)

Brand Checking to Clean

When we first observed the ‘brand’ variable, we saw that almost 40 different brands exist. This abundance motivated us to focus on vehicle type and a cluster of most popular brands. Later on, we developed auto_subbrand dataset for analysis that focuses on top ten brands. We will come to this shortly.

summary(auto$brand)
table(auto$vehicleType, auto$fuelType)

Gearbox Cleaning

summary(auto$gearbox)
auto[gearbox == "", gearbox := NA]
summary(auto$gearbox)

Model Cleaning

The Model variable is not used much because of the extreme number of combination involving both brand and vehicle type.

summary(auto$model)
auto[model == "", model := NA]
summary(auto$model)

Kilometer Checking to Clean

The concentration on 150.000 kilometer attracted our attention but we kept it all at first.

summary(auto$kilometer)
ggplot(aes(x=vehicleType, y=kilometer), data = auto) + 
  geom_boxplot()

Not Repaired Damage Checking to Clean

NotRepairedDamaged variable is not utilized at all due to its limited structure (Yes/No). It was not possible to measure the degree of the damage on the price.

summary(auto$notRepairedDamage)
table(auto$notRepairedDamage)
auto[notRepairedDamage == "", notRepairedDamage := NA]
table(auto$notRepairedDamage)
str(auto$notRepairedDamage)

Selling Time Variable Creating and Cleaning

We added a new variable as ‘Selling Time’ into the dataset. ‘Selling Time’ is derived from ‘Date Created’ and ‘Last Seen’ by using as.factor().

auto$sellingTime <- as.integer(as.Date(auto$lastSeen) - as.Date(auto$dateCreated))
str(auto$sellingTime)
summary(auto$sellingTime)
table(auto$sellingTime)
max(auto$sellingTime)
min(auto$sellingTime)
auto[sellingTime  > 35, sellingTime := NA]
table(max(auto$sellingTime) - auto$sellingTime)

Age Variable Creating by Using Year of Registration

We added a new variable as ‘Age’ into the dataset. ‘Age’ is derived from ‘Year of Registration’ by using as.factor().

auto$age <- as.factor(year(today()) - auto$yearOfRegistration)

Postal Code Cleaning

The postal code dataset is standardized by using clean.zipcodes.

str(auto$postalCode)
auto$postalCode <- as.factor(auto$postalCode)
str(auto$postalCode)
summary(auto$postalCode)
auto$postalCode = clean.zipcodes(auto$postalCode)
str(auto$postalCode)
summary(auto$postalCode)

Exploratory Data Analysis

Univariate Analysis

Using Libraries

library(data.table)
library(ggplot2)
library(lubridate)
library(dplyr)
library(gridExtra)

Vehicle Type Frequency Diagram

Limousine, kombi and kleinwagen are the most popular vehicle types in the second-hand market according to our dataset.

ggplot(auto, aes(x=vehicleType)) + 
  geom_bar(fill='darkgreen', color='black') +
  scale_fill_brewer(type= 'div') +
  labs(x= 'Vehicle Type', y= 'number of cars') +
  ggtitle('Vehicle Type Frequency Diagram')

Engine Power Histogram

The mean and median of the PowerPS is around 105.

ggplot(auto, aes(auto$powerPS)) +
  geom_histogram(fill= I('#F79420'), color='black', binwidth=15) +
  labs(x= 'engine power', y= 'number of cars') +
  ggtitle('Histogram of Engine Power (PowerPS)')
## Warning: Removed 38011 rows containing non-finite values (stat_bin).

Selling Time Histogram

We noticed that majority of the second-hand cars are sold only within 35 days. The ratio of the first 10 days (day 0 stands for same day sale) is quite high. This shows us that either Ebay-Kleinanzeigen is very successful at targeting customers or the second-hand market is more fluid that we actually thought.

#Selling Time Histogram
ggplot(data=auto, aes(auto$sellingTime)) + 
  geom_histogram(breaks=seq(0, 35, by = 5), 
                 col="red", 
                 fill="green", 
                 alpha = .2) + 
  labs(title="Histogram for Selling Time") +
  labs(x="Selling Time", y="Count")
## Warning: Removed 294 rows containing non-finite values (stat_bin).

Gearbox Frequency Diagram

The number of cars with manual gearbox is higher than the automatic ones. This is not surprising considering the ages and kilometers of the second-hand cars.

#Gearbox
ggplot(aes(x= gearbox), data=subset(auto, !is.na(gearbox))) +
  geom_bar(color='black', fill='orange') +
  labs(x= 'Gearbox', y='Number of Cars', title='Gearbox')

Kilometer Histogram

The concentration on the 100.000+ km, particularly 150.000 km, is interesting, even that scale_y_log10() becomes necessary to observe the distribution.

ggplot(aes(auto$kilometer), data=auto) +
  geom_bar(color='black', fill='orange') +
  scale_y_log10()

  labs(x= 'Kilometer', y='Number of Cars', title='Kilometer Histogram')
## $x
## [1] "Kilometer"
## 
## $y
## [1] "Number of Cars"
## 
## $title
## [1] "Kilometer Histogram"
## 
## attr(,"class")
## [1] "labels"

Age Histogram

Age variable shows normal distribution with mean and median of approximately 14 years.

ggplot(aes(as.integer(auto$age)), data=auto) +
  geom_histogram(color='black', fill='brown') +
  scale_x_continuous(limit=c(0, 35), breaks=seq(0, 35, 2)) +
  labs(x= 'Car Age', y= 'Number of Cars', title= 'Car Age Histogram')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1214 rows containing non-finite values (stat_bin).

Fuel Type Frequency Diagram

Majority of the second hand cars’ fuel type are benzin or diesel (as expected) but hybrid, elektro, cng stands for emergence of new trends in the second-hand car market.

ggplot(aes(x=fuelType), data=subset(auto, !is.na(fuelType))) +
  geom_bar(aes(fill=fuelType), color='black') +
  labs(x= 'Fuel Type', y='Number of Cars', title= 'Fuel Type Frequency Diagram')

Bivariate Analysis

Price vs. Vehicle Type

We observe that SUV is the most expensive vehicle type while kleinwagen is the cheapest. However, kleinwagen have many outliers which may signify either user error or specific higher end brand and model combination.

ggplot(auto,aes(x=vehicleType, y=price)) +
  geom_boxplot(aes(fill = vehicleType)) +
  stat_summary(fun.y = mean, geom="point", size=2) +
  xlab('Vehicle Type') +
  ylab('Price') +
  ggtitle('Price vs. Vehicle Type')

Engine Power vs. Price

There is linear correlation between engine power (PowerPS) and price in each vehicle type but after 150 powerPS it is possible to observe non-linarites.

ggplot(data = subset(auto, !is.na(powerPS)), aes(x = powerPS, y = price)) +
  geom_point(alpha = 1/50, color = I("#990000"), position = 'jitter') +
  geom_smooth() +
  facet_wrap(~vehicleType) +
  xlab('Engine Power') +
  ylab('Price') +
  ggtitle('Engine Power vs. Price')
## `geom_smooth()` using method = 'gam'

Engine Power vs. Vehicle Type

This graph shows similar trend with the correlation between price and vehicle type. Kleinwagen has the lowest average engine power yet the outliers may stand for specific higher end brand-model combination or user error.

ggplot(data= subset(auto, !is.na(powerPS)), aes(x= vehicleType, y= powerPS)) +
  geom_boxplot(alpha = 1/50, color = I("#990000")) +
  stat_summary(fun.y = mean, geom="point", size=2) +
  xlab('Vehicle Type') +
  ylab('Engine Power') +
  ggtitle('Engine Power vs. Vehicle Type')

Selling Time vs. Vehicle Type

There does not seem to be correlation between selling time and vehicle type. We thought certain vehicle types, such as bus, may require longer selling period which does not seem to be the case.

ggplot(aes(x= vehicleType, y= sellingTime), data=auto) +
  geom_boxplot(fill="lightblue", color='black') +
  geom_boxplot(aes(fill = vehicleType)) +
  stat_summary(fun.y = mean, geom="point", size=2) +
  labs(x= 'Vehicle Type', y= 'Selling Time') +
  ggtitle('Selling Time vs. Vehicle Type')
## Warning: Removed 294 rows containing non-finite values (stat_boxplot).

## Warning: Removed 294 rows containing non-finite values (stat_boxplot).
## Warning: Removed 294 rows containing non-finite values (stat_summary).

Selling Time vs. Price

Similarly, we thought we would see a meaningful relationship between selling time and price. There does not seem to be strong correlation.

cor.test(auto$sellingTime, auto$price)
## 
##  Pearson's product-moment correlation
## 
## data:  auto$sellingTime and auto$price
## t = 82.327, df = 300660, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1449817 0.1519730
## sample estimates:
##       cor 
## 0.1484792
ggplot(data = auto , aes(x = sellingTime, y = price)) +
  geom_point(alpha = 1/50, color = I("#990000"), position = 'jitter') +
  geom_smooth() +
  facet_wrap(~vehicleType) +
  xlab('Selling Time') +
  ylab('Price') +
  ggtitle('Selling Time vs. Price')
## `geom_smooth()` using method = 'gam'
## Warning: Removed 294 rows containing non-finite values (stat_smooth).
## Warning: Removed 294 rows containing missing values (geom_point).

Kilometer vs. Selling Time

Like vehicle type and price, a significant trend between selling time and kilometer did not emerge.

cor.test(auto$sellingTime, auto$kilometer)
## 
##  Pearson's product-moment correlation
## 
## data:  auto$sellingTime and auto$kilometer
## t = -38.776, df = 300660, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.07409592 -0.06698259
## sample estimates:
##         cor 
## -0.07054015
ggplot(data = auto, aes(x = sellingTime, y = kilometer)) +
  geom_point(alpha = 1/50, color = I("#990000"), position = 'jitter') +
  geom_smooth() +
  facet_wrap(~vehicleType) +
  xlab('Selling Time') +
  ylab('Kilometer') +
  ggtitle('Kilometer vs. Selling Time')
## `geom_smooth()` using method = 'gam'
## Warning: Removed 294 rows containing non-finite values (stat_smooth).
## Warning: Removed 294 rows containing missing values (geom_point).

Engine Power vs. Selling Time

Like vehicle type, price and kilometer, a significant trend between selling time and engine power did not emerge.

cor.test(auto$sellingTime, auto$powerPS)
## 
##  Pearson's product-moment correlation
## 
## data:  auto$sellingTime and auto$powerPS
## t = 28.575, df = 262690, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.05185358 0.05947806
## sample estimates:
##        cor 
## 0.05566663
ggplot(data = auto, aes(x = sellingTime, y = powerPS)) +
  geom_point(alpha = 1/50, color = I("#990000"), position = 'jitter') +
  geom_smooth() +
  facet_wrap(~vehicleType) +
  xlab('Selling Time') +
  ylab('Engine Power') +
  ggtitle('Engine Power vs. Selling Time')
## `geom_smooth()` using method = 'gam'
## Warning: Removed 38268 rows containing non-finite values (stat_smooth).
## Warning: Removed 38268 rows containing missing values (geom_point).

Price vs. Gearbox

As expected, the second-hand cars with automatic gearbox are more expensive than manual ones.

ggplot(aes(x= gearbox, y=price), data=subset(auto, !is.na(gearbox))) +
  geom_boxplot(aes(fill=gearbox), color='black') +
  stat_summary(fun.y = mean, geom="point", size=2) +
  labs(x= 'Gearbox', y= 'Price', title= 'Price vs. Gearbox')

Price vs. Vehicle Type by Gearbox

In all vehicle types, the gearbox trend (automatic ones are more expensive) holds yet the manual ones have several outliers.

ggplot(aes(x= vehicleType, y=price), data=subset(auto, !is.na(gearbox))) +
  geom_boxplot(aes(fill=gearbox), color='black') +
  labs(x= 'Vehiche Type', y= 'Price', title= "Price vs. Vehicle Type by Gearbox")

Price vs. Age by Vehicle Type

This is one of the most interesting outcomes that we had along with selling time trends. In all vehicle types, the price continues to decrease between 0-20 years (20 years is the lowest point) but starts increase after between 20-30 years. Maybe, 20+ year old second-hand cars can be considered as ‘antique’ and users’ emotional attachment may cause abnormalities.

ggplot(aes(x=as.integer(age), y=price, alpha = 1/100), data = auto) + 
  geom_point() +
  facet_wrap(~vehicleType) +
  geom_smooth() +
  labs(x= "Age", y= "Price", 
       title= "Price vs. Age by Vehicle Type")
## `geom_smooth()` using method = 'gam'

Price vs. Gearbox by Fuel Type

Fuel type does not seem to have significant impact on the correlation between price and gearbox.

ggplot(aes(x= gearbox, y=price), data=subset(auto, !is.na(gearbox))) +
  geom_boxplot(aes(fill=gearbox), color='black') +
  facet_wrap(~fuelType) +
  stat_summary(fun.y = mean, geom="point", size=2) +
  labs(x= 'Gearbox', y= 'Price', title= "Price vs. Gearbox by Fuel Type")

Engine Power vs. Gearbox by Fuel Type

Electric cars with manual gearbox have superior engine power performance. This is a divergence from all correlations and one of the most interesting things we have found.

ggplot(aes(x= gearbox, y=powerPS), data=subset(auto, !is.na(gearbox))) +
  geom_boxplot(aes(fill=gearbox), color='black') +
  facet_wrap(~fuelType) +
  stat_summary(fun.y = mean, geom="point", size=2) +
  labs(x= 'Gearbox', y= 'Engine Power ', title= "Engine Power  vs. Gearbox by Fuel Type")
## Warning: Removed 33230 rows containing non-finite values (stat_boxplot).
## Warning: Removed 33230 rows containing non-finite values (stat_summary).

Selling Time vs. Gearbox by Fuel Type

Selling time does not seem to be affected by the combinations of gearbox and fuel type as well, with the exception of CNG cars (which may indicate either low numbers of observation or emerging market trend).

ggplot(aes(x= gearbox, y=sellingTime), data=subset(auto, !is.na(gearbox))) +
  geom_boxplot(aes(fill=gearbox), color='black') +
  facet_wrap(~fuelType) +
  stat_summary(fun.y = mean, geom="point", size=2) +
  labs(x= 'Gearbox', y= 'Selling Time ', title= "Selling Time  vs. Gearbox by Fuel Type")
## Warning: Removed 288 rows containing non-finite values (stat_boxplot).
## Warning: Removed 288 rows containing non-finite values (stat_summary).

Selling Time vs. Fuel Type by Gearbox

The electric and CNG cars show longer selling time trend which may indicate that second-hand car market for hybrid cars have not matured yet.

ggplot(data=subset(auto, !is.na(gearbox) & !is.na(fuelType)), aes(x=fuelType, y=sellingTime)) +
  geom_boxplot(aes(fill=fuelType), color='black') +
  facet_wrap(~gearbox) +
  stat_summary(fun.y = mean, geom="point", size=2) +
  labs(x= 'Fuel Type', y= 'Selling Time', title= "Selling Time vs. Fuel Type by Gearbox")
## Warning: Removed 276 rows containing non-finite values (stat_boxplot).
## Warning: Removed 276 rows containing non-finite values (stat_summary).

Price vs. Fuel Type by Vehichle Type

ggplot(aes(x= fuelType, y= price), data= subset(auto, !is.na(fuelType))) +
  geom_boxplot(aes(fill= fuelType), color= 'black') +
  stat_summary(fun.y = mean, geom="point", size=2) +
  labs(x= 'Fuel Type', y='Price', title= 'Price vs. Fuel Type by Vehichle Type') +
  facet_wrap(~vehicleType)

Selling Time Price Analysis

As shown previously, significant chunk of cars is sold within the first 10 days. However, the price of the cars is slightly related with the selling time between 10-20 days (the higher the price, the longer the selling time). This trend is especially observable in SUV vehicles.

sellingTimeGroup <- group_by(sellingTime, vehicleType, .data = auto)
auto.seltime_by_price <- summarise(sellingTimeGroup,
                          price_mean = mean(price),
                          price_median = median(price),
                          count = n())

summary(auto.seltime_by_price)
##   sellingTime        vehicleType   price_mean     price_median  
##  Min.   : 0.00   bus       :37   Min.   : 1918   Min.   : 1200  
##  1st Qu.: 8.00   cabrio    :37   1st Qu.: 4552   1st Qu.: 3100  
##  Median :17.00   kleinwagen:37   Median : 5910   Median : 4492  
##  Mean   :17.33   kombi     :37   Mean   : 5981   Mean   : 4830  
##  3rd Qu.:26.00   limousine :37   3rd Qu.: 7027   3rd Qu.: 5850  
##  Max.   :35.00   andere    :36   Max.   :14750   Max.   :14750  
##  NA's   :8       (Other)   :72                                  
##      count      
##  Min.   :    1  
##  1st Qu.:  162  
##  Median :  466  
##  Mean   : 1027  
##  3rd Qu.: 1122  
##  Max.   :12616  
## 
head(auto.seltime_by_price)
## Source: local data frame [6 x 5]
## Groups: sellingTime [1]
## 
##   sellingTime vehicleType price_mean price_median count
##         <int>      <fctr>      <dbl>        <dbl> <int>
## 1           0      andere   2814.313         1900   428
## 2           0         bus   4308.251         3200  4073
## 3           0      cabrio   5020.976         3490  1812
## 4           0       coupe   4801.940         2950  1544
## 5           0  kleinwagen   1918.010         1200 10279
## 6           0       kombi   3601.546         2499  9129
p1 <- ggplot(auto.seltime_by_price) + geom_smooth(aes(x=sellingTime, y=price_mean, color=vehicleType)) + xlim(0,30)
p2 <- ggplot(auto.seltime_by_price) + geom_smooth(aes(x=sellingTime, y=price_median, color=vehicleType)) + xlim(0,30)
p3 <- ggplot(auto.seltime_by_price) + geom_smooth(aes(x=sellingTime, y=count, color=vehicleType)) + xlim(0,30)
grid.arrange(p1, p2, p3, ncol = 1)
## `geom_smooth()` using method = 'loess'
## Warning: Removed 45 rows containing non-finite values (stat_smooth).
## `geom_smooth()` using method = 'loess'
## Warning: Removed 45 rows containing non-finite values (stat_smooth).
## `geom_smooth()` using method = 'loess'
## Warning: Removed 45 rows containing non-finite values (stat_smooth).

Top 10 Brands

As mentioned earlier, we generated subbrand$brand that signifies ‘Top 10 Brands’ corresponding to nearly 80% of the second-hand cars. ### Top 10 BrandsFrequency Table

d2 <- auto %>%
  count(brand) %>%
  top_n(10) %>%
  arrange(n, brand) %>%
  mutate(brand = factor(brand, levels = unique(brand)))
## Selecting by n
d2
## # A tibble: 10 × 2
##            brand     n
##           <fctr> <int>
## 1           seat  5801
## 2           fiat  7744
## 3        peugeot  9680
## 4        renault 14614
## 5           ford 20826
## 6           audi 25977
## 7  mercedes_benz 28464
## 8           opel 32615
## 9            bmw 32820
## 10    volkswagen 64038

Top 10 Brands Frequency Diagram

auto_subbrand <- subset (auto, brand %in% c("seat", "fiat","peugeot" ,"renault", "ford", "audi", "mercedes_benz" , "opel", "bmw" , "volkswagen"))
ggplot(auto_subbrand, aes(x = reorder(brand, -table(brand)[brand]))) +
  geom_bar(color='black', fill= 'darkred') +
  labs(x= 'Top 10 Brands', y= 'Number of Cars', ggtitle= 'Top 10 Brands Frequency Diagram')

Price vs. Top 10 Brands by Vehicle Type

ggplot(auto_subbrand, aes(x=brand, y= price)) +
  geom_boxplot(aes(fill= brand), color= 'black') +
  stat_summary(fun.y = mean, geom="point", size=1) +
  facet_wrap(~vehicleType) +
  labs(x= 'Top 10 Brands', y= 'Price', ggtitle= 'Price vs. Top 10 Brands by Vehicle Type')

Correlation

library(corrplot)
auto_corr <- subset(auto, !is.na(powerPS))[,.(price,powerPS,kilometer,as.numeric(age))]
colnames(auto_corr) <- c("Price","Engine Power\n(PowerPS)","Kilometer","Age")
corrplot.mixed(cor(auto_corr))

Linear Regression Model for Price

Firstly we splited data into 2 parts. 1-Training Data 2-Validation Data We run regression model in training data than using this model coefficients to predict validation data. We also check variance inflation factor scores and exclude the variables which is higher then 10.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
#Create dummy variebles
auto$bus <- ifelse(auto$vehicleType=="bus",1,0)
auto$cabrio <- ifelse(auto$vehicleType=="cabrio",1,0)
auto$coupe <- ifelse(auto$vehicleType=="coupe",1,0)
auto$kleinwagen <- ifelse(auto$vehicleType=="kleinwagen",1,0)
auto$kombi <- ifelse(auto$vehicleType=="kombi",1,0)
auto$limousine <- ifelse(auto$vehicleType=="limousine",1,0)
auto$suv <- ifelse(auto$vehicleType=="suv",1,0)
auto$automatik <- ifelse(auto$gearbox=="automatik",1,0)
auto$manuell <- ifelse(auto$gearbox=="manuell",1,0)
auto$benzin <- ifelse(auto$fuelType=="benzin",1,0)
auto$diesel <- ifelse(auto$fuelType=="diesel",1,0)
auto$lpg <- ifelse(auto$fuelType=="lpg",1,0)

names(auto)
##  [1] "dateCrawled"         "price"               "vehicleType"        
##  [4] "yearOfRegistration"  "gearbox"             "powerPS"            
##  [7] "model"               "kilometer"           "monthOfRegistration"
## [10] "fuelType"            "brand"               "notRepairedDamage"  
## [13] "dateCreated"         "postalCode"          "lastSeen"           
## [16] "sellingTime"         "age"                 "bus"                
## [19] "cabrio"              "coupe"               "kleinwagen"         
## [22] "kombi"               "limousine"           "suv"                
## [25] "automatik"           "manuell"             "benzin"             
## [28] "diesel"              "lpg"
# Split Data
set.seed(100)
subauto <- sample(nrow(auto), floor(nrow(auto)*0.7))
training_data <- auto[subauto,]
validation_data <- auto[-subauto,]

training_data$age_num<-as.numeric(training_data$age)
validation_data$age_num<-as.numeric(validation_data$age)
# multiple regression model
# built model with all variables
str(training_data$age_num)
##  num [1:210669] 6 7 5 14 16 16 14 12 9 10 ...
fit <- lm(price ~ kilometer + age_num + powerPS+bus+cabrio+coupe+
            kleinwagen+kombi+limousine+suv+
            automatik+manuell+benzin+diesel+lpg, data=training_data)
summary(fit) # show results
## 
## Call:
## lm(formula = price ~ kilometer + age_num + powerPS + bus + cabrio + 
##     coupe + kleinwagen + kombi + limousine + suv + automatik + 
##     manuell + benzin + diesel + lpg, data = training_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17537.2  -1590.1   -252.5   1147.3  23118.2 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  9.888e+03  1.360e+02   72.688  < 2e-16 ***
## kilometer   -3.762e-02  2.003e-04 -187.817  < 2e-16 ***
## age_num     -3.108e+02  1.337e+00 -232.481  < 2e-16 ***
## powerPS      3.596e+01  2.065e-01  174.151  < 2e-16 ***
## bus          1.324e+02  7.213e+01    1.836   0.0663 .  
## cabrio       1.442e+03  7.405e+01   19.472  < 2e-16 ***
## coupe        5.809e+02  7.584e+01    7.659 1.88e-14 ***
## kleinwagen  -4.476e+02  7.104e+01   -6.301 2.97e-10 ***
## kombi       -6.407e+02  7.088e+01   -9.038  < 2e-16 ***
## limousine   -8.112e+01  7.050e+01   -1.151   0.2499    
## suv          1.584e+03  7.743e+01   20.454  < 2e-16 ***
## automatik    2.987e+02  1.741e+01   17.157  < 2e-16 ***
## manuell             NA         NA       NA       NA    
## benzin      -7.943e+02  1.150e+02   -6.904 5.07e-12 ***
## diesel       5.271e+02  1.151e+02    4.581 4.64e-06 ***
## lpg         -1.408e+03  1.255e+02  -11.219  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2595 on 175102 degrees of freedom
##   (35552 observations deleted due to missingness)
## Multiple R-squared:  0.6372, Adjusted R-squared:  0.6372 
## F-statistic: 2.197e+04 on 14 and 175102 DF,  p-value: < 2.2e-16
#using stepwise, reduce variables if posible
fit.step <- stepAIC(fit)
## Start:  AIC=2753302
## price ~ kilometer + age_num + powerPS + bus + cabrio + coupe + 
##     kleinwagen + kombi + limousine + suv + automatik + manuell + 
##     benzin + diesel + lpg
## 
## 
## Step:  AIC=2753302
## price ~ kilometer + age_num + powerPS + bus + cabrio + coupe + 
##     kleinwagen + kombi + limousine + suv + automatik + benzin + 
##     diesel + lpg
## 
##              Df  Sum of Sq        RSS     AIC
## - limousine   1 8.9131e+06 1.1790e+12 2753301
## <none>                     1.1790e+12 2753302
## - bus         1 2.2699e+07 1.1790e+12 2753303
## - diesel      1 1.4129e+08 1.1791e+12 2753321
## - kleinwagen  1 2.6732e+08 1.1793e+12 2753339
## - benzin      1 3.2095e+08 1.1793e+12 2753347
## - coupe       1 3.9501e+08 1.1794e+12 2753358
## - kombi       1 5.5003e+08 1.1795e+12 2753381
## - lpg         1 8.4752e+08 1.1798e+12 2753425
## - automatik   1 1.9820e+09 1.1810e+12 2753594
## - cabrio      1 2.5529e+09 1.1815e+12 2753678
## - suv         1 2.8170e+09 1.1818e+12 2753717
## - powerPS     1 2.0421e+11 1.3832e+12 2781273
## - kilometer   1 2.3751e+11 1.4165e+12 2785440
## - age_num     1 3.6391e+11 1.5429e+12 2800407
## 
## Step:  AIC=2753301
## price ~ kilometer + age_num + powerPS + bus + cabrio + coupe + 
##     kleinwagen + kombi + suv + automatik + benzin + diesel + 
##     lpg
## 
##              Df  Sum of Sq        RSS     AIC
## <none>                     1.1790e+12 2753301
## - diesel      1 1.4103e+08 1.1791e+12 2753320
## - benzin      1 3.2239e+08 1.1793e+12 2753347
## - bus         1 5.4247e+08 1.1795e+12 2753379
## - lpg         1 8.4930e+08 1.1798e+12 2753425
## - automatik   1 1.9779e+09 1.1810e+12 2753592
## - kleinwagen  1 2.3343e+09 1.1813e+12 2753645
## - coupe       1 2.9411e+09 1.1819e+12 2753735
## - kombi       1 6.4160e+09 1.1854e+12 2754249
## - suv         1 1.4467e+10 1.1935e+12 2755435
## - cabrio      1 2.0938e+10 1.1999e+12 2756382
## - powerPS     1 2.0467e+11 1.3837e+12 2781330
## - kilometer   1 2.3757e+11 1.4166e+12 2785446
## - age_num     1 3.6402e+11 1.5430e+12 2800419
summary(fit.step)
## 
## Call:
## lm(formula = price ~ kilometer + age_num + powerPS + bus + cabrio + 
##     coupe + kleinwagen + kombi + suv + automatik + benzin + diesel + 
##     lpg, data = training_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17536.6  -1590.3   -252.6   1147.3  23117.0 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  9.812e+03  1.189e+02   82.539  < 2e-16 ***
## kilometer   -3.762e-02  2.003e-04 -187.840  < 2e-16 ***
## age_num     -3.107e+02  1.336e+00 -232.516  < 2e-16 ***
## powerPS      3.595e+01  2.062e-01  174.346  < 2e-16 ***
## bus          2.109e+02  2.350e+01    8.976  < 2e-16 ***
## cabrio       1.521e+03  2.728e+01   55.765  < 2e-16 ***
## coupe        6.602e+02  3.159e+01   20.900  < 2e-16 ***
## kleinwagen  -3.691e+02  1.982e+01  -18.620  < 2e-16 ***
## kombi       -5.618e+02  1.820e+01  -30.869  < 2e-16 ***
## suv          1.663e+03  3.587e+01   46.353  < 2e-16 ***
## automatik    2.983e+02  1.740e+01   17.139  < 2e-16 ***
## benzin      -7.960e+02  1.150e+02   -6.920 4.55e-12 ***
## diesel       5.267e+02  1.151e+02    4.577 4.73e-06 ***
## lpg         -1.410e+03  1.255e+02  -11.231  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2595 on 175103 degrees of freedom
##   (35552 observations deleted due to missingness)
## Multiple R-squared:  0.6372, Adjusted R-squared:  0.6372 
## F-statistic: 2.366e+04 on 13 and 175103 DF,  p-value: < 2.2e-16
# multicollinearity
library(fmsb)
library(car)
#variance inflation factor
vif(fit)
fitnew <- lm(price ~ kilometer + age_num + powerPS+bus+cabrio+coupe+
               +kombi+suv+
               automatik+manuell+benzin+diesel+lpg, data=training_data)
summary(fitnew) # show results
## 
## Call:
## lm(formula = price ~ kilometer + age_num + powerPS + bus + cabrio + 
##     coupe + +kombi + suv + automatik + manuell + benzin + diesel + 
##     lpg, data = training_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17597.0  -1589.9   -251.9   1139.9  23136.3 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  9.457e+03  1.175e+02   80.516  < 2e-16 ***
## kilometer   -3.764e-02  2.005e-04 -187.754  < 2e-16 ***
## age_num     -3.063e+02  1.316e+00 -232.727  < 2e-16 ***
## powerPS      3.780e+01  1.808e-01  209.031  < 2e-16 ***
## bus          3.411e+02  2.245e+01   15.192  < 2e-16 ***
## cabrio       1.643e+03  2.652e+01   61.947  < 2e-16 ***
## coupe        7.457e+02  3.128e+01   23.837  < 2e-16 ***
## kombi       -4.587e+02  1.736e+01  -26.433  < 2e-16 ***
## suv          1.751e+03  3.559e+01   49.183  < 2e-16 ***
## automatik    2.773e+02  1.739e+01   15.948  < 2e-16 ***
## manuell             NA         NA       NA       NA    
## benzin      -8.596e+02  1.151e+02   -7.468 8.17e-14 ***
## diesel       5.013e+02  1.152e+02    4.352 1.35e-05 ***
## lpg         -1.461e+03  1.256e+02  -11.631  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2597 on 175104 degrees of freedom
##   (35552 observations deleted due to missingness)
## Multiple R-squared:  0.6365, Adjusted R-squared:  0.6365 
## F-statistic: 2.555e+04 on 12 and 175104 DF,  p-value: < 2.2e-16
library(relaimpo)
calc.relimp(fitnew, type = c("lmg"), rela = TRUE)
vif(fitnew)
# predict values on training set
training_data$predict.price <- predict(fitnew)
## Warning in `[<-.data.table`(x, j = name, value = value): Supplied 175117
## items to be assigned to 210669 items of column 'predict.price' (recycled
## leaving remainder of 35552 items).
training_data$error <- residuals((fitnew))
## Warning in `[<-.data.table`(x, j = name, value = value): Supplied 175117
## items to be assigned to 210669 items of column 'error' (recycled leaving
## remainder of 35552 items).
# predict values on validation set
validation_data$predict.price <- predict(fitnew, newdata = validation_data)
## Warning in predict.lm(fitnew, newdata = validation_data): prediction from a
## rank-deficient fit may be misleading
validation_data$error <- validation_data$predict.price-validation_data$price
# check residuals plots
hist(training_data$error)

hist(validation_data$error)

ggplot(data =validation_data, aes(x=price, y=predict.price) )+
  geom_point(alpha=1/100)+
  geom_smooth()+
  facet_wrap(~vehicleType)
## `geom_smooth()` using method = 'gam'
## Warning: Removed 15387 rows containing non-finite values (stat_smooth).
## Warning: Removed 15387 rows containing missing values (geom_point).

Data on Map

Using Libraries

library(zipcode)
library(ggmap)

Get USA Map

mapGermany = ggmap(get_map(location = "Germany", zoom = 6), extent="normal")

Postal Code Improvement

# Get postalcode, latitude and longitude of Germany
# The zipcode data url is downloaded from "https://www.aggdata.com/free/germany-postal-codes"
library(data.table)
postalcodeGermany <- read.csv("http://timucin.anuslu.com/wp-content/uploads/sites/2/2017/01/de_postal_codes.txt")
postalcodeGermany$placeName <- NULL
postalcodeGermany$state <- NULL
postalcodeGermany$stateAbbreviation <- NULL
postalcodeGermany$city <- NULL
postalcodeGermany$V8 <- NULL
postalcodeGermany$postalCode <- clean.zipcodes(postalcodeGermany$postalCode)

Merge data with postalcode data

auto_map <- merge(auto, postalcodeGermany, by = "postalCode")

Vehicle Type on Map

The map shows the vehicle density by vehicle type on USA map.

mapGermany+geom_point(data=auto_map, aes(x=longitude, y=latitude, colour=vehicleType, alpha=1/100))+
  labs(x= 'Longitude', y='Latitude', title = 'Vehicle Type on Map')

Price by Vehicle Types on Map

The map shows the vehicle density by vehicle type on USA map.

auto_subbrand_map <- merge(auto_subbrand, postalcodeGermany, by = "postalCode")

p1 <- mapGermany+geom_point(data=subset(auto_subbrand_map, price < 5000), aes(x=longitude, y=latitude, color=price))+
  facet_wrap(~vehicleType) +
  scale_color_continuous(low = "yellow", high = "black", space = "Lab", na.value = "grey50", 
                         guide = "colourbar") +
  labs(x= 'Longitude', y= 'Latitude', title= "Price by Vehicle Type")

p2 <- mapGermany+geom_point(data=subset(auto_subbrand_map, 5000 <= price & price < 10000), aes(x=longitude, y=latitude, color=price))+
  facet_wrap(~vehicleType) +
  scale_color_continuous(low = "yellow", high = "black", space = "Lab", na.value = "grey50", 
                         guide = "colourbar") +
  labs(x= 'Longitude', y= 'Latitude', title= "Price by Vehicle Type")

p3 <- mapGermany+geom_point(data=subset(auto_subbrand_map, 10000 <= price), aes(x=longitude, y=latitude, color=price))+
  facet_wrap(~vehicleType) +
  scale_color_continuous(low = "yellow", high = "black", space = "Lab", na.value = "grey50", 
                         guide = "colourbar") +
  labs(x= 'Longitude', y= 'Latitude', title= "Price by Vehicle Type")

grid.arrange(p1, p2, p3, nrow = 3)

Interactive Data Map with Shiny module

Interactive Data Map Interactive Data Map

Conclusion

Thank You

Any suggestions / comments are appreciated.

Thanks for checking out :)