Team Members of Data Crunchers:

Background

According to Forbes, the industry has grown by a massive 68% since hitting a trough during the 2009 global financial crises according to a report published by car auction company Manheim earlier this year.

Q3 2016 closed with 9.8M vehicles sold in the used market -an increase of 3.3% over the previous year. Also the average retail used vehicle sold for $19.232 in Q3 2016, an increase of 4.3% over last year. This increase in overall values is fueled by the younger age (4.0 years on average) of vehicles sold at franchise dealers.

Changes in new car buying behavior are beginning to alter the landscape of franchised used vehicles. full article

So both franchised used car firms and other giant online marketplaces like E-Bay are leveraging the growth rate of used car industy. As a result, we tried to understand this market and its dynamics with the help of ‘Used Car Database’.

Data Analysis

Our report explores Kaggle’s ‘Used Car Database’ scraped from Ebay-Kleinanzeigen (in German) containing prices and attributes for approximately 370.000 second-hand cars of 40 unique brands. Data contains ads that created between March 2015 and April 2016. Also the data crawled between March 5, 2016 and April 7, 2016.

Our aim is to understand this growing used car market. When we look at used cars, price is the most important factor that influences opinions, but there are also few other facts that affect the purchase decision. So we tried to find out which variables affect the price most and how they do it.

The used car database not only provides vast amount of observations but also presents several variables for exploratory data analysis. We removed some variables, generated others, cleaned the data, formed upper and lower limits, justified some of our initial guesses while failing in others. In addition to several univariate, bivariate and multivariate graphs, we ran a linear regression analysis and worked on maps to make sense of the zipcode.

First of all, we reduced the number of variables from 20 to 15 to explore the dataset. The variables ‘Name’, ‘Offer Type’, ‘Seller’, ‘Number of Pictures’ and ‘AB Test’ are eliminated from the analysis based on sum() and table() queries. According to our analysis, these variables have no significant impact on explaining the data. For example, vast majority of the second-hand car ads have pictures (no distinction) while names of the ads were difficult to analyze (user generated content).

Apart from the elimination, we refined and standardized the variables which include date and time) ‘Date Crawled’, ‘Date Created’, ‘Last Seen’) by “lubridate” package.

That makes the First part of data cleaning is complete.

Data Cleaning

Using Libraries

library(data.table)
library(ggplot2)
library(lubridate)
library(zipcode)
library(dplyr)

Data Getting

setwd("~/bda503/project")
auto <- fread("autos.csv", stringsAsFactors = T)
str(auto)
## Classes 'data.table' and 'data.frame':   371824 obs. of  20 variables:
##  $ dateCrawled        : Factor w/ 280652 levels "2016-03-05 14:06:22",..: 164643 164374 76603 106256 231281 271106 243057 141578 275651 103170 ...
##  $ name               : Factor w/ 233701 levels "!!!!!!!!!!!!!!!!!!!!!!!!__GOLF_3_CABRIO___!!!!!!!!!!!!!!!!!!!!!!!",..: 79317 4177 90686 76703 170641 27425 145443 188508 64303 190556 ...
##  $ seller             : Factor w/ 2 levels "gewerblich","privat": 2 2 2 2 2 2 2 2 2 2 ...
##  $ offerType          : Factor w/ 2 levels "Angebot","Gesuch": 1 1 1 1 1 1 1 1 1 1 ...
##  $ price              : int  480 18300 9800 1500 3600 650 2200 0 14500 999 ...
##  $ abtest             : Factor w/ 2 levels "control","test": 2 2 2 2 2 2 2 2 1 2 ...
##  $ vehicleType        : Factor w/ 9 levels "","andere","bus",..: 1 5 9 6 6 8 4 8 3 6 ...
##  $ yearOfRegistration : int  1993 2011 2004 2001 2008 1995 2004 1980 2014 1998 ...
##  $ gearbox            : Factor w/ 3 levels "","automatik",..: 3 3 2 3 3 3 3 3 3 3 ...
##  $ powerPS            : int  0 190 163 75 69 102 109 50 125 101 ...
##  $ model              : Factor w/ 252 levels "","100","145",..: 120 1 121 120 105 13 10 42 63 120 ...
##  $ kilometer          : int  150000 125000 125000 150000 90000 150000 150000 40000 30000 150000 ...
##  $ monthOfRegistration: int  0 5 8 6 7 10 8 7 8 0 ...
##  $ fuelType           : Factor w/ 8 levels "","andere","benzin",..: 3 5 5 3 5 3 3 3 3 1 ...
##  $ brand              : Factor w/ 40 levels "alfa_romeo","audi",..: 39 2 15 39 32 3 26 39 11 39 ...
##  $ notRepairedDamage  : Factor w/ 3 levels "","ja","nein": 1 2 1 3 3 2 3 3 1 1 ...
##  $ dateCreated        : Factor w/ 114 levels "2014-03-10 00:00:00",..: 100 100 90 93 107 111 108 97 111 93 ...
##  $ nrOfPictures       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ postalCode         : int  70435 66954 90480 91074 60437 33775 67112 19348 94505 27472 ...
##  $ lastSeen           : Factor w/ 182904 levels "2016-03-05 14:15:08",..: 178938 178454 163054 45304 172012 175601 165721 89623 157785 126444 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Cleaning Empty Data and Convert Date

auto$offerType <- NULL
auto$name <- NULL
auto$seller <- NULL
auto$nrOfPictures <- NULL
auto$abtest <- NULL

auto$dateCrawled <- ymd_hms(auto$dateCrawled) 
auto$dateCreated <- ymd_hms(auto$dateCreated)
auto$lastSeen <- ymd_hms(auto$lastSeen)

Price Cleaning

We transformed the price by correcting outliers with quantile(). After experimenting with the limits, we decided that “quantile(autoprice,0.05),quantile(autoprice, 0.95)” is the best of the both worlds and the data cleaned accordingly. Then, we started to explore what might affect the second-hand cars’ prices.

summary(auto$price)
quantile(auto$price, 0.90)
quantile(auto$price, 0.99)
quantile(auto$price, 0.01)
quantile(auto$price, 0.05)
quantile(auto$price, 0.10)

ggplot(aes(x=vehicleType, y=price), data = auto) + 
  geom_boxplot() +
  ylim(quantile(auto$price, 0.05), quantile(auto$price, 0.95))


p1 <- ggplot(aes(x="price", y=price), data = auto) + 
  geom_boxplot()

p2 <- ggplot(aes(x="price", y=price), data = auto) + 
  geom_boxplot() +
  ylim(0, quantile(auto$price, 0.99))

p3 <- ggplot(aes(x="price", y=price), data = auto) + 
  geom_boxplot() +
  ylim(0, quantile(auto$price, 0.95))

p4 <- ggplot(aes(x="price", y=price), data = auto) + 
  geom_boxplot() +
  ylim(0, quantile(auto$price, 0.90))

library(gridExtra)
grid.arrange(p1, p2, p3, p4, ncol = 2)
auto <- auto[(price > quantile(auto$price, 0.05)) & (price < quantile(auto$price, 0.95))]
ggplot(aes(x=vehicleType, y=price), data = auto) + 
  geom_boxplot()

Engine Power Cleaning

We observed that the engine power (PowerPS) column included extreme outliers, therefore after our first look we corrected them with quantile(). The sample value of engine power is set as NA when the value is lower than 40.

summary(auto$powerPS)

p1 <- ggplot(aes(x=vehicleType, y=powerPS), data = auto) + 
  geom_boxplot()
p2 <- ggplot(aes(x=vehicleType, y=powerPS), data = auto) + 
  geom_boxplot() +
  ylim(quantile(auto$powerPS, 0.05), quantile(auto$powerPS, 0.95))

grid.arrange(p1, p2, ncol = 1)
auto[(powerPS < quantile(powerPS, 0.05)) | (powerPS > quantile(powerPS, 0.95)), powerPS := NA]
auto[powerPS  < 40, powerPS := NA]
ggplot(aes(x=vehicleType, y=powerPS), data = auto) + 
  geom_boxplot()

Vehicle Type Cleaning by Removing Empty Cells

Vehicle Type is one of the fundamental categorical variables that has an impact on price (e.g. SUV vs. kleinwagen). Therefore, all the blank rows (vehicleType==“”) are cleaned.

summary(auto$vehicleType)
auto <- auto[vehicleType != ""]
summary(auto$vehicleType)

Year of Registration Cleaning

After visualizing and exploring the raw version of ‘Year of Registration’, we decided the limits to be 1975 and 2016.

summary(auto$yearOfRegistration)
ggplot(aes(x="yearOfRegistration", y=yearOfRegistration), data = auto) + 
  geom_boxplot()

ggplot(aes(x=vehicleType, y=yearOfRegistration), data = auto) + 
  geom_boxplot() +
  ylim(1975, 2016)

ggplot(aes(x=yearOfRegistration, y=price, alpha = 1/100), data = auto) + 
  geom_point() +
  facet_wrap(~vehicleType) +
  xlim(1975, 2016)+
  geom_smooth()
auto <- auto[(yearOfRegistration >= 1975) & (yearOfRegistration < 2016)]

Month of Registration Cleaning

We did not clean our dataset according to every variable, rather used data=subset() whenever necessary. The month of registration (month in which the second-hand ad was put on the website) is not analyzed in depth as well. Did not possess strong relationship with any of the fundamental variables, most importantly price.

summary(auto$monthOfRegistration)
table(auto$monthOfRegistration, auto$yearOfRegistration)
table(auto$monthOfRegistration)
auto[monthOfRegistration == 0, monthOfRegistration := NA]
table(auto$monthOfRegistration)
str(auto$monthOfRegistration)

Fuel Type Cleaning

The blank fuel type rows are set as NA, after all we expected fuel type to have considerable impact on second-hand car prices.

summary(auto$fuelType)
table(auto$vehicleType, auto$fuelType)
auto[fuelType == "", fuelType := NA]
table(auto$vehicleType, auto$fuelType)

Brand Checking to Clean

When we first observed the ‘brand’ variable, we saw that almost 40 different brands exist. This abundance motivated us to focus on vehicle type and a cluster of most popular brands. Later on, we developed auto_subbrand dataset for analysis that focuses on top ten brands. We will come to this shortly.

summary(auto$brand)
table(auto$vehicleType, auto$fuelType)

Gearbox Cleaning

summary(auto$gearbox)
auto[gearbox == "", gearbox := NA]
summary(auto$gearbox)

Model Cleaning

The Model variable is not used much because of the extreme number of combination involving both brand and vehicle type.

summary(auto$model)
auto[model == "", model := NA]
summary(auto$model)

Kilometer Checking to Clean

The concentration on 150.000 kilometer attracted our attention but we kept it all at first.

summary(auto$kilometer)
ggplot(aes(x=vehicleType, y=kilometer), data = auto) + 
  geom_boxplot()

Not Repaired Damage Checking to Clean

NotRepairedDamaged variable is not utilized at all due to its limited structure (Yes/No). It was not possible to measure the degree of the damage on the price.

summary(auto$notRepairedDamage)
table(auto$notRepairedDamage)
auto[notRepairedDamage == "", notRepairedDamage := NA]
table(auto$notRepairedDamage)
str(auto$notRepairedDamage)

Selling Time Variable Creating and Cleaning

We added a new variable as ‘Selling Time’ into the dataset. ‘Selling Time’ is derived from ‘Date Created’ and ‘Last Seen’ by using as.factor().

auto$sellingTime <- as.integer(as.Date(auto$lastSeen) - as.Date(auto$dateCreated))
str(auto$sellingTime)
summary(auto$sellingTime)
table(auto$sellingTime)
max(auto$sellingTime)
min(auto$sellingTime)
auto[sellingTime  > 35, sellingTime := NA]
table(max(auto$sellingTime) - auto$sellingTime)

Age Variable Creating by Using Year of Registration

We added a new variable as ‘Age’ into the dataset. ‘Age’ is derived from ‘Year of Registration’ by using as.factor().

auto$age <- as.factor(year(today()) - auto$yearOfRegistration)

Postal Code Cleaning

The postal code dataset is standardized by using clean.zipcodes.

str(auto$postalCode)
auto$postalCode <- as.factor(auto$postalCode)
str(auto$postalCode)
summary(auto$postalCode)
auto$postalCode = clean.zipcodes(auto$postalCode)
str(auto$postalCode)
summary(auto$postalCode)

Exploratory Data Analysis

Univariate Analysis

Using Libraries

library(data.table)
library(ggplot2)
library(lubridate)
library(dplyr)
library(gridExtra)

Vehicle Type Frequency Diagram

Limousine, kombi and kleinwagen are the most popular vehicle types in the second-hand market according to our dataset.

ggplot(auto, aes(x=vehicleType)) + 
  geom_bar(fill='darkgreen', color='black') +
  scale_fill_brewer(type= 'div') +
  labs(x= 'Vehicle Type', y= 'number of cars') +
  ggtitle('Vehicle Type Frequency Diagram')

Engine Power Histogram

The mean and median of the PowerPS is around 105.

ggplot(auto, aes(auto$powerPS)) +
  geom_histogram(fill= I('#F79420'), color='black', binwidth=15) +
  labs(x= 'engine power', y= 'number of cars') +
  ggtitle('Histogram of Engine Power (PowerPS)')
## Warning: Removed 38011 rows containing non-finite values (stat_bin).

Selling Time Histogram

We noticed that majority of the second-hand cars are sold only within 35 days. The ratio of the first 10 days (day 0 stands for same day sale) is quite high. This shows us that either Ebay-Kleinanzeigen is very successful at targeting customers or the second-hand market is more fluid that we actually thought.

#Selling Time Histogram
ggplot(data=auto, aes(auto$sellingTime)) + 
  geom_histogram(breaks=seq(0, 35, by = 5), 
                 col="red", 
                 fill="green", 
                 alpha = .2) + 
  labs(title="Histogram for Selling Time") +
  labs(x="Selling Time", y="Count")
## Warning: Removed 294 rows containing non-finite values (stat_bin).

Gearbox Frequency Diagram

The number of cars with manual gearbox is higher than the automatic ones. This is not surprising considering the ages and kilometers of the second-hand cars.

#Gearbox
ggplot(aes(x= gearbox), data=subset(auto, !is.na(gearbox))) +
  geom_bar(color='black', fill='orange') +
  labs(x= 'Gearbox', y='Number of Cars', title='Gearbox')

Kilometer Histogram

The concentration on the 100.000+ km, particularly 150.000 km, is interesting, even that scale_y_log10() becomes necessary to observe the distribution.

ggplot(aes(auto$kilometer), data=auto) +
  geom_bar(color='black', fill='orange') +
  scale_y_log10()

  labs(x= 'Kilometer', y='Number of Cars', title='Kilometer Histogram')
## $x
## [1] "Kilometer"
## 
## $y
## [1] "Number of Cars"
## 
## $title
## [1] "Kilometer Histogram"
## 
## attr(,"class")
## [1] "labels"

Age Histogram

Age variable shows normal distribution with mean and median of approximately 14 years.

ggplot(aes(as.integer(auto$age)), data=auto) +
  geom_histogram(color='black', fill='brown') +
  scale_x_continuous(limit=c(0, 35), breaks=seq(0, 35, 2)) +
  labs(x= 'Car Age', y= 'Number of Cars', title= 'Car Age Histogram')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1214 rows containing non-finite values (stat_bin).

Fuel Type Frequency Diagram

Majority of the second hand cars’ fuel type are benzin or diesel (as expected) but hybrid, elektro, cng stands for emergence of new trends in the second-hand car market.

ggplot(aes(x=fuelType), data=subset(auto, !is.na(fuelType))) +
  geom_bar(aes(fill=fuelType), color='black') +
  labs(x= 'Fuel Type', y='Number of Cars', title= 'Fuel Type Frequency Diagram')