Assignment 3: Diamonds Price Estimation

MEF - BDA 503

Devrim Nesipoglu

4 Ocak 2018

Diamonds dataset

This dataset contains the prices and other attributes for around 50.000 diamons. The features included are the following:

library(rpart)
library(rpart.plot)
library(tidyverse)
## -- Attaching packages -------------------------------------------------- tidyverse 1.2.1 --
## <U+221A> ggplot2 2.2.1     <U+221A> purrr   0.2.4
## <U+221A> tibble  1.3.4     <U+221A> dplyr   0.7.4
## <U+221A> tidyr   0.7.2     <U+221A> stringr 1.2.0
## <U+221A> readr   1.1.1     <U+221A> forcats 0.2.0
## -- Conflicts ----------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2) # Data visualization
library(readr) # CSV file I/O, e.g. the read_csv function
library(ggvis)
## 
## Attaching package: 'ggvis'
## The following object is masked from 'package:ggplot2':
## 
##     resolution
library(dplyr)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift



Exploratory Data Analysis

General information about Diamonds dataset:

data(diamonds)
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
dim(diamonds)
## [1] 53940    10
##result=>53940 observation  , 10 variable

summary(diamonds)
##      carat               cut        color        clarity     
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655  
##                                     J: 2808   (Other): 2531  
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##                                                                  
##        y                z         
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.710   Median : 3.530  
##  Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :58.900   Max.   :31.800  
## 
##result=>cut ,colar,clarity =>factors
df <- diamonds[,c(1,2,3,4,5,6,7)]
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  7 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
head(df)
## # A tibble: 6 x 7
##   carat       cut color clarity depth table price
##   <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int>
## 1  0.23     Ideal     E     SI2  61.5    55   326
## 2  0.21   Premium     E     SI1  59.8    61   326
## 3  0.23      Good     E     VS1  56.9    65   327
## 4  0.29   Premium     I     VS2  62.4    58   334
## 5  0.31      Good     J     SI2  63.3    58   335
## 6  0.24 Very Good     J    VVS2  62.8    57   336

Data Visualization

ggplot2

ggplot2 is a plotting system for R, based on Leland Wilkinson’s “The grammar of graphics”.

qplot

The qplot function is a convenient wrapper for creating a number of different types of plots using a consistent scheme. It is recommended for simple scatterplots. This function creates a ggplot object which will receive the actual plot.

geom

An important parameter is geom. It specifies the geometric objects that define the graph type. Geom values include:

using the pipe operator:

diamonds %>% qplot(carat, price, data = .)

To color these points, we just need to tell the function that the colors to be used depend on the cut feature, for example, by using the color argument.

Create a scatterplot for comparing price based on diamond cut and color

theme_set(theme_bw())
ggplot(diamonds,aes(x=carat,y=price))+geom_jitter(aes(color=color,shape=cut))+
labs(title="comparison of carat with price based on diamond cut and color")

ggplot(diamonds, aes(x=carat, y=depth, color=clarity)) + geom_point(alpha=0.3) + geom_smooth() +theme_light()+
labs(title="Comparing diamond depth with carat and clarity")
## `geom_smooth()` using method = 'gam'

ggplot(diamonds,aes(x=cut,fill=color))+geom_bar()

ggplot(diamonds,aes(x=price))+geom_density(aes(fill=factor(cut)),alpha=0.7)+labs(title="Price grouped by cut",x="Price",fill="Cut")

ggplot(diamonds,aes(x=clarity,y=depth))+geom_violin(aes(fill=cut))+theme_classic()+labs(title="How Cut impacts depth and clarity")

Predicting diamonds Price with Regression model CART on entire dataset.

Regression Trees are part of the CART family of techniques for prediction of a numerical target feature. rpart package with its CART algorithms is used.

m1 <- rpart(price ~ ., data=df, method ="anova")
m1
## n= 53940 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 53940 858473100000  3932.800  
##    2) carat< 0.995 34880  43459420000  1632.641  
##      4) carat< 0.625 24787   6616641000  1052.009 *
##      5) carat>=0.625 10093   7963754000  3058.593 *
##    3) carat>=0.995 19060 292761600000  8142.115  
##      6) carat< 1.495 12825  60730000000  6139.890  
##       12) clarity=I1,SI2,SI1,VS2 9709  18423680000  5371.052 *
##       13) clarity=VS1,VVS2,VVS1,IF 3116  18685020000  8535.476 *
##      7) carat>=1.495 6235  74861710000 12260.560  
##       14) carat< 1.915 4051  35816530000 10872.790  
##         28) clarity=I1,SI2 987   3684287000  8265.010 *
##         29) clarity=SI1,VS2,VS1,VVS2,VVS1,IF 3064  23258010000 11712.820  
##           58) color=H,I,J 1614   5850799000 10040.000 *
##           59) color=D,E,F,G 1450   7863362000 13574.850 *
##       15) carat>=1.915 2184  16771810000 14834.690 *
rpart.plot(m1, type=3, digits=3, fallen.leaves = TRUE)

Build model on train set and predict on test

#
p1 <- predict(m1, df)
Mean_Absolute_Error <- function(actual, predicted) {mean(abs(actual - predicted))}

Mean_Absolute_Error
## function(actual, predicted) {mean(abs(actual - predicted))}
Mean_Absolute_Error(df$price, p1)
## [1] 833.0197

References

R - Regression Trees - CART video of Jalayer Academy https://www.youtube.com/watch?v=MoBw5PiW56k&t=321s

For Data Visualization https://www.kaggle.com/devisangeetha/data-visualization