The aim of this study is to predict the price of a diamond given its properties. The diamonds data set from ggplot2 is used, so the tidyverse collection is loaded.
library("tidyverse")
## Warning: package 'tidyverse' was built under R version 3.4.3
## Warning: package 'ggplot2' was built under R version 3.4.3
library("corrplot")
## Warning: package 'corrplot' was built under R version 3.4.3
library("ggcorrplot")
library("dummies")
library("pls")
## Warning: package 'pls' was built under R version 3.4.3
The structure of the data can be seen below.
data <- diamonds %>% tbl_df()
glimpse(data)
## Observations: 53,940
## Variables: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, ...
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very G...
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, ...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI...
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, ...
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54...
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339,...
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, ...
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, ...
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, ...
There are 53,940 diamonds with 10 features.
We need to check whether the data set contains any NA/NaN values.
which(is.na.data.frame(data))
## integer(0)
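A per-column count of missing values is often more informative than which(). A minimal base-R sketch on a toy data frame (the toy_df object is invented here purely for illustration):

```r
# Toy data frame standing in for the diamonds data (illustrative only)
toy_df <- data.frame(
  carat = c(0.23, NA, 0.31),
  price = c(326, 327, NA)
)

# Count missing values per column; all zeros would mean nothing to impute
colSums(is.na(toy_df))   # carat = 1, price = 1

# A single overall count is also handy
sum(is.na(toy_df))       # 2
```

On the real data, `colSums(is.na(data))` would report zeros for every column, consistent with the empty result above.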
Since there are no NA/NaN values, nothing needs to be omitted or imputed, so we proceed to descriptive statistics.
summary(data)
## carat cut color clarity
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066
## Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
## depth table price x
## Min. :43.00 Min. :43.00 Min. : 326 Min. : 0.000
## 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710
## Median :61.80 Median :57.00 Median : 2401 Median : 5.700
## Mean :61.75 Mean :57.46 Mean : 3933 Mean : 5.731
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540
## Max. :79.00 Max. :95.00 Max. :18823 Max. :10.740
##
## y z
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 5.710 Median : 3.530
## Mean : 5.735 Mean : 3.539
## 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :58.900 Max. :31.800
##
The lightest diamond is 0.2 carats and the heaviest is 5.01 carats. The majority of the diamonds are cut to "Very Good" quality or better. The color distribution peaks at G. Prices range from $326 to $18,823, with a mean of $3,933.
Cut, color and clarity are the categorical variables of this set, so they must be converted to numeric values before their contribution to the price can be checked by means of a correlation matrix.
dataNumeric <- data %>%
  mutate(colorNum = ifelse(color == "D", 0,
                    ifelse(color == "E", 1,
                    ifelse(color == "F", 2,
                    ifelse(color == "G", 3,
                    ifelse(color == "H", 4,
                    ifelse(color == "I", 5, 6)))))),
         cutNum = ifelse(cut == "Fair", 0,
                  ifelse(cut == "Good", 1,
                  ifelse(cut == "Very Good", 2,
                  ifelse(cut == "Premium", 3, 4)))),
         clarityNum = ifelse(clarity == "I1", 0,
                      ifelse(clarity == "SI2", 1,
                      ifelse(clarity == "SI1", 2,
                      ifelse(clarity == "VS2", 3,
                      ifelse(clarity == "VS1", 4,
                      ifelse(clarity == "VVS2", 5,
                      ifelse(clarity == "VVS1", 6, 7)))))))) %>%
  select(-color, -cut, -clarity)
summary(dataNumeric)
## carat depth table price
## Min. :0.2000 Min. :43.00 Min. :43.00 Min. : 326
## 1st Qu.:0.4000 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950
## Median :0.7000 Median :61.80 Median :57.00 Median : 2401
## Mean :0.7979 Mean :61.75 Mean :57.46 Mean : 3933
## 3rd Qu.:1.0400 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324
## Max. :5.0100 Max. :79.00 Max. :95.00 Max. :18823
## x y z colorNum
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. :0.000
## 1st Qu.: 4.710 1st Qu.: 4.720 1st Qu.: 2.910 1st Qu.:1.000
## Median : 5.700 Median : 5.710 Median : 3.530 Median :3.000
## Mean : 5.731 Mean : 5.735 Mean : 3.539 Mean :2.594
## 3rd Qu.: 6.540 3rd Qu.: 6.540 3rd Qu.: 4.040 3rd Qu.:4.000
## Max. :10.740 Max. :58.900 Max. :31.800 Max. :6.000
## cutNum clarityNum
## Min. :0.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:2.000
## Median :3.000 Median :3.000
## Mean :2.904 Mean :3.051
## 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :4.000 Max. :7.000
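Since cut, color and clarity are ordered factors whose level order matches the rankings used above, the nested ifelse() chains can be shortened: as.integer() on an ordered factor returns the level index, so subtracting 1 reproduces the same 0-based codes. A minimal sketch on a hand-built factor:

```r
# Ordered factor mirroring the clarity ranking used in the ifelse() chain
clarity <- factor(c("SI2", "I1", "IF", "VS1"),
                  levels = c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"),
                  ordered = TRUE)

# Level index minus 1 gives the same 0-based encoding as the ifelse() chain
as.integer(clarity) - 1
# -> 1 0 7 4
```

On the real data, `mutate(clarityNum = as.integer(clarity) - 1)` (and likewise for cut and color) would give identical results with far less code.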
Let's look at the correlation matrix.
CorrMat <- round(cor(dataNumeric),3)
ggcorrplot(CorrMat, hc.order = TRUE,
#type = "lower",
lab = TRUE,
lab_size = 3,
method="circle",
colors = c("tomato2", "white", "springgreen3"),
title="Correlogram of Numeric Values",
ggtheme=theme_bw)
As you can see, price is most highly correlated with carat, and the x, y and z dimensions also contribute a lot. Moreover, since carat itself is highly correlated with the dimensional variables, carat and the dimensions probably carry the same information about the diamond of interest.
Principal component analysis (PCA) is run to decide which factors the linear regression formula should be built upon.
survey_pca <- princomp(as.matrix(dataNumeric))  # covariance PCA on unscaled data; a cor = TRUE argument would have to go to princomp(), not as.matrix()
summary(survey_pca,loadings=TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 3989.4031024 2.354466e+00 1.712871e+00 1.695341e+00
## Proportion of Variance 0.9999991 3.483121e-07 1.843458e-07 1.805917e-07
## Cumulative Proportion 0.9999991 9.999994e-01 9.999996e-01 9.999998e-01
## Comp.5 Comp.6 Comp.7 Comp.8
## Standard deviation 1.393179e+00 8.255399e-01 6.339003e-01 1.925948e-01
## Proportion of Variance 1.219545e-07 4.282130e-08 2.524792e-08 2.330627e-09
## Cumulative Proportion 9.999999e-01 1.000000e+00 1.000000e+00 1.000000e+00
## Comp.9 Comp.10
## Standard deviation 1.238470e-01 7.244221e-02
## Proportion of Variance 9.637291e-10 3.297366e-10
## Cumulative Proportion 1.000000e+00 1.000000e+00
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## carat -0.140 0.143 0.225
## depth 0.217 -0.410 0.773 -0.416
## table -0.922 0.134 0.174 -0.305
## price -1.000
## x -0.177 -0.105 -0.549 0.485 0.564
## y -0.177 -0.111 -0.646 -0.721
## z -0.133 -0.368 0.471 -0.788
## colorNum -0.431 0.880 -0.122 0.159
## cutNum 0.230 0.176 -0.432 -0.847 -0.105
## clarityNum 0.203 0.715 0.458 0.383 -0.292
## Comp.10
## carat 0.950
## depth
## table
## price
## x -0.306
## y
## z
## colorNum
## cutNum
## clarityNum
The first principal component alone explains more than 99% of the variance in the data. Furthermore, its only large loading is on price, so it is just the price! Thus 99% of the variation in this (unscaled) data set is explained by the price, simply because price has by far the largest variance of any column. Since we are interested in estimating the price, we need factors other than the price itself, so the price column is omitted and a new PCA is run.
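The dominance of a single component here is a scale effect of covariance-based PCA: the column with the largest variance swallows the first component. A small synthetic demonstration, assuming nothing about the diamonds data:

```r
# Two independent predictors, one measured on a much larger scale
set.seed(1)
small <- rnorm(1000, sd = 1)        # e.g. a carat-like scale
large <- rnorm(1000, sd = 1000)     # e.g. a price-like scale

pca_cov <- princomp(cbind(small, large))              # covariance PCA
pca_cor <- princomp(cbind(small, large), cor = TRUE)  # correlation PCA

# Proportion of variance on Comp.1: near 1 for covariance PCA,
# close to 0.5 for correlation PCA on independent columns
pca_cov$sdev[1]^2 / sum(pca_cov$sdev^2)
pca_cor$sdev[1]^2 / sum(pca_cor$sdev^2)
```

Passing cor = TRUE (or standardizing the columns first) would put all variables on an equal footing, which is what the pcr() model below does with scale = TRUE.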
dataNum_wo_price <- dataNumeric %>% select(-price)
survey_pca_wo_price <- princomp(as.matrix(dataNum_wo_price))  # again a covariance PCA on unscaled data
summary(survey_pca_wo_price,loadings=TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 2.4553795 1.9827814 1.6963646 1.4440054 1.23206243
## Proportion of Variance 0.3508399 0.2287818 0.1674596 0.1213415 0.08833588
## Cumulative Proportion 0.3508399 0.5796216 0.7470813 0.8684227 0.95675862
## Comp.6 Comp.7 Comp.8 Comp.9
## Standard deviation 0.8238808 0.198185745 0.1293055421 0.0910484339
## Proportion of Variance 0.0395003 0.002285686 0.0009729846 0.0004824104
## Cumulative Proportion 0.9962589 0.998544605 0.9995175896 1.0000000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## carat -0.166 -0.150 -0.229 -0.402 0.849
## depth 0.164 -0.266 0.183 0.787 -0.259 -0.423
## table -0.826 0.426 -0.116 0.170 -0.304
## x -0.244 -0.389 -0.219 -0.352 -0.450 -0.405 -0.500
## y -0.242 -0.394 -0.229 -0.368 0.769
## z -0.139 -0.256 -0.238 -0.390 0.818 0.165
## colorNum -0.118 -0.468 -0.751 0.186 0.411
## cutNum 0.217 -0.450 0.125 -0.853
## clarityNum 0.285 0.359 -0.616 -0.639
When we omit the price column, we see that the first four components explain 87% of the variance and are composed of pretty much everything else left in the set. Including all the factors in the linear regression formula does not make sense, so maybe we should stick to the findings from the correlation matrix: in other words, build a linear model based on carat, possibly including the dimensional variables x, y and z as interaction terms.
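The carat-based model suggested above can be sketched with lm(). The data frame here is a synthetic stand-in with made-up coefficients, only to show the formula; on the real data one would fit the same formula on the diamonds columns:

```r
# Synthetic stand-in for the diamonds columns of interest (illustrative only)
set.seed(42)
fake <- data.frame(carat = runif(500, 0.2, 2))
fake$x     <- 4 + 3 * fake$carat + rnorm(500, sd = 0.1)    # size grows with mass
fake$price <- 2000 * fake$carat + rnorm(500, sd = 300)     # price driven by carat

# Main effect of carat plus a carat:x interaction, as proposed in the text
fit <- lm(price ~ carat + carat:x, data = fake)
summary(fit)$r.squared
```

The interaction term lets the effect of carat vary with the x dimension, which is one way to use the dimensional variables without adding them as collinear main effects.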
Alternatively, we can run principal component regression (PCR), which is based on PCA. This model automatically computes the principal components and then uses them as predictors in a linear model.
First, the data set must be split into two parts: one for training and one for testing.
set.seed(42)
diamonds_test <- data %>% mutate(diamond_id = row_number()) %>%
group_by(cut, color, clarity) %>% sample_frac(0.2) %>% ungroup()
diamonds_train <- anti_join(data %>% mutate(diamond_id = row_number()),
diamonds_test, by = "diamond_id")
Inside the model, the data is standardized to prevent the algorithm from being skewed towards predictors that dominate in absolute scale. 10-fold cross-validation is performed.
pcr_model <- pcr(price~.-diamond_id, data = diamonds_train, scale = TRUE, validation = "CV")
summary(pcr_model, loadings = TRUE)
## Data: X dimension: 43143 23
## Y dimension: 43143 1
## Fit method: svdpc
## Number of components considered: 23
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4006 2377 2112 1973 1964 1964 1946
## adjCV 4006 2377 2112 1973 1964 1963 1946
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1936 1849 1849 1810 1808 1708 1621
## adjCV 1937 1849 1849 1810 1808 1708 1621
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 1621 1607 1569 1565 1563 1503
## adjCV 1621 1607 1569 1565 1563 1502
## 20 comps 21 comps 22 comps 23 comps
## CV 1495 1191 1167 1146
## adjCV 1495 1167 1167 1145
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 19.62 29.09 36.90 44.23 50.48 56.48 61.58
## price 64.82 72.22 75.77 75.99 76.01 76.44 76.66
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 66.65 71.25 75.71 79.95 83.62 86.87 89.88
## price 78.74 78.74 79.62 79.67 81.85 83.66 83.67
## 15 comps 16 comps 17 comps 18 comps 19 comps 20 comps
## X 92.47 94.46 96.24 97.94 98.85 99.56
## price 83.95 84.70 84.78 84.81 85.97 86.11
## 21 comps 22 comps 23 comps
## X 99.76 99.94 100.00
## price 91.16 91.31 91.94
We draw the R2 plot to check the quality of the fit:
# Plot the R2
validationplot(pcr_model, val.type = "R2")
More components than the number of variables in the diamonds data set are required to explain 90% of the variance of the price. This is because each categorical variable is expanded into several columns (23 predictor columns in total).
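The column expansion can be inspected with model.matrix(): an ordered factor with k levels contributes k - 1 polynomial-contrast columns, so cut, color and clarity add 4 + 6 + 7 = 17 columns on top of the 6 numeric ones, matching the 23 predictors above. A minimal sketch with stand-in factors of the same sizes:

```r
# Ordered factors with the same numbers of levels as cut (5), color (7), clarity (8)
set.seed(1)
f <- data.frame(
  cut     = factor(sample(letters[1:5], 20, TRUE), levels = letters[1:5], ordered = TRUE),
  color   = factor(sample(letters[1:7], 20, TRUE), levels = letters[1:7], ordered = TRUE),
  clarity = factor(sample(letters[1:8], 20, TRUE), levels = letters[1:8], ordered = TRUE)
)

# Default contrasts for ordered factors are polynomial: k - 1 columns each
mm <- model.matrix(~ cut + color + clarity, data = f)
ncol(mm)   # 1 intercept + 4 + 6 + 7 contrast columns = 18
```

Dropping the intercept column leaves the 17 contrast columns that, together with the 6 numeric variables, make up the 23 predictors seen by pcr().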
When we fit the model for only numerical variables we get:
pcr_model <- pcr(price~.-diamond_id -cut -clarity -color, data = diamonds_train, scale = TRUE, validation = "CV")
summary(pcr_model, loadings = TRUE)
## Data: X dimension: 43143 6
## Y dimension: 43143 1
## Fit method: svdpc
## Number of components considered: 6
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4006 1808 1799 1767 1559 1545 1548
## adjCV 4006 1808 1799 1767 1543 1548 1544
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## X 65.54 86.91 98.28 99.07 99.77 100.00
## price 79.65 79.86 80.57 85.13 85.19 85.93
We draw the R2 plot again to check the quality of the fit:
# Plot the R2
validationplot(pcr_model, val.type = "R2")
Now far fewer components are needed: three components already explain more than 98% of the predictor variance, although the cross-validated error keeps improving up to five components. The dimensionality reduction worked out pretty well.
The predicted-versus-measured plot is below:
predplot(pcr_model)
The regression coefficients plot is as follows:
coefplot(pcr_model)
Testing the model with one component:
pcr_pred <- predict(pcr_model, diamonds_test, ncomp = 1)
mean((pcr_pred - diamonds_test$price)^2)  # select price by name; diamonds_test[, 6] would be the table column
The resulting error is very high, and increasing the number of components included in the model did not solve the issue.
Testing the model with six components:
pcr_pred <- predict(pcr_model, diamonds_test, ncomp = 6)
mean((pcr_pred - diamonds_test$price)^2)
The PCR model does not work well on the diamonds data set, maybe because the categorical variables were left out of this model, and apparently they are important predictors of the price. I should have tried a CART model, as recommended.
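A CART fit is straightforward with rpart (shipped with R as a recommended package). A minimal sketch on a synthetic stand-in; on the real data the call would be something like rpart(price ~ carat + cut + color + clarity, data = diamonds_train):

```r
library(rpart)

# Synthetic stand-in for the diamonds data (illustrative only)
set.seed(7)
toy <- data.frame(
  carat = runif(300, 0.2, 2),
  cut   = factor(sample(c("Fair", "Good", "Ideal"), 300, TRUE))
)
toy$price <- 2000 * toy$carat + 500 * (toy$cut == "Ideal") + rnorm(300, sd = 200)

# CART handles the factor directly; no numeric recoding needed
tree <- rpart(price ~ carat + cut, data = toy)
mean((predict(tree, toy) - toy$price)^2)  # in-sample MSE
```

Unlike pcr(), which works on a numeric design matrix, rpart splits directly on factor levels, so the categorical predictors that seem to matter for price need no recoding at all.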