Exploratory Data Analysis

The aim of this study is to predict the price of a diamond given its properties. The diamonds data set shipped with ggplot2 is used, so the tidyverse collection is loaded.

library("tidyverse")
## Warning: package 'tidyverse' was built under R version 3.4.3
## Warning: package 'ggplot2' was built under R version 3.4.3
library("corrplot")
## Warning: package 'corrplot' was built under R version 3.4.3
library("ggcorrplot")
library("dummies")
library("pls")
## Warning: package 'pls' was built under R version 3.4.3

The structure of the data can be seen below.

data <- diamonds %>% tbl_df()
glimpse(data)
## Observations: 53,940
## Variables: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, ...
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very G...
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, ...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, ...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, ...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, ...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, ...

There are 53940 diamonds with 10 features.

We need to check whether the data set contains any NA/NaN values.

which(is.na.data.frame(data))
## integer(0)

Since there aren’t any NA/NaN values, there is no need to omit or impute them, so we proceed to the summary statistics.

summary(data)
##      carat               cut        color        clarity     
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655  
##                                     J: 2808   (Other): 2531  
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##                                                                  
##        y                z         
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.710   Median : 3.530  
##  Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :58.900   Max.   :31.800  
## 

The lightest diamond is 0.2 carats and the heaviest is 5.01 carats. The majority of the diamonds have a cut quality of "Very Good" or better. The color distribution peaks at G. The price spans from $326 to $18,823 with a mean of $3,933.


Cut, color and clarity are categorical variables in this set, so they must be converted to numerical values before their contribution to the price can be checked by means of a correlation matrix.

dataNumeric <- data %>%
  mutate(colorNum   = as.integer(color)   - 1,     # D = 0, E = 1, ..., J = 6
         cutNum     = as.integer(cut)     - 1,     # Fair = 0, ..., Ideal = 4
         clarityNum = as.integer(clarity) - 1) %>% # I1 = 0, ..., IF = 7
  select(-color, -cut, -clarity)
summary(dataNumeric)
##      carat            depth           table           price      
##  Min.   :0.2000   Min.   :43.00   Min.   :43.00   Min.   :  326  
##  1st Qu.:0.4000   1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950  
##  Median :0.7000   Median :61.80   Median :57.00   Median : 2401  
##  Mean   :0.7979   Mean   :61.75   Mean   :57.46   Mean   : 3933  
##  3rd Qu.:1.0400   3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324  
##  Max.   :5.0100   Max.   :79.00   Max.   :95.00   Max.   :18823  
##        x                y                z             colorNum    
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   :0.000  
##  1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910   1st Qu.:1.000  
##  Median : 5.700   Median : 5.710   Median : 3.530   Median :3.000  
##  Mean   : 5.731   Mean   : 5.735   Mean   : 3.539   Mean   :2.594  
##  3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040   3rd Qu.:4.000  
##  Max.   :10.740   Max.   :58.900   Max.   :31.800   Max.   :6.000  
##      cutNum        clarityNum   
##  Min.   :0.000   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:2.000  
##  Median :3.000   Median :3.000  
##  Mean   :2.904   Mean   :3.051  
##  3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :4.000   Max.   :7.000

Let’s examine the correlation matrix.

CorrMat <- round(cor(dataNumeric),3)
ggcorrplot(CorrMat, hc.order = TRUE, 
           #type = "lower", 
           lab = TRUE, 
           lab_size = 3, 
           method="circle", 
           colors = c("tomato2", "white", "springgreen3"), 
           title="Correlogram of Numeric Values", 
           ggtheme=theme_bw)

As the correlogram shows, price is most strongly correlated with carat, and the x, y and z dimensions also contribute a lot. Moreover, carat itself is highly correlated with the dimensional variables, so carat and the dimensions probably carry the same information about the diamond of interest.
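The redundancy between carat and the dimensions is no surprise: a diamond’s mass is roughly proportional to its volume x·y·z. A quick synthetic sanity check of this (all constants below are made up for illustration, not taken from the data):

```r
# Synthetic sanity check: generate dimensions for round-ish stones, let
# carat be proportional to volume, and confirm the strong correlations.
# All constants are illustrative only.
set.seed(7)
n <- 1000
x <- runif(n, 3.5, 9)                                  # length in mm
y <- x * (1 + rnorm(n, 0, 0.01))                       # width tracks length
z <- 0.62 * x * (1 + rnorm(n, 0, 0.02))                # depth roughly proportional
carat <- 0.006 * x * y * z * (1 + rnorm(n, 0, 0.02))   # mass ~ volume

cor(carat, x)          # very high: carat grows with each dimension
cor(carat, x * y * z)  # nearly 1: carat is essentially the volume
```

This is why, in a linear model, carat and the raw dimensions largely compete to explain the same variance.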

Principal Component Analysis

Principal component analysis is run to decide which factors the linear regression formula should be built upon.

# Note: the cor = T argument is consumed by as.matrix() and silently ignored,
# so this PCA runs on the raw covariance matrix of the unstandardised data.
survey_pca <- princomp(as.matrix(dataNumeric, cor = T))
summary(survey_pca,loadings=TRUE)
## Importance of components:
##                              Comp.1       Comp.2       Comp.3       Comp.4
## Standard deviation     3989.4031024 2.354466e+00 1.712871e+00 1.695341e+00
## Proportion of Variance    0.9999991 3.483121e-07 1.843458e-07 1.805917e-07
## Cumulative Proportion     0.9999991 9.999994e-01 9.999996e-01 9.999998e-01
##                              Comp.5       Comp.6       Comp.7       Comp.8
## Standard deviation     1.393179e+00 8.255399e-01 6.339003e-01 1.925948e-01
## Proportion of Variance 1.219545e-07 4.282130e-08 2.524792e-08 2.330627e-09
## Cumulative Proportion  9.999999e-01 1.000000e+00 1.000000e+00 1.000000e+00
##                              Comp.9      Comp.10
## Standard deviation     1.238470e-01 7.244221e-02
## Proportion of Variance 9.637291e-10 3.297366e-10
## Cumulative Proportion  1.000000e+00 1.000000e+00
## 
## Loadings:
##            Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## carat                                                -0.140  0.143  0.225
## depth              0.217 -0.410         0.773 -0.416                     
## table             -0.922  0.134         0.174 -0.305                     
## price      -1.000                                                        
## x                        -0.177        -0.105        -0.549  0.485  0.564
## y                        -0.177        -0.111        -0.646 -0.721       
## z                        -0.133                      -0.368  0.471 -0.788
## colorNum                 -0.431  0.880 -0.122         0.159              
## cutNum             0.230  0.176        -0.432 -0.847 -0.105              
## clarityNum         0.203  0.715  0.458  0.383        -0.292              
##            Comp.10
## carat       0.950 
## depth             
## table             
## price             
## x          -0.306 
## y                 
## z                 
## colorNum          
## cutNum            
## clarityNum

The first principal component alone explains over 99% of the variance in the data. Furthermore, it is one-dimensional and is actually just the price! Thus, 99% of the variation in this data set can be attributed to the price. Since we are interested in estimating the price, we need factors other than the price itself, so the price column is omitted and a new PCA is run.
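The reason one component dominates is that princomp() here ran on the covariance matrix (the cor = T argument was passed to as.matrix(), where it is ignored), and price, measured in dollars, has a variance orders of magnitude larger than any other column. A small synthetic sketch of this scale effect using base-R prcomp() (all numbers made up):

```r
# One column with a much larger variance dominates covariance-based PCA,
# while correlation-based (scaled) PCA treats all columns equally.
set.seed(1)
small <- matrix(rnorm(300), ncol = 3)   # three unit-variance columns
big   <- rnorm(100, sd = 1000)          # one column with huge variance
toy   <- cbind(small, big)

pca_cov <- prcomp(toy, scale. = FALSE)  # PCA on the covariance matrix
pca_cor <- prcomp(toy, scale. = TRUE)   # PCA on the correlation matrix

var_share <- function(p) p$sdev[1]^2 / sum(p$sdev^2)
var_share(pca_cov)   # nearly 1: PC1 is essentially the high-variance column
var_share(pca_cor)   # far below 1: no column dominates after scaling
```

Passing cor = TRUE to princomp() itself (or scale. = TRUE to prcomp()) would standardise the variables first and avoid this effect.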

dataNum_wo_price <- dataNumeric %>% select(-price)
# As above, cor = T is ignored inside as.matrix(), so this is again a
# covariance-matrix PCA on unstandardised data.
survey_pca_wo_price <- princomp(as.matrix(dataNum_wo_price, cor = T))
summary(survey_pca_wo_price,loadings=TRUE)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
## Standard deviation     2.4553795 1.9827814 1.6963646 1.4440054 1.23206243
## Proportion of Variance 0.3508399 0.2287818 0.1674596 0.1213415 0.08833588
## Cumulative Proportion  0.3508399 0.5796216 0.7470813 0.8684227 0.95675862
##                           Comp.6      Comp.7       Comp.8       Comp.9
## Standard deviation     0.8238808 0.198185745 0.1293055421 0.0910484339
## Proportion of Variance 0.0395003 0.002285686 0.0009729846 0.0004824104
## Cumulative Proportion  0.9962589 0.998544605 0.9995175896 1.0000000000
## 
## Loadings:
##            Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## carat             -0.166               -0.150        -0.229 -0.402  0.849
## depth       0.164 -0.266  0.183  0.787 -0.259 -0.423                     
## table      -0.826  0.426 -0.116  0.170        -0.304                     
## x          -0.244 -0.389        -0.219 -0.352        -0.450 -0.405 -0.500
## y          -0.242 -0.394        -0.229 -0.368         0.769              
## z          -0.139 -0.256               -0.238        -0.390  0.818  0.165
## colorNum   -0.118 -0.468 -0.751  0.186  0.411                            
## cutNum      0.217               -0.450  0.125 -0.853                     
## clarityNum  0.285  0.359 -0.616        -0.639

When we omit the price column, the first four components explain 87% of the variance, and they are composed of pretty much everything else left in the set. Including all the factors in the linear regression formula does not make sense, so perhaps we should stick to the findings from the correlation matrix. In other words, we could build a linear model based primarily on carat, perhaps including the dimensional variables x, y and z as interaction terms.
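As a rough illustration of that suggestion, the formula price ~ carat * (x + y + z) encodes a carat main effect plus carat-by-dimension interaction terms. Below it is fitted on synthetic data (all constants and noise levels are made up), not on the real diamonds set:

```r
# Illustrative fit of the suggested formula on synthetic data.
set.seed(3)
n <- 500
x <- runif(n, 4, 9)                       # length in mm
y <- x * (1 + rnorm(n, 0, 0.01))          # width, close to length
z <- 0.62 * x * (1 + rnorm(n, 0, 0.02))   # depth, roughly proportional
carat <- 0.006 * x * y * z                # mass proportional to volume
price <- 2000 * carat + rnorm(n, sd = 200)
d     <- data.frame(price, carat, x, y, z)

# carat main effect plus carat:x, carat:y, carat:z interactions
fit <- lm(price ~ carat * (x + y + z), data = d)
summary(fit)$r.squared
```

On the real data, the interaction terms would only be worth keeping if they improve the fit beyond carat alone, given how collinear carat and the dimensions are.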

Principal Component Regression

Alternatively, we can run principal component regression (PCR), which builds on PCA: the model automatically computes the principal components and then uses them as predictors in a linear model.
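Conceptually, PCR boils down to two steps: compute the principal components of the standardised predictors, then regress the response on the first k score vectors. A minimal base-R sketch on the built-in mtcars data (a stand-in for diamonds, with an arbitrary choice of k):

```r
# Minimal sketch of principal component regression in base R.
X <- scale(mtcars[, c("disp", "hp", "wt", "qsec")])   # standardised predictors
y <- mtcars$mpg                                       # response

pca    <- prcomp(X)                      # principal components of X
k      <- 2                              # number of components kept (arbitrary)
scores <- pca$x[, 1:k, drop = FALSE]     # first k score vectors

pcr_fit <- lm(y ~ scores)                # regression on component scores
summary(pcr_fit)$r.squared               # fit quality using only k components
```

The pls package wraps exactly this pipeline and adds cross-validation to help choose k.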

First, the data set must be split into two parts: one for training and one for testing.

set.seed(42)
diamonds_test <- data %>% mutate(diamond_id = row_number()) %>% 
    group_by(cut, color, clarity) %>% sample_frac(0.2) %>% ungroup()

diamonds_train <- anti_join(data %>% mutate(diamond_id = row_number()), 
    diamonds_test, by = "diamond_id")

Training the model

Inside the model, the data is standardized to prevent the algorithm from being skewed towards predictors that dominate in absolute scale. 10-fold cross-validation is performed.
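For intuition, the 10-fold cross-validation that validation = "CV" triggers inside pcr() can be hand-rolled in a few lines of base R; the data and model below are synthetic stand-ins, not the diamonds set:

```r
# Hand-rolled 10-fold cross-validation: split the rows into 10 folds,
# fit on 9 folds, predict the held-out fold, and average the per-fold RMSEP.
set.seed(42)
df   <- data.frame(x = runif(200))
df$y <- 3 * df$x + rnorm(200)                     # true noise sd = 1
fold <- sample(rep(1:10, length.out = nrow(df)))  # random fold labels

rmsep <- sapply(1:10, function(k) {
  fit  <- lm(y ~ x, data = df[fold != k, ])          # train on 9 folds
  pred <- predict(fit, newdata = df[fold == k, ])    # predict held-out fold
  sqrt(mean((pred - df$y[fold == k])^2))
})
mean(rmsep)   # cross-validated RMSEP; should sit near the noise sd of 1
```

pcr() reports this CV error for every possible number of components, which is what the RMSEP table below shows.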

pcr_model <- pcr(price~.-diamond_id, data = diamonds_train, scale = TRUE, validation = "CV")
summary(pcr_model, loadings = TRUE)
## Data:    X dimension: 43143 23 
##  Y dimension: 43143 1
## Fit method: svdpc
## Number of components considered: 23
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4006     2377     2112     1973     1964     1964     1946
## adjCV         4006     2377     2112     1973     1964     1963     1946
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1936     1849     1849      1810      1808      1708      1621
## adjCV     1937     1849     1849      1810      1808      1708      1621
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV         1621      1607      1569      1565      1563      1503
## adjCV      1621      1607      1569      1565      1563      1502
##        20 comps  21 comps  22 comps  23 comps
## CV         1495      1191      1167      1146
## adjCV      1495      1167      1167      1145
## 
## TRAINING: % variance explained
##        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X        19.62    29.09    36.90    44.23    50.48    56.48    61.58
## price    64.82    72.22    75.77    75.99    76.01    76.44    76.66
##        8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X        66.65    71.25     75.71     79.95     83.62     86.87     89.88
## price    78.74    78.74     79.62     79.67     81.85     83.66     83.67
##        15 comps  16 comps  17 comps  18 comps  19 comps  20 comps
## X         92.47     94.46     96.24     97.94     98.85     99.56
## price     83.95     84.70     84.78     84.81     85.97     86.11
##        21 comps  22 comps  23 comps
## X         99.76     99.94    100.00
## price     91.16     91.31     91.94

We draw the R2 plot to check the quality of the fit.

# Plot the R2
validationplot(pcr_model, val.type = "R2")

It required more components than the number of variables in the diamonds data set to explain 90% of the variance of the price. This is because each categorical variable is expanded into several contrast (dummy) columns, giving 23 predictor columns in total.
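This expansion is done by model.matrix() under the hood: the 6 numeric predictors plus cut (5 ordered levels, 4 contrast columns), color (7 levels, 6 columns) and clarity (8 levels, 7 columns) give 6 + 4 + 6 + 7 = 23 columns. A small base-R illustration with an ordered factor mimicking cut (the data rows here are made up):

```r
# model.matrix() turns an ordered 5-level factor into 4 polynomial-contrast
# columns (cut.L, cut.Q, cut.C, cut^4).
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
d <- data.frame(
  carat = c(0.23, 0.21, 0.23, 0.29, 0.31),
  cut   = factor(cut_levels, levels = cut_levels, ordered = TRUE)
)
mm <- model.matrix(~ carat + cut, data = d)
colnames(mm)   # intercept, carat, and the 4 contrast columns for cut
ncol(mm) - 1   # 5 predictor columns from just 2 original variables
```

So the 23 predictors seen by pcr() are not 23 independent measurements, which is why so many components are needed.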

When we fit the model for only numerical variables we get:

pcr_model <- pcr(price~.-diamond_id -cut -clarity -color, data = diamonds_train, scale = TRUE, validation = "CV")
summary(pcr_model, loadings = TRUE)
## Data:    X dimension: 43143 6 
##  Y dimension: 43143 1
## Fit method: svdpc
## Number of components considered: 6
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4006     1808     1799     1767     1559     1545     1548
## adjCV         4006     1808     1799     1767     1543     1548     1544
## 
## TRAINING: % variance explained
##        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## X        65.54    86.91    98.28    99.07    99.77   100.00
## price    79.65    79.86    80.57    85.13    85.19    85.93

We draw the R2 plot again to check the quality of the fit.

# Plot the R2
validationplot(pcr_model, val.type = "R2")

Now a single component already explains roughly 80% of the variability in price, and three components capture over 98% of the predictor variance, although the CV error is lowest with five components. Dimensionality reduction has occurred quite effectively.

The plot of predicted versus measured values is below:

predplot(pcr_model)

Regression coefficients plot is as follows:

coefplot(pcr_model)

Testing the model with one component:

pcr_pred <- predict(pcr_model, diamonds_test, ncomp = 1)
# Caution: column 6 of diamonds_test is table, not price (column 7),
# so this compares the predictions against the wrong variable.
mean((pcr_pred - diamonds_test[,6])^2)
## [1] 27120673

This is a very high error value, and increasing the number of components included in the model did not reduce it. Note, however, that diamonds_test[,6] selects the table column rather than price (column 7), so the MSE above is computed against the wrong variable; the comparison should use diamonds_test$price.

Testing the model with six components:

pcr_pred <- predict(pcr_model, diamonds_test, ncomp = 6)
# Same caveat: diamonds_test[,6] is the table column, not price.
mean((pcr_pred - diamonds_test[,6])^2)
## [1] 27868697

The PCR model does not work well on the diamonds data set, partly because the categorical variables could not be included in the final model even though they are apparently important predictors of the price, and partly because the test error was computed against diamonds_test[,6] (the table column) rather than the price column, which inflates the reported MSE. I should have tried a CART model as recommended.