AHMET ÖZMEN




The library and data are loading

set.seed(503)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.2
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'ggplot2' was built under R version 3.4.2
## Warning: package 'readr' was built under R version 3.4.2
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
library(knitr)
## Warning: package 'knitr' was built under R version 3.4.2
library(dplyr)
library(ggplot2)
diamonds_test <- diamonds %>% mutate(diamond_id = row_number()) %>%
group_by(cut, color, clarity) %>% sample_frac(0.2) %>% ungroup()
diamonds_train <- anti_join(diamonds %>% mutate(diamond_id = row_number()),
diamonds_test, by = "diamond_id")




Looking at which variables affect price

    ggplot(aes(x=carat,y=price),data=diamonds_train) +   geom_point() + geom_smooth() +
   ggtitle('Price and Carat Relationship')
## `geom_smooth()` using method = 'gam'


In chart aboved mentioned, there is a direct relationship between carat and price, as carat increases, prices also increases.



Also we have to look at the other varibales such as cut,color and clarity as well as carat.

    ggplot(aes(x=carat,y=price),data=diamonds_train) +   geom_point(aes(color=cut)) + scale_color_brewer(type = 'div',
    guide = guide_legend(reverse = T))  +
   ggtitle('Price,Cut and Carat Relationship')


In chart aboved mentioned, Cut of a diamond changes the price. if the cut of a diamond is ideal and premium, price will increase. In other words, if the cut of a dimamond is good and fair, price will decrease.



    ggplot(aes(x=carat,y=price),data=diamonds_train) +   geom_point(aes(color=color)) + scale_color_brewer(type = 'div',
    guide = guide_legend(reverse = T))  +
   ggtitle('Price,Color and Carat Relationship')


In chart aboved mentioned, color of a diamond changes the price. if the color of a diamond is D and E, price will increase. In other words, if the coloer of a dimamond is J and I, price will decrease.



  ggplot(aes(x=carat,y=price),data=diamonds_train) +   geom_point(aes(color=clarity)) + scale_color_brewer(type = 'div',
    guide = guide_legend(reverse = T))  +
   ggtitle('Price,Clarity and Carat Relationship')


In chart aboved mentioned, clarity of a diamond changes the price. if the clarity of a diamond is D and E, price will increase. In other words, if the clarity of a dimamond is J and I, price will decrease.




Linear Model


The model is creating.

model <- lm(log10(price) ~log10(carat) + cut+color+clarity, data = diamonds_train)

summary(model)
## 
## Call:
## lm(formula = log10(price) ~ log10(carat) + cut + color + clarity, 
##     data = diamonds_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.43975 -0.03739 -0.00014  0.03618  0.84540 
## 
## Coefficients:
##                Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)   3.6729978  0.0005670 6477.833  < 2e-16 ***
## log10(carat)  1.8843104  0.0012603 1495.166  < 2e-16 ***
## cut.L         0.0522537  0.0011416   45.773  < 2e-16 ***
## cut.Q        -0.0152879  0.0010048  -15.214  < 2e-16 ***
## cut.C         0.0060017  0.0008723    6.881 6.04e-12 ***
## cut^4        -0.0009283  0.0006984   -1.329    0.184    
## color.L      -0.1913851  0.0009832 -194.661  < 2e-16 ***
## color.Q      -0.0413600  0.0009030  -45.801  < 2e-16 ***
## color.C      -0.0060990  0.0008449   -7.219 5.34e-13 ***
## color^4       0.0049564  0.0007759    6.388 1.70e-10 ***
## color^5      -0.0007820  0.0007332   -1.067    0.286    
## color^6       0.0006871  0.0006663    1.031    0.302    
## clarity.L     0.3973321  0.0017360  228.883  < 2e-16 ***
## clarity.Q    -0.1059695  0.0016162  -65.566  < 2e-16 ***
## clarity.C     0.0566246  0.0013851   40.882  < 2e-16 ***
## clarity^4    -0.0282896  0.0011074  -25.546  < 2e-16 ***
## clarity^5     0.0119068  0.0009037   13.176  < 2e-16 ***
## clarity^6    -0.0005438  0.0007869   -0.691    0.490    
## clarity^7     0.0146968  0.0006941   21.175  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05799 on 43124 degrees of freedom
## Multiple R-squared:  0.9827, Adjusted R-squared:  0.9827 
## F-statistic: 1.361e+05 on 18 and 43124 DF,  p-value: < 2.2e-16

Model Interpreting

Model mentioned above explains %98 of the variance. It has highly significant overall model p-value.
Since the model has smaller average, 0.00014, the model is better. In additon, p value is lower than 0.05, so it is significant.



Model Testing

diamonds_test_yeni<-diamonds_test %>% 
  select(carat,cut,color,clarity)

estimate <- predict(model,diamonds_test,
                    level = .95)

predict_result<- round(10^estimate)
result<- data.frame(diamonds_test %>% select(carat,cut,color,clarity,price),predict_result)

result<-result %>% mutate(PriceDifference=abs(round(price-predict_result)), PriceDifferencePercentage=abs(round((price/predict_result-1)*100,2))) %>% arrange(PriceDifferencePercentage)


The top 20 best predicted diamonds

kable(result %>%filter(row_number()<=20)  )
carat cut color clarity price predict_result PriceDifference PriceDifferencePercentage
0.33 Good G VS1 666 666 0 0
0.32 Very Good F VS1 696 696 0 0
0.23 Very Good F VS1 373 373 0 0
1.53 Very Good G SI2 8455 8455 0 0
0.32 Very Good H VVS1 730 730 0 0
0.32 Very Good H VVS1 730 730 0 0
0.31 Premium F VS2 625 625 0 0
0.31 Premium F VS2 625 625 0 0
0.53 Premium G SI2 1173 1173 0 0
0.70 Premium H SI2 1808 1808 0 0
0.32 Ideal E VS1 758 758 0 0
0.33 Ideal E VVS1 986 986 0 0
0.70 Ideal F SI1 2551 2551 0 0
0.31 Ideal G SI2 436 436 0 0
0.31 Ideal G VS1 642 642 0 0
0.54 Ideal G VVS2 2090 2090 0 0
0.32 Ideal G IF 918 918 0 0
0.32 Ideal G IF 918 918 0 0
0.32 Ideal G IF 918 918 0 0
0.32 Ideal G IF 918 918 0 0

The top 20 worst predicted diamonds

resultRowNumber <- nrow(result) -20

kable(result %>%filter(row_number()>=resultRowNumber)  )
carat cut color clarity price predict_result PriceDifference PriceDifferencePercentage
1.01 Ideal H SI2 5828 3686 2142 58.11
1.00 Fair D VVS2 10562 6673 3889 58.28
0.26 Very Good D SI1 658 415 243 58.55
0.96 Fair J I1 2304 1438 866 60.22
0.23 Very Good D SI2 449 279 170 60.93
0.48 Very Good J SI2 1080 670 410 61.19
1.03 Good H SI2 5768 3530 2238 63.40
0.23 Ideal H VS2 512 311 201 64.63
1.50 Very Good G VS2 18691 11165 7526 67.41
1.00 Ideal D VVS1 14498 8403 6095 72.53
1.09 Very Good D IF 18231 10388 7843 75.50
1.01 Premium D IF 16234 9203 7031 76.40
1.01 Very Good D IF 16087 8998 7089 78.78
1.03 Ideal D IF 17590 9758 7832 80.26
1.00 Good D IF 15928 8517 7411 87.01
0.34 Ideal D IF 2346 1209 1137 94.04
0.34 Ideal D IF 2346 1209 1137 94.04
0.35 Fair G VS2 1415 640 775 121.09
0.35 Fair G VS2 1415 640 775 121.09
0.61 Good F SI2 3807 1539 2268 147.37
0.25 Premium G SI2 1186 285 901 316.14

Average of the Price Difference %

result %>% summarise(AverageDifference=round(mean(PriceDifferencePercentage),2) )
##   AverageDifference
## 1              10.6

Reference :
https://rstudio-pubs-static.s3.amazonaws.com/94067_d1fdfafd20b14725a2578647031760c2.html https://classroom.udacity.com/courses/ud651/lessons/755209509/concepts/8624987730923