set.seed(503)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.2
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'ggplot2' was built under R version 3.4.2
## Warning: package 'readr' was built under R version 3.4.2
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(knitr)
## Warning: package 'knitr' was built under R version 3.4.2
library(dplyr)
library(ggplot2)
diamonds_test <- diamonds %>% mutate(diamond_id = row_number()) %>%
group_by(cut, color, clarity) %>% sample_frac(0.2) %>% ungroup()
diamonds_train <- anti_join(diamonds %>% mutate(diamond_id = row_number()),
diamonds_test, by = "diamond_id")
ggplot(aes(x=carat,y=price),data=diamonds_train) + geom_point() + geom_smooth() +
ggtitle('Price and Carat Relationship')
## `geom_smooth()` using method = 'gam'
In chart aboved mentioned, there is a direct relationship between carat and price, as carat increases, prices also increases.
Also we have to look at the other varibales such as cut,color and clarity as well as carat.
ggplot(aes(x=carat,y=price),data=diamonds_train) + geom_point(aes(color=cut)) + scale_color_brewer(type = 'div',
guide = guide_legend(reverse = T)) +
ggtitle('Price,Cut and Carat Relationship')
In chart aboved mentioned, Cut of a diamond changes the price. if the cut of a diamond is ideal and premium, price will increase. In other words, if the cut of a dimamond is good and fair, price will decrease.
ggplot(aes(x=carat,y=price),data=diamonds_train) + geom_point(aes(color=color)) + scale_color_brewer(type = 'div',
guide = guide_legend(reverse = T)) +
ggtitle('Price,Color and Carat Relationship')
In chart aboved mentioned, color of a diamond changes the price. if the color of a diamond is D and E, price will increase. In other words, if the coloer of a dimamond is J and I, price will decrease.
ggplot(aes(x=carat,y=price),data=diamonds_train) + geom_point(aes(color=clarity)) + scale_color_brewer(type = 'div',
guide = guide_legend(reverse = T)) +
ggtitle('Price,Clarity and Carat Relationship')
In chart aboved mentioned, clarity of a diamond changes the price. if the clarity of a diamond is D and E, price will increase. In other words, if the clarity of a dimamond is J and I, price will decrease.
The model is creating.
model <- lm(log10(price) ~log10(carat) + cut+color+clarity, data = diamonds_train)
summary(model)
##
## Call:
## lm(formula = log10(price) ~ log10(carat) + cut + color + clarity,
## data = diamonds_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.43975 -0.03739 -0.00014 0.03618 0.84540
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6729978 0.0005670 6477.833 < 2e-16 ***
## log10(carat) 1.8843104 0.0012603 1495.166 < 2e-16 ***
## cut.L 0.0522537 0.0011416 45.773 < 2e-16 ***
## cut.Q -0.0152879 0.0010048 -15.214 < 2e-16 ***
## cut.C 0.0060017 0.0008723 6.881 6.04e-12 ***
## cut^4 -0.0009283 0.0006984 -1.329 0.184
## color.L -0.1913851 0.0009832 -194.661 < 2e-16 ***
## color.Q -0.0413600 0.0009030 -45.801 < 2e-16 ***
## color.C -0.0060990 0.0008449 -7.219 5.34e-13 ***
## color^4 0.0049564 0.0007759 6.388 1.70e-10 ***
## color^5 -0.0007820 0.0007332 -1.067 0.286
## color^6 0.0006871 0.0006663 1.031 0.302
## clarity.L 0.3973321 0.0017360 228.883 < 2e-16 ***
## clarity.Q -0.1059695 0.0016162 -65.566 < 2e-16 ***
## clarity.C 0.0566246 0.0013851 40.882 < 2e-16 ***
## clarity^4 -0.0282896 0.0011074 -25.546 < 2e-16 ***
## clarity^5 0.0119068 0.0009037 13.176 < 2e-16 ***
## clarity^6 -0.0005438 0.0007869 -0.691 0.490
## clarity^7 0.0146968 0.0006941 21.175 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05799 on 43124 degrees of freedom
## Multiple R-squared: 0.9827, Adjusted R-squared: 0.9827
## F-statistic: 1.361e+05 on 18 and 43124 DF, p-value: < 2.2e-16
Model mentioned above explains %98 of the variance. It has highly significant overall model p-value.
Since the model has smaller average, 0.00014, the model is better. In additon, p value is lower than 0.05, so it is significant.
diamonds_test_yeni<-diamonds_test %>%
select(carat,cut,color,clarity)
estimate <- predict(model,diamonds_test,
level = .95)
predict_result<- round(10^estimate)
result<- data.frame(diamonds_test %>% select(carat,cut,color,clarity,price),predict_result)
result<-result %>% mutate(PriceDifference=abs(round(price-predict_result)), PriceDifferencePercentage=abs(round((price/predict_result-1)*100,2))) %>% arrange(PriceDifferencePercentage)
kable(result %>%filter(row_number()<=20) )
carat | cut | color | clarity | price | predict_result | PriceDifference | PriceDifferencePercentage |
---|---|---|---|---|---|---|---|
0.33 | Good | G | VS1 | 666 | 666 | 0 | 0 |
0.32 | Very Good | F | VS1 | 696 | 696 | 0 | 0 |
0.23 | Very Good | F | VS1 | 373 | 373 | 0 | 0 |
1.53 | Very Good | G | SI2 | 8455 | 8455 | 0 | 0 |
0.32 | Very Good | H | VVS1 | 730 | 730 | 0 | 0 |
0.32 | Very Good | H | VVS1 | 730 | 730 | 0 | 0 |
0.31 | Premium | F | VS2 | 625 | 625 | 0 | 0 |
0.31 | Premium | F | VS2 | 625 | 625 | 0 | 0 |
0.53 | Premium | G | SI2 | 1173 | 1173 | 0 | 0 |
0.70 | Premium | H | SI2 | 1808 | 1808 | 0 | 0 |
0.32 | Ideal | E | VS1 | 758 | 758 | 0 | 0 |
0.33 | Ideal | E | VVS1 | 986 | 986 | 0 | 0 |
0.70 | Ideal | F | SI1 | 2551 | 2551 | 0 | 0 |
0.31 | Ideal | G | SI2 | 436 | 436 | 0 | 0 |
0.31 | Ideal | G | VS1 | 642 | 642 | 0 | 0 |
0.54 | Ideal | G | VVS2 | 2090 | 2090 | 0 | 0 |
0.32 | Ideal | G | IF | 918 | 918 | 0 | 0 |
0.32 | Ideal | G | IF | 918 | 918 | 0 | 0 |
0.32 | Ideal | G | IF | 918 | 918 | 0 | 0 |
0.32 | Ideal | G | IF | 918 | 918 | 0 | 0 |
resultRowNumber <- nrow(result) -20
kable(result %>%filter(row_number()>=resultRowNumber) )
carat | cut | color | clarity | price | predict_result | PriceDifference | PriceDifferencePercentage |
---|---|---|---|---|---|---|---|
1.01 | Ideal | H | SI2 | 5828 | 3686 | 2142 | 58.11 |
1.00 | Fair | D | VVS2 | 10562 | 6673 | 3889 | 58.28 |
0.26 | Very Good | D | SI1 | 658 | 415 | 243 | 58.55 |
0.96 | Fair | J | I1 | 2304 | 1438 | 866 | 60.22 |
0.23 | Very Good | D | SI2 | 449 | 279 | 170 | 60.93 |
0.48 | Very Good | J | SI2 | 1080 | 670 | 410 | 61.19 |
1.03 | Good | H | SI2 | 5768 | 3530 | 2238 | 63.40 |
0.23 | Ideal | H | VS2 | 512 | 311 | 201 | 64.63 |
1.50 | Very Good | G | VS2 | 18691 | 11165 | 7526 | 67.41 |
1.00 | Ideal | D | VVS1 | 14498 | 8403 | 6095 | 72.53 |
1.09 | Very Good | D | IF | 18231 | 10388 | 7843 | 75.50 |
1.01 | Premium | D | IF | 16234 | 9203 | 7031 | 76.40 |
1.01 | Very Good | D | IF | 16087 | 8998 | 7089 | 78.78 |
1.03 | Ideal | D | IF | 17590 | 9758 | 7832 | 80.26 |
1.00 | Good | D | IF | 15928 | 8517 | 7411 | 87.01 |
0.34 | Ideal | D | IF | 2346 | 1209 | 1137 | 94.04 |
0.34 | Ideal | D | IF | 2346 | 1209 | 1137 | 94.04 |
0.35 | Fair | G | VS2 | 1415 | 640 | 775 | 121.09 |
0.35 | Fair | G | VS2 | 1415 | 640 | 775 | 121.09 |
0.61 | Good | F | SI2 | 3807 | 1539 | 2268 | 147.37 |
0.25 | Premium | G | SI2 | 1186 | 285 | 901 | 316.14 |
result %>% summarise(AverageDifference=round(mean(PriceDifferencePercentage),2) )
## AverageDifference
## 1 10.6
Reference :
https://rstudio-pubs-static.s3.amazonaws.com/94067_d1fdfafd20b14725a2578647031760c2.html https://classroom.udacity.com/courses/ud651/lessons/755209509/concepts/8624987730923