Devrim Nesipoglu
4 Ocak 2018
This dataset contains the prices and other attributes for around 50.000 diamons. The features included are the following:
library(rpart)
library(rpart.plot)
library(tidyverse)
## -- Attaching packages -------------------------------------------------- tidyverse 1.2.1 --
## <U+221A> ggplot2 2.2.1 <U+221A> purrr 0.2.4
## <U+221A> tibble 1.3.4 <U+221A> dplyr 0.7.4
## <U+221A> tidyr 0.7.2 <U+221A> stringr 1.2.0
## <U+221A> readr 1.1.1 <U+221A> forcats 0.2.0
## -- Conflicts ----------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2) # Data visualization
library(readr) # CSV file I/O, e.g. the read_csv function
library(ggvis)
##
## Attaching package: 'ggvis'
## The following object is masked from 'package:ggplot2':
##
## resolution
library(dplyr)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
General information about Diamonds dataset:
data(diamonds)
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
dim(diamonds)
## [1] 53940 10
##result=>53940 observation , 10 variable
summary(diamonds)
## carat cut color clarity
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066
## Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
## depth table price x
## Min. :43.00 Min. :43.00 Min. : 326 Min. : 0.000
## 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710
## Median :61.80 Median :57.00 Median : 2401 Median : 5.700
## Mean :61.75 Mean :57.46 Mean : 3933 Mean : 5.731
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540
## Max. :79.00 Max. :95.00 Max. :18823 Max. :10.740
##
## y z
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 5.710 Median : 3.530
## Mean : 5.735 Mean : 3.539
## 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :58.900 Max. :31.800
##
##result=>cut ,colar,clarity =>factors
df <- diamonds[,c(1,2,3,4,5,6,7)]
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 7 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
head(df)
## # A tibble: 6 x 7
## carat cut color clarity depth table price
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int>
## 1 0.23 Ideal E SI2 61.5 55 326
## 2 0.21 Premium E SI1 59.8 61 326
## 3 0.23 Good E VS1 56.9 65 327
## 4 0.29 Premium I VS2 62.4 58 334
## 5 0.31 Good J SI2 63.3 58 335
## 6 0.24 Very Good J VVS2 62.8 57 336
ggplot2 is a plotting system for R, based on Leland Wilkinson’s “The grammar of graphics”.
The qplot function is a convenient wrapper for creating a number of different types of plots using a consistent scheme. It is recommended for simple scatterplots. This function creates a ggplot object which will receive the actual plot.
An important parameter is geom. It specifies the geometric objects that define the graph type. Geom values include:
using the pipe operator:
diamonds %>% qplot(carat, price, data = .)
To color these points, we just need to tell the function that the colors to be used depend on the cut feature, for example, by using the color argument.
theme_set(theme_bw())
ggplot(diamonds,aes(x=carat,y=price))+geom_jitter(aes(color=color,shape=cut))+
labs(title="comparison of carat with price based on diamond cut and color")
ggplot(diamonds, aes(x=carat, y=depth, color=clarity)) + geom_point(alpha=0.3) + geom_smooth() +theme_light()+
labs(title="Comparing diamond depth with carat and clarity")
## `geom_smooth()` using method = 'gam'
ggplot(diamonds,aes(x=cut,fill=color))+geom_bar()
ggplot(diamonds,aes(x=price))+geom_density(aes(fill=factor(cut)),alpha=0.7)+labs(title="Price grouped by cut",x="Price",fill="Cut")
ggplot(diamonds,aes(x=clarity,y=depth))+geom_violin(aes(fill=cut))+theme_classic()+labs(title="How Cut impacts depth and clarity")
Regression Trees are part of the CART family of techniques for prediction of a numerical target feature. rpart package with its CART algorithms is used.
m1 <- rpart(price ~ ., data=df, method ="anova")
m1
## n= 53940
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 53940 858473100000 3932.800
## 2) carat< 0.995 34880 43459420000 1632.641
## 4) carat< 0.625 24787 6616641000 1052.009 *
## 5) carat>=0.625 10093 7963754000 3058.593 *
## 3) carat>=0.995 19060 292761600000 8142.115
## 6) carat< 1.495 12825 60730000000 6139.890
## 12) clarity=I1,SI2,SI1,VS2 9709 18423680000 5371.052 *
## 13) clarity=VS1,VVS2,VVS1,IF 3116 18685020000 8535.476 *
## 7) carat>=1.495 6235 74861710000 12260.560
## 14) carat< 1.915 4051 35816530000 10872.790
## 28) clarity=I1,SI2 987 3684287000 8265.010 *
## 29) clarity=SI1,VS2,VS1,VVS2,VVS1,IF 3064 23258010000 11712.820
## 58) color=H,I,J 1614 5850799000 10040.000 *
## 59) color=D,E,F,G 1450 7863362000 13574.850 *
## 15) carat>=1.915 2184 16771810000 14834.690 *
rpart.plot(m1, type=3, digits=3, fallen.leaves = TRUE)
#
p1 <- predict(m1, df)
Mean_Absolute_Error <- function(actual, predicted) {mean(abs(actual - predicted))}
Mean_Absolute_Error
## function(actual, predicted) {mean(abs(actual - predicted))}
Mean_Absolute_Error(df$price, p1)
## [1] 833.0197
R - Regression Trees - CART video of Jalayer Academy https://www.youtube.com/watch?v=MoBw5PiW56k&t=321s
For Data Visualization https://www.kaggle.com/devisangeetha/data-visualization