DIAMONDS

First, we load the necessary libraries:

set.seed(503)
library(tidyverse)
library(dplyr)
library(ggplot2)
library(psych)
getwd()
## [1] "C:/Users/merye/Documents/GitHub/pj-MeryemKemerci"
data<-read.csv("C:\\Users\\merye\\Desktop\\diamonds.csv")
data %>%
  as_tibble()
## # A tibble: 53,940 x 11
##        X carat       cut  color clarity depth table price     x     y
##    <int> <dbl>    <fctr> <fctr>  <fctr> <dbl> <dbl> <int> <dbl> <dbl>
##  1     1  0.23     Ideal      E     SI2  61.5    55   326  3.95  3.98
##  2     2  0.21   Premium      E     SI1  59.8    61   326  3.89  3.84
##  3     3  0.23      Good      E     VS1  56.9    65   327  4.05  4.07
##  4     4  0.29   Premium      I     VS2  62.4    58   334  4.20  4.23
##  5     5  0.31      Good      J     SI2  63.3    58   335  4.34  4.35
##  6     6  0.24 Very Good      J    VVS2  62.8    57   336  3.94  3.96
##  7     7  0.24 Very Good      I    VVS1  62.3    57   336  3.95  3.98
##  8     8  0.26 Very Good      H     SI1  61.9    55   337  4.07  4.11
##  9     9  0.22      Fair      E     VS2  65.1    61   337  3.87  3.78
## 10    10  0.23 Very Good      H     VS1  59.4    61   338  4.00  4.05
## # ... with 53,930 more rows, and 1 more variables: z <dbl>

Content

1. price: price in US dollars ($326–$18,823)

2. carat: weight of the diamond (0.2–5.01)

3. cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)

4. color: diamond color, from J (worst) to D (best)

5. clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

6. x: length in mm (0–10.74)

7. y: width in mm (0–58.9)

8. z: depth in mm (0–31.8)

9. depth: total depth percentage = 100 * z / mean(x, y) = 100 * 2 * z / (x + y) (43–79)

10. table: width of the top of the diamond relative to its widest point (43–95)
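
As a quick sanity check, the depth formula in the list above can be applied to a couple of the rows printed earlier. This is a minimal sketch in base R; the values are copied by hand from rows 2 and 3 of the tibble output, and the variable names `reported_depth` and `computed_depth` are illustrative only.

```r
# Measurements copied from rows 2 and 3 of the tibble printed above.
x <- c(3.89, 4.05)   # length in mm
y <- c(3.84, 4.07)   # width in mm
z <- c(2.31, 2.31)   # depth in mm
reported_depth <- c(59.8, 56.9)   # depth column for the same rows

# Total depth percentage: z divided by the mean of x and y, times 100.
computed_depth <- 100 * 2 * z / (x + y)
round(computed_depth, 1)
## [1] 59.8 56.9
```

The agreement is only approximate in general, because x, y and z are themselves rounded to two decimals: row 1, for example, gives about 61.3 against a reported depth of 61.5.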

1. We check the structure of the data:

glimpse(data)
## Observations: 53,940
## Variables: 11
## $ X       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, ...
## $ cut     <fctr> Ideal, Premium, Good, Premium, Good, Very Good, Very ...
## $ color   <fctr> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J,...
## $ clarity <fctr> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, S...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, ...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, ...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, ...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, ...

2. We check summary statistics for each column:

summary(data)
##        X             carat               cut        color    
##  Min.   :    1   Min.   :0.2000   Fair     : 1610   D: 6775  
##  1st Qu.:13486   1st Qu.:0.4000   Good     : 4906   E: 9797  
##  Median :26971   Median :0.7000   Ideal    :21551   F: 9542  
##  Mean   :26971   Mean   :0.7979   Premium  :13791   G:11292  
##  3rd Qu.:40455   3rd Qu.:1.0400   Very Good:12082   H: 8304  
##  Max.   :53940   Max.   :5.0100                     I: 5422  
##                                                     J: 2808  
##     clarity          depth           table           price      
##  SI1    :13065   Min.   :43.00   Min.   :43.00   Min.   :  326  
##  VS2    :12258   1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950  
##  SI2    : 9194   Median :61.80   Median :57.00   Median : 2401  
##  VS1    : 8171   Mean   :61.75   Mean   :57.46   Mean   : 3933  
##  VVS2   : 5066   3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324  
##  VVS1   : 3655   Max.   :79.00   Max.   :95.00   Max.   :18823  
##  (Other): 2531                                                  
##        x                y                z         
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.700   Median : 5.710   Median : 3.530  
##  Mean   : 5.731   Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :10.740   Max.   :58.900   Max.   :31.800  
## 
qplot(carat, price, data=data, color=color, shape=cut)

qplot(log(carat), log(price),
data=data, color=clarity)

After taking logs of carat and price, the scatter plot is approximately linear. Price and carat are positively associated, and a straight line on log-log axes means the relationship is roughly a power law.
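
Why a straight line on log-log axes corresponds to a power law can be sketched with simulated data. The numbers below are synthetic and are not taken from diamonds.csv; the exponent `true_exponent` and the scale 5000 are arbitrary assumptions chosen for illustration. If price = a * carat^b, then log(price) = log(a) + b * log(carat), so a linear fit on the logged variables should recover b as its slope.

```r
# Synthetic illustration only: simulated values, not the diamonds data.
set.seed(503)
carat <- runif(500, min = 0.2, max = 3)
true_exponent <- 1.7   # assumed exponent, chosen for illustration
price <- 5000 * carat^true_exponent * exp(rnorm(500, sd = 0.2))

# Fit a straight line on the log-log scale; the slope estimates the exponent.
fit <- lm(log(price) ~ log(carat))
slope <- unname(coef(fit)[2])
slope
```

The same call, `lm(log(price) ~ log(carat), data = data)`, could be run on the data frame loaded above to estimate the exponent for the actual diamonds.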

Faceting splits a plot into panels, one per level of a categorical variable, so the same numeric relationship can be compared side by side:

qplot(price, carat, data=data,
facets = . ~ color)

Next, we facet by both color and clarity:

qplot(price, carat, data=data,
facets = color ~ clarity)
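
Since qplot() is deprecated in recent ggplot2 releases, the two faceted plots above can also be written with ggplot() and facet_grid(). The following is a sketch under one assumption: ggplot2's built-in `diamonds` data set is used as a stand-in for the `data` frame read from diamonds.csv, so the block runs on its own. The names `d`, `p_color` and `p_grid` are illustrative.

```r
library(ggplot2)

# Assumption: the built-in `diamonds` data stands in for the CSV-based `data`;
# both contain price, carat, color and clarity.
d <- ggplot2::diamonds

# Counterpart of facets = . ~ color: a single row of panels, one per color.
p_color <- ggplot(d, aes(x = price, y = carat)) +
  geom_point() +
  facet_grid(. ~ color)

# Counterpart of facets = color ~ clarity: rows by color, columns by clarity.
p_grid <- ggplot(d, aes(x = price, y = carat)) +
  geom_point() +
  facet_grid(color ~ clarity)

p_color
p_grid
```

In the formula passed to facet_grid(), the left-hand side defines the rows and the right-hand side the columns; a dot means no split in that direction.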

qplot(cut, data=data, geom="bar")

# p<-ggplot(data=data, aes(x=cut, y=price)) +
#   geom_bar(stat="identity", fill="steelblue")
# 
# p
qplot(price, data=data, binwidth = 1000,
geom="histogram")

# geom_bar() counts diamonds per color, so the y axis is a count, not a price.
b1 <- ggplot(data, aes(x = color, fill = cut)) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  geom_bar() +
  labs(x = "Diamond Color", y = "Count", fill = "Cut")
b1

Principal Component Analysis

References:

Image: https://www.google.com.tr/search?q=diamond&source=lnms&tbm=isch&sa=X&ved=0ahUKEwj98rTc9PDXAhWIJVAKHezbBXQQ_AUICigB&biw=1280&bih=590#imgrc=9a9zoVEXY8u10M:

http://had.co.nz/stat480/lectures/07-r-intro.pdf