hw2_uluturktekteny
Source:
https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/Ecdat/LaborSupply.csv
https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/doc/Ecdat/LaborSupply.html
# First, check working directory
getwd()
## [1] "C:/Users/Yağmur Ulutürk/Desktop/r_base/HW2"
# Import the data set (Wages and Hours Worked).
# 5320 obs. of 8 variables
laborSupply <- read.csv('LaborSupply.csv')
Explanations of variables:
lnhr = log of annual hours worked
lnwg = log of hourly wage
kids = number of children
disab = bad health
# Structure of the data
str(laborSupply)
## 'data.frame': 5320 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ lnhr : num 7.58 7.75 7.65 7.47 7.5 7.5 7.56 7.76 7.86 7.82 ...
## $ lnwg : num 1.91 1.89 1.91 1.89 1.94 1.93 2.12 1.94 1.99 1.98 ...
## $ kids : int 2 2 2 2 2 2 2 2 2 2 ...
## $ age : int 27 28 29 30 31 32 33 34 35 36 ...
## $ disab: int 0 0 0 0 0 0 0 0 0 0 ...
## $ id : int 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 ...
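The `X` column in the structure above appears to be a row index carried over from the CSV file rather than a measured variable. A minimal sketch of one way to keep it out of the data frame, assuming the first CSV column holds only row labels: the inline `csv_text` is a tiny stand-in for the real file, not the actual data.

```r
# Stand-in CSV text (hypothetical, mimics a file whose first column is an unnamed row index)
csv_text <- '"","lnhr","lnwg"\n"1",7.58,1.91\n"2",7.75,1.89'

# row.names = 1 tells read.csv to use the first column as row names
# instead of importing it as a variable called X
demo <- read.csv(text = csv_text, row.names = 1)

names(demo)
```

With the real file the equivalent call would be `read.csv('LaborSupply.csv', row.names = 1)`, which should leave 7 variables instead of 8.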
# Summary of data set.
summary(laborSupply)
##        X             lnhr            lnwg             kids      
##  Min.   :   1   Min.   :2.770   Min.   :-0.260   Min.   :0.000  
##  1st Qu.:1331   1st Qu.:7.580   1st Qu.: 2.370   1st Qu.:1.000  
##  Median :2660   Median :7.650   Median : 2.640   Median :2.000  
##  Mean   :2660   Mean   :7.657   Mean   : 2.609   Mean   :1.556  
##  3rd Qu.:3990   3rd Qu.:7.780   3rd Qu.: 2.860   3rd Qu.:2.000  
##  Max.   :5320   Max.   :8.560   Max.   : 4.690   Max.   :6.000  
##       age            disab              id             year     
##  Min.   :22.00   Min.   :0.0000   Min.   :  1.0   Min.   :1979  
##  1st Qu.:32.00   1st Qu.:0.0000   1st Qu.:133.8   1st Qu.:1981  
##  Median :38.00   Median :0.0000   Median :266.5   Median :1984  
##  Mean   :38.92   Mean   :0.0609   Mean   :266.5   Mean   :1984  
##  3rd Qu.:45.00   3rd Qu.:0.0000   3rd Qu.:399.2   3rd Qu.:1986  
##  Max.   :60.00   Max.   :1.0000   Max.   :532.0   Max.   :1988
# Plot every pair of variables in the data (scatterplot matrix).
plot(laborSupply)
# Count rows.
nrow(laborSupply)
## [1] 5320
# Count columns.
ncol(laborSupply)
## [1] 8
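The row count, together with the `id` and `year` columns, suggests a panel: 532 individuals observed over the ten years 1979 to 1988 would give exactly 5320 rows. A minimal sketch of how that could be checked, using a small made-up data frame (`demo`) in place of `laborSupply`:

```r
# Stand-in panel (hypothetical): 3 individuals observed in 4 years each
demo <- data.frame(id   = rep(1:3, each = 4),
                   year = rep(1979:1982, times = 3))

# dim() returns rows and columns in one call (same information as nrow/ncol)
dim(demo)

# Number of distinct individuals
n_ids <- length(unique(demo$id))

# Observations per individual; the panel is balanced if all counts are equal
obs_per_id <- table(demo$id)
balanced <- all(obs_per_id == obs_per_id[1])
```

Replacing `demo` with `laborSupply` would show whether every `id` really has the same number of yearly observations.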
# Load ggplot2 for the visualizations below.
library(ggplot2)
# Histogram of age.
# Columns are referenced by name inside aes(); the data frame is supplied via data=.
ggplot(data = laborSupply, aes(x = age)) +
  geom_histogram(binwidth = 2) +
  scale_x_continuous(limits = c(20, 60), breaks = seq(20, 60, 5)) +
  labs(title = "Histogram of age data", x = "Age", y = "Count")
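The histogram suggests that age is roughly bell-shaped, although a histogram alone does not establish normality. The mean (38.92) sits slightly above the median (38.00), which hints at a mild right skew.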
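A visual impression of normality can be backed up with a couple of numbers. The sketch below is one possible check, not part of the original homework: `normality_summary` is a hypothetical helper that reports approximate moment-based skewness and excess kurtosis (both near 0 for a normal sample), and `demo_age` is simulated stand-in data.

```r
# Hypothetical helper: approximate skewness and excess kurtosis of a numeric vector
normality_summary <- function(x) {
  x <- x[is.finite(x)]                 # drop NA, NaN and Inf
  z <- (x - mean(x)) / sd(x)           # standardize
  c(skewness = mean(z^3), excess_kurtosis = mean(z^4) - 3)
}

# Simulated stand-in for the age column (assumption, not the real data)
set.seed(1)
demo_age <- round(rnorm(1000, mean = 39, sd = 8))
normality_summary(demo_age)

# Q-Q plot: points close to the reference line suggest approximate normality
if (interactive()) {
  qqnorm(demo_age)
  qqline(demo_age)
}
```

With the real data the call would be `normality_summary(laborSupply$age)`. Note that `shapiro.test()` accepts at most 5000 observations, so it cannot be applied directly to a column with 5320 values; a random subsample would be needed.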
# Histogram of the log of hourly wage.
ggplot(data = laborSupply, aes(x = lnwg)) +
  geom_histogram(binwidth = 0.10) +
  scale_x_continuous(limits = c(1, 5), breaks = seq(1, 5, 0.5)) +
  labs(title = "Histogram of log of hourly wage data", x = "Log of hourly wage", y = "Count")
## Warning: Removed 6 rows containing non-finite values (stat_bin).
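The log of hourly wage also looks roughly bell-shaped, centred near its median of 2.64. The warning above reflects 6 observations that fall outside the x-axis limits of 1 to 5 and are therefore dropped from the plot.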
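The "Removed 6 rows" warning comes from the `limits` argument: values outside the scale limits are turned into missing values before binning. A small base R sketch of how such a count could be verified, where `count_outside` is a hypothetical helper and `demo_lnwg` holds made-up values:

```r
# Hypothetical helper: number of values strictly outside a pair of scale limits
count_outside <- function(x, limits) {
  sum(x < limits[1] | x > limits[2], na.rm = TRUE)
}

# Stand-in values (assumption, not the real data)
demo_lnwg <- c(-0.26, 0.9, 1.91, 2.64, 4.69, 5.2)

count_outside(demo_lnwg, c(1, 5))   # 3 values lie outside [1, 5]
```

On the real data, `count_outside(laborSupply$lnwg, c(1, 5))` should match the number reported in the warning. If the goal is only to zoom in, `coord_cartesian(xlim = c(1, 5))` restricts the visible range without discarding observations.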
# Scatterplot of the log of annual hours worked (x) against the log of hourly wage (y).
ggplot(data = laborSupply, aes(x = lnhr, y = lnwg)) +
  geom_point(size = 0.5, col = 'purple') +
  scale_x_continuous(limits = c(5, 9), breaks = seq(5, 9, 0.5)) +
  scale_y_continuous(limits = c(1, 4.5), breaks = seq(1, 4.5, 0.5)) +
  labs(title = "Comparing log of hourly wage and log of annual hours worked",
       x = "Log of annual hours worked", y = "Log of hourly wage") +
  geom_smooth(method = "gam", size = 0.8)
## Warning: Removed 19 rows containing non-finite values (stat_smooth).
## Warning: Removed 19 rows containing missing values (geom_point).
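The fitted curve in the scatterplot suggests that the log of hourly wage tends to increase with the log of annual hours worked. The 19 observations outside the axis limits are excluded from both the points and the smooth.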
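The visual impression of a positive relationship can be quantified with a correlation coefficient and a simple linear fit. The sketch below uses simulated stand-in data (`demo`, with arbitrary made-up parameters), so its numbers say nothing about the real data set; it only shows the calls.

```r
# Simulated stand-in data (assumption): log wage and log hours with a mild positive link
set.seed(42)
demo <- data.frame(lnwg = rnorm(200, mean = 2.6, sd = 0.4))
demo$lnhr <- 7.4 + 0.1 * demo$lnwg + rnorm(200, mean = 0, sd = 0.2)

# Pearson correlation between the two log variables
r <- cor(demo$lnhr, demo$lnwg)

# Simple linear regression of log hours on log wage;
# the slope can be read as an elasticity because both variables are in logs
fit <- lm(lnhr ~ lnwg, data = demo)
coef(fit)
```

With the real data, `cor(laborSupply$lnhr, laborSupply$lnwg)` and `lm(lnhr ~ lnwg, data = laborSupply)` would be the corresponding calls. Because each individual appears in ten yearly rows, the observations are not independent, so standard errors from a plain `lm` fit should be read with caution.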