hw2_uluturktekteny

Source:

https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/Ecdat/LaborSupply.csv

https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/doc/Ecdat/LaborSupply.html

# First, check working directory
getwd()
## [1] "C:/Users/Yağmur Ulutürk/Desktop/r_base/HW2"
# Import the data set (Wages and Hours Worked).
# 5320 obs. of 8 variables
laborSupply=read.csv('LaborSupply.csv')

Explanations of variables: lnhr = log of annual hours worked; lnwg = log of hourly wage; kids = number of children; disab = bad health (0/1 indicator)

# Structure of the data
str(laborSupply)
## 'data.frame':    5320 obs. of  8 variables:
##  $ X    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ lnhr : num  7.58 7.75 7.65 7.47 7.5 7.5 7.56 7.76 7.86 7.82 ...
##  $ lnwg : num  1.91 1.89 1.91 1.89 1.94 1.93 2.12 1.94 1.99 1.98 ...
##  $ kids : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ age  : int  27 28 29 30 31 32 33 34 35 36 ...
##  $ disab: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ id   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ year : int  1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 ...
# Summary of data set.
summary(laborSupply)
##        X             lnhr            lnwg             kids      
##  Min.   :   1   Min.   :2.770   Min.   :-0.260   Min.   :0.000  
##  1st Qu.:1331   1st Qu.:7.580   1st Qu.: 2.370   1st Qu.:1.000  
##  Median :2660   Median :7.650   Median : 2.640   Median :2.000  
##  Mean   :2660   Mean   :7.657   Mean   : 2.609   Mean   :1.556  
##  3rd Qu.:3990   3rd Qu.:7.780   3rd Qu.: 2.860   3rd Qu.:2.000  
##  Max.   :5320   Max.   :8.560   Max.   : 4.690   Max.   :6.000  
##       age            disab              id             year     
##  Min.   :22.00   Min.   :0.0000   Min.   :  1.0   Min.   :1979  
##  1st Qu.:32.00   1st Qu.:0.0000   1st Qu.:133.8   1st Qu.:1981  
##  Median :38.00   Median :0.0000   Median :266.5   Median :1984  
##  Mean   :38.92   Mean   :0.0609   Mean   :266.5   Mean   :1984  
##  3rd Qu.:45.00   3rd Qu.:0.0000   3rd Qu.:399.2   3rd Qu.:1986  
##  Max.   :60.00   Max.   :1.0000   Max.   :532.0   Max.   :1988
# Plot all the variables in the data (scatterplot matrix).
plot(laborSupply)

# Count rows.
nrow(laborSupply)
## [1] 5320
# Count columns.
ncol(laborSupply)
## [1] 8
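
The repeated id values and the 1979 to 1988 range of year in the str() output suggest panel data: 5320 rows would correspond to 532 individuals observed over 10 years each. A minimal sketch to check this, assuming each (id, year) pair should appear exactly once. The helper name is_balanced_panel is hypothetical and not part of the original analysis.

```r
# Hypothetical helper: TRUE if every id appears exactly once in every year.
is_balanced_panel <- function(df, id_col = "id", time_col = "year") {
  counts <- table(df[[id_col]], df[[time_col]])  # id-by-year count table
  all(counts == 1)
}

# Apply it to the homework data only if the file is available
# (assumes LaborSupply.csv sits in the working directory, as in the import step).
if (file.exists("LaborSupply.csv")) {
  laborSupply <- read.csv("LaborSupply.csv")
  print(is_balanced_panel(laborSupply))
  print(length(unique(laborSupply$id)))    # number of individuals
  print(length(unique(laborSupply$year)))  # number of years
}
```

If the panel is balanced, observations from the same individual are not independent of each other, which is worth keeping in mind when reading the histograms and the scatterplot.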
# Load ggplot2 for the visualizations.
library(ggplot2)
# See histogram of ages.
ggplot(data=laborSupply, aes(x=age))+
  geom_histogram(binwidth = 2)+
  scale_x_continuous(limits = c(20, 60), breaks = seq(20, 60, 5))+
  labs(title="Histogram of age data",x="Age",y="Count")

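The age histogram looks roughly bell-shaped, and the mean (38.92) and median (38) are close, which is consistent with an approximately symmetric distribution. A histogram alone does not establish normality, though.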
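
One way to back up a visual impression of normality is to compute the sample skewness and excess kurtosis (both are near 0 for normal data) and to draw a normal Q-Q plot. This is a minimal sketch in base R. The helper name shape_summary is hypothetical. shapiro.test() is avoided on purpose because it only accepts up to 5000 observations and the data has 5320.

```r
# Hypothetical helper: moment-based skewness and excess kurtosis.
shape_summary <- function(x) {
  x <- x[is.finite(x)]                              # drop NA/Inf values
  z <- (x - mean(x)) / sqrt(mean((x - mean(x))^2))  # standardize (population sd)
  c(skewness = mean(z^3), excess_kurtosis = mean(z^4) - 3)
}

# Apply it to the age column only if the file is available
# (assumes LaborSupply.csv sits in the working directory).
if (file.exists("LaborSupply.csv")) {
  laborSupply <- read.csv("LaborSupply.csv")
  print(shape_summary(laborSupply$age))
  qqnorm(laborSupply$age)  # points close to the line suggest normality
  qqline(laborSupply$age)
}
```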

# See histogram of log of hourly wage.
ggplot(data=laborSupply, aes(x=lnwg))+
  geom_histogram(binwidth = 0.10)+
  scale_x_continuous(limits = c(1, 5), breaks = seq(1, 5, 0.5))+
  labs(title="Histogram of log of hourly wage data",x="Log of hourly wage",y="Count")
## Warning: Removed 6 rows containing non-finite values (stat_bin).

The log of hourly wage also looks roughly bell-shaped within the plotted range, and its mean (2.609) and median (2.640) are close.
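
The warning about 6 removed rows most likely comes from the x-axis limits: ggplot2 drops values that fall outside scale limits before binning. Since the maximum of lnwg (4.69) is below the upper limit of 5, the dropped rows would be those with lnwg below 1 (the minimum is -0.26). A minimal sketch to count them, with a hypothetical helper name count_outside:

```r
# Hypothetical helper: number of non-missing values outside [lower, upper].
count_outside <- function(x, lower, upper) {
  sum(!is.na(x) & (x < lower | x > upper))
}

# Apply it to the wage column only if the file is available
# (assumes LaborSupply.csv sits in the working directory).
if (file.exists("LaborSupply.csv")) {
  laborSupply <- read.csv("LaborSupply.csv")
  # Should match the row count in the warning if the limits are the cause.
  print(count_outside(laborSupply$lnwg, lower = 1, upper = 5))
}
```

Widening the limits, or zooming with coord_cartesian() instead of setting scale limits, would keep these rows in the histogram.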

# See scatterplot of log of annual hours worked and log of hourly wage.
ggplot(data=laborSupply, aes(x=lnhr, y=lnwg))+
  geom_point(size=0.5,col='purple')+
  scale_x_continuous(limits = c(5, 9), breaks = seq(5, 9, 0.5))+
  scale_y_continuous(limits = c(1, 4.5), breaks = seq(1, 4.5, 0.5))+
  labs(title="Comparing log of hourly wage and log of annual hours worked",x="Log of annual hours worked",y="Log of hourly wage")+
  geom_smooth(method="gam", size=0.8)
## Warning: Removed 19 rows containing non-finite values (stat_smooth).
## Warning: Removed 19 rows containing missing values (geom_point).

In the scatterplot, the fitted smooth suggests that the log of hourly wage tends to increase with the log of annual hours worked.
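
To put a number on the trend in the scatterplot, one option is the Pearson correlation together with the slope of a simple linear fit. Note that this swaps in a straight-line model (lm) for the GAM smoother used in the plot, so it is only a rough summary. The helper name linear_trend is hypothetical.

```r
# Hypothetical helper: correlation and least-squares slope of y on x.
linear_trend <- function(x, y) {
  ok <- is.finite(x) & is.finite(y)  # keep complete, finite pairs only
  x <- x[ok]
  y <- y[ok]
  fit <- lm(y ~ x)
  c(correlation = cor(x, y), slope = unname(coef(fit)[2]))
}

# Apply it to the homework data only if the file is available
# (assumes LaborSupply.csv sits in the working directory).
if (file.exists("LaborSupply.csv")) {
  laborSupply <- read.csv("LaborSupply.csv")
  # x = log annual hours, y = log hourly wage, matching the plot's axes.
  print(linear_trend(laborSupply$lnhr, laborSupply$lnwg))
}
```

A positive correlation and slope would agree with the reading of the plot. Because both variables are in logs, the slope describes a relationship between percentage changes rather than raw hours and wages.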