This is HW2. I’m going to use the “Wife Working Hours” dataset from the Rdatasets collection: https://vincentarelbundock.github.io/Rdatasets/datasets.html. Its documentation page is at https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/doc/Ecdat/Workinghours.html
This is HW2 explanation text from Blackboard:
+ Find an interesting data set and do some light exploratory analysis on it.
+ Nothing too long.
+ Introduce the data set’s properties (number of rows and columns, nature of columns and rows, some interesting visualization and what can be done with it).
+ Use RMarkdown and html output just like the Hw before.
+ Explicitly state all your code and refer to the place of the data set.
# Check working directory
getwd()
## [1] "/home/semihtekten/Documents/r_base/bda503_hw2"
Working directory is correct. Let’s import data from our csv file.
# Import csv file; the first line holds the column names.
dat <- read.csv("workinghours.csv", header = TRUE)
Data is imported. Let’s find out some details.
# Count rows
nrow(dat)
## [1] 3382
# Count columns
ncol(dat)
## [1] 13
# Structure of data
str(dat)
## 'data.frame': 3382 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ hours : int 2000 390 1900 0 3177 0 0 1040 2040 0 ...
## $ income : int 350 241 160 80 456 390 181 726 -5 78 ...
## $ age : int 26 29 33 20 33 22 41 31 33 30 ...
## $ education : int 12 8 10 9 12 12 9 16 12 11 ...
## $ child5 : int 0 0 0 2 0 2 0 2 0 1 ...
## $ child13 : int 1 1 2 0 2 0 0 1 3 1 ...
## $ child17 : int 0 1 0 0 0 0 1 0 0 0 ...
## $ nonwhite : int 0 0 0 0 0 0 0 0 0 0 ...
## $ owned : int 1 1 1 1 1 1 1 1 0 0 ...
## $ mortgage : int 1 1 0 1 1 1 0 1 0 0 ...
## $ occupation: Factor w/ 4 levels "fr","mp","other",..: 4 3 4 3 4 3 4 2 1 3 ...
## $ unemp : int 7 4 7 7 7 7 7 3 4 5 ...
# Summary of data
summary(dat)
## X hours income age
## Min. : 1.0 Min. : 0 Min. :-139.0 Min. :18.00
## 1st Qu.: 846.2 1st Qu.: 0 1st Qu.: 146.0 1st Qu.:28.00
## Median :1691.5 Median :1304 Median : 247.0 Median :34.00
## Mean :1691.5 Mean :1135 Mean : 296.9 Mean :36.81
## 3rd Qu.:2536.8 3rd Qu.:1944 3rd Qu.: 368.8 3rd Qu.:44.00
## Max. :3382.0 Max. :5840 Max. :7220.0 Max. :64.00
## education child5 child13 child17
## Min. : 0.00 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:12.00 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000
## Median :12.00 Median :0.0000 Median :0.0000 Median :0.000
## Mean :12.55 Mean :0.5074 Mean :0.5618 Mean :0.215
## 3rd Qu.:14.00 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :17.00 Max. :4.0000 Max. :5.0000 Max. :6.000
## nonwhite owned mortgage occupation
## Min. :0.0000 Min. :0.000 Min. :0.0000 fr : 85
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 mp : 962
## Median :0.0000 Median :1.000 Median :1.0000 other:1314
## Mean :0.2957 Mean :0.681 Mean :0.5278 swcc :1021
## 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.000 Max. :1.0000
## unemp
## Min. : 1.000
## 1st Qu.: 4.000
## Median : 5.000
## Mean : 5.641
## 3rd Qu.: 7.000
## Max. :30.000
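The first quartile of hours in the summary above is 0, so at least a quarter of the wives report no working hours at all. As a minimal sketch, that share can be computed directly. The vector below holds only the first ten hours values shown in the str() output, used as a stand-in so the example runs without the csv file; `hours_head` and `share_zero` are illustrative names, and on the full data one would use dat$hours instead.

```r
# First ten values of `hours`, copied from the str() output above
# (a stand-in for dat$hours, so the example runs without the csv file).
hours_head <- c(2000, 390, 1900, 0, 3177, 0, 0, 1040, 2040, 0)

# Share of observations with zero working hours.
share_zero <- mean(hours_head == 0)
share_zero
```

For these ten rows the share is 0.4; the value for the full data set will differ.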
Let’s do some visualization using ggplot2.
# Call the library ggplot2
library(ggplot2)
# Wife working hours per year, restricted to the 1 to 4001 range (drops zero-hour rows and high outliers)
ggplot(data=dat,aes(x=hours)) +
geom_histogram(breaks=seq(1, 4001, by=100),col="black",fill="azure4") +
labs(title="Histogram of wife working hours per year",x="Working Hours",y="Count")
Among the working wives, the histogram peaks at around 2,000 hours per year, which is approximately 5.5 hours a day when averaged over all 365 days.
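The 5.5 hours a day figure comes from spreading 2,000 hours over every calendar day. A small sketch of that arithmetic, with a per-week figure added for comparison (the 52-week divisor is an assumption made here, not something stated in the data documentation):

```r
# Typical yearly hours read off the histogram peak.
hours_per_year <- 2000

# Averaged over every calendar day of the year.
hours_per_day <- hours_per_year / 365

# Averaged over 52 weeks, for comparison with a full-time week.
hours_per_week <- hours_per_year / 52

round(c(per_day = hours_per_day, per_week = hours_per_week), 1)
```

This gives about 5.5 hours per day and about 38.5 hours per week, so 2,000 hours a year is close to a full-time schedule.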
# Household income, restricted to the 0 to 1000 range (drops negative values and high outliers)
ggplot(data=dat,aes(x=income)) +
geom_histogram(breaks=seq(0, 1000, by=25),col="black",fill="azure4") +
labs(title="Histogram of household income in hundreds of dollars",x="Annual Household Income",y="Count")
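The annual income data is right-skewed: most households sit at the lower end, with a long tail of higher incomes. The summary agrees, with a mean of 296.9 above the median of 247.0.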
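Right skew can also be checked numerically rather than by eye. The sketch below compares mean and median and computes a moment-based skewness. It uses only the first ten income values shown in the str() output as a stand-in for dat$income; `income_head` and `sample_skewness` are names introduced here for illustration, and `sample_skewness` is a small helper, not a base R function.

```r
# First ten values of `income` (hundreds of dollars), copied from the
# str() output above; a stand-in for dat$income.
income_head <- c(350, 241, 160, 80, 456, 390, 181, 726, -5, 78)

# Moment-based sample skewness: positive values indicate a longer right tail.
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

mean(income_head) > median(income_head)  # mean above median
sample_skewness(income_head) > 0         # positive skewness
```

Both checks return TRUE for these ten rows. Applying the same two lines to dat$income would test the full data set.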
# Working hours against income, limited to income 0 to 2000 and hours 0 to 3000
ggplot(data=dat,aes(x=income,y=hours)) +
geom_point(colour = "azure4", size = 0.5) +
xlim(0, 2000) +
ylim(0, 3000) +
geom_smooth(method="gam", size=0.8) +
labs(title="Working Hours & Income",x="Income",y="Hours")
## Warning: Removed 35 rows containing non-finite values (stat_smooth).
## Warning: Removed 35 rows containing missing values (geom_point).
Within the plotted range, the smoothed curve suggests that working hours tend to decrease as income increases.
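The two warnings in the output above appear because xlim() and ylim() discard rows outside the limits before the points are drawn and the smoother is fitted. A minimal sketch of counting such rows follows. It uses only the first ten rows shown in the str() output; `dat_head` and `outside` are illustrative names, and the same expression applied to the full `dat` should correspond to the 35 rows reported in the warnings.

```r
# First ten rows of `income` and `hours`, copied from the str() output above;
# `dat_head` is a stand-in for the full `dat` data frame.
dat_head <- data.frame(
  income = c(350, 241, 160, 80, 456, 390, 181, 726, -5, 78),
  hours  = c(2000, 390, 1900, 0, 3177, 0, 0, 1040, 2040, 0)
)

# A row is dropped when either coordinate falls outside the axis limits
# used in the plot: xlim(0, 2000) and ylim(0, 3000).
outside <- dat_head$income < 0 | dat_head$income > 2000 |
  dat_head$hours < 0 | dat_head$hours > 3000

sum(outside)
```

Two of these ten rows fall outside the limits (one with 3177 hours, one with an income of -5). If the dropped rows should still inform the smoother, ggplot2's coord_cartesian(xlim = ..., ylim = ...) zooms the view without removing any data.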
Documentation:
R: Wife Working Hours
Description
a cross-section from 1987
number of observations: 3382
observation: individuals
country: United States
Usage
data(Workinghours)
Format
A dataframe containing:
+ hours: wife working hours per year
+ income: the other household income in hundreds of dollars
+ age: age of the wife
+ education: education years of the wife
+ child5: number of children for ages 0 to 5
+ child13: number of children for ages 6 to 13
+ child17: number of children for ages 14 to 17
+ nonwhite: non-white?
+ owned: is the home owned by the household?
+ mortgage: is the home on mortgage?
+ occupation: occupation of the husband, one of mp (manager or
+ unemp: local unemployment rate in %
Source
Lee, Myoung-Jae (1995) “Semi-parametric estimation of simultaneous equations with limited dependent variables: a case study of female labour supply”, Journal of Applied Econometrics, 10(2), April-June, 187–200.
References
Journal of Applied Econometrics data archive: http://qed.econ.queensu.ca/jae/