hw2_tektens



This is HW2. I’m going to use the “Wife Working Hours” dataset (Workinghours, from the Ecdat collection), which is listed at https://vincentarelbundock.github.io/Rdatasets/datasets.html and documented at https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/doc/Ecdat/Workinghours.html

This is the HW2 explanation text from Blackboard:
+ Find an interesting data set and do some light exploratory analysis on it.
+ Nothing too long.
+ Introduce the data set’s properties (number of rows and columns, nature of columns and rows, some interesting visualization and what can be done with it).
+ Use RMarkdown and html output just like the Hw before.
+ Explicitly state all your code and refer to the place of the data set.


# Check working directory
getwd()
## [1] "/home/semihtekten/Documents/r_base/bda503_hw2"


The working directory is correct. Let’s import the data from our CSV file.


# Import the csv file; its first line is the header.
dat <- read.csv("workinghours.csv", header = TRUE)


Data is imported. Let’s find out some details.


# Count rows
nrow(dat)
## [1] 3382


# Count columns
ncol(dat)
## [1] 13


# Structure of data
str(dat)
## 'data.frame':    3382 obs. of  13 variables:
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ hours     : int  2000 390 1900 0 3177 0 0 1040 2040 0 ...
##  $ income    : int  350 241 160 80 456 390 181 726 -5 78 ...
##  $ age       : int  26 29 33 20 33 22 41 31 33 30 ...
##  $ education : int  12 8 10 9 12 12 9 16 12 11 ...
##  $ child5    : int  0 0 0 2 0 2 0 2 0 1 ...
##  $ child13   : int  1 1 2 0 2 0 0 1 3 1 ...
##  $ child17   : int  0 1 0 0 0 0 1 0 0 0 ...
##  $ nonwhite  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ owned     : int  1 1 1 1 1 1 1 1 0 0 ...
##  $ mortgage  : int  1 1 0 1 1 1 0 1 0 0 ...
##  $ occupation: Factor w/ 4 levels "fr","mp","other",..: 4 3 4 3 4 3 4 2 1 3 ...
##  $ unemp     : int  7 4 7 7 7 7 7 3 4 5 ...
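
A note on the `Factor` type shown for occupation: this reflects R versions before 4.0, where read.csv converted strings to factors by default. From R 4.0.0 on, the default is stringsAsFactors = FALSE, so the same import would give a character column unless the argument is set explicitly. A minimal sketch, assuming a small inline CSV in place of workinghours.csv (the names `csv_text` and `sample_dat` are illustrative; the values are the first three rows from the str() output above):

```r
# Inline stand-in for workinghours.csv (first three rows, three columns only).
csv_text <- "hours,income,occupation
2000,350,swcc
390,241,other
1900,160,swcc"

# Ask for factors explicitly so the result does not depend on the R version.
sample_dat <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)

is.factor(sample_dat$occupation)  # TRUE on any R version
levels(sample_dat$occupation)     # levels present in this small sample
```

On the real file the equivalent call would be read.csv("workinghours.csv", header = TRUE, stringsAsFactors = TRUE).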


# Summary of data
summary(dat)
##        X              hours          income            age       
##  Min.   :   1.0   Min.   :   0   Min.   :-139.0   Min.   :18.00  
##  1st Qu.: 846.2   1st Qu.:   0   1st Qu.: 146.0   1st Qu.:28.00  
##  Median :1691.5   Median :1304   Median : 247.0   Median :34.00  
##  Mean   :1691.5   Mean   :1135   Mean   : 296.9   Mean   :36.81  
##  3rd Qu.:2536.8   3rd Qu.:1944   3rd Qu.: 368.8   3rd Qu.:44.00  
##  Max.   :3382.0   Max.   :5840   Max.   :7220.0   Max.   :64.00  
##    education         child5          child13          child17     
##  Min.   : 0.00   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:12.00   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000  
##  Median :12.00   Median :0.0000   Median :0.0000   Median :0.000  
##  Mean   :12.55   Mean   :0.5074   Mean   :0.5618   Mean   :0.215  
##  3rd Qu.:14.00   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.000  
##  Max.   :17.00   Max.   :4.0000   Max.   :5.0000   Max.   :6.000  
##     nonwhite          owned          mortgage      occupation  
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000   fr   :  85  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000   mp   : 962  
##  Median :0.0000   Median :1.000   Median :1.0000   other:1314  
##  Mean   :0.2957   Mean   :0.681   Mean   :0.5278   swcc :1021  
##  3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.:1.0000               
##  Max.   :1.0000   Max.   :1.000   Max.   :1.0000               
##      unemp       
##  Min.   : 1.000  
##  1st Qu.: 4.000  
##  Median : 5.000  
##  Mean   : 5.641  
##  3rd Qu.: 7.000  
##  Max.   :30.000
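
Two details in this summary are worth spelling out. The first quartile of hours is 0, so at least a quarter of the wives did not work at all, and the means of the 0/1 columns (nonwhite, owned, mortgage) are simply proportions. A small sketch of how these shares could be computed, using only the first ten rows printed by str() as a hypothetical `sample_dat` (so the numbers differ from the full data):

```r
# First ten observations, copied from the str() output above.
sample_dat <- data.frame(
  hours    = c(2000, 390, 1900, 0, 3177, 0, 0, 1040, 2040, 0),
  owned    = c(1, 1, 1, 1, 1, 1, 1, 1, 0, 0),
  mortgage = c(1, 1, 0, 1, 1, 1, 0, 1, 0, 0)
)

# Share of wives with zero working hours.
share_not_working <- mean(sample_dat$hours == 0)

# The mean of a 0/1 indicator is the share of ones.
share_owned    <- mean(sample_dat$owned)
share_mortgage <- mean(sample_dat$mortgage)

c(not_working = share_not_working, owned = share_owned, mortgage = share_mortgage)
```

Replacing `sample_dat` with `dat` would give the same shares for all 3382 rows.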


Let’s do some visualization using ggplot2.


# Call the library ggplot2
library(ggplot2)


# Wife working hours per year, excluding non-working wives (0 hours) and values above 4001
ggplot(data = dat, aes(x = hours)) +
  geom_histogram(breaks = seq(1, 4001, by = 100), col = "black", fill = "azure4") +
  labs(title = "Histogram of wife working hours per year", x = "Working Hours", y = "Count")

Among working wives, the histogram peaks around 2,000 hours a year, which is approximately 5.5 hours per calendar day.
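
The per-day figure comes from dividing by 365 calendar days. Dividing by working days instead gives a more familiar number. A quick worked check, assuming 50 working weeks of 5 days each (this is my assumption for illustration, it is not stated in the dataset documentation):

```r
hours_per_year <- 2000

# Spread over every calendar day of the year.
per_calendar_day <- hours_per_year / 365

# Assumption: 50 working weeks per year, 5 working days per week.
per_week    <- hours_per_year / 50
per_workday <- per_week / 5

round(c(per_calendar_day = per_calendar_day,
        per_week         = per_week,
        per_workday      = per_workday), 2)
```

Under that assumption, 2,000 hours a year corresponds to a standard full-time schedule of 40 hours a week.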

# Other household income, excluding negative values and values above 1000 (in hundreds of dollars)
ggplot(data = dat, aes(x = income)) +
  geom_histogram(breaks = seq(0, 1000, by = 25), col = "black", fill = "azure4") +
  labs(title = "Histogram of other household income in hundreds of dollars", x = "Annual Household Income", y = "Count")

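The income distribution is right-skewed: in the summary above the mean (296.9) exceeds the median (247.0), and the maximum (7220) lies far above the third quartile (368.8).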
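
Right skew can also be checked numerically rather than by eye. The sketch below compares mean and median and computes a moment-based skewness coefficient. The helper `moment_skewness` is a hypothetical name defined here, not a base R function, and the input is only the first ten income values from the str() output, so the result is illustrative:

```r
# First ten income values, copied from the str() output above.
income <- c(350, 241, 160, 80, 456, 390, 181, 726, -5, 78)

# Hypothetical helper: third central moment divided by variance^(3/2).
moment_skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

mean(income) > median(income)  # TRUE when the mean is pulled to the right
moment_skewness(income)        # positive for a right-skewed sample
```

Calling moment_skewness(dat$income) would apply the same check to the full column.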

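# Working hours against other household income, limited to income 0-2000 and hours 0-3000, with a GAM smoother
ggplot(data = dat, aes(x = income, y = hours)) +
  geom_point(colour = "azure4", size = 0.5) +
  xlim(0, 2000) +
  ylim(0, 3000) +
  geom_smooth(method = "gam", size = 0.8) +
  labs(title = "Working Hours & Income", x = "Income", y = "Hours")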
## Warning: Removed 35 rows containing non-finite values (stat_smooth).
## Warning: Removed 35 rows containing missing values (geom_point).
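
These two warnings come from xlim() and ylim(): in ggplot2 they set scale limits, and observations outside the limits are turned into NA and dropped before the smoother is fitted. The count can be reproduced in base R. A sketch on the first ten rows from str() (the `sample_dat` name is illustrative; on the full data the warning reports 35 rows):

```r
# First ten observations, copied from the str() output above.
sample_dat <- data.frame(
  hours  = c(2000, 390, 1900, 0, 3177, 0, 0, 1040, 2040, 0),
  income = c(350, 241, 160, 80, 456, 390, 181, 726, -5, 78)
)

# Rows falling outside xlim(0, 2000) or ylim(0, 3000).
outside <- with(sample_dat,
                income < 0 | income > 2000 | hours < 0 | hours > 3000)

n_dropped <- sum(outside)
n_dropped              # 2: one row with hours 3177, one with income -5
sample_dat[outside, ]
```

If the goal is only to zoom in, coord_cartesian(xlim = c(0, 2000), ylim = c(0, 3000)) limits the visible area without removing rows from the smoother fit.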

The smoothed curve suggests that the wife’s working hours tend to decrease as the other household income increases.
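
A visual trend from a smoother can be backed by a simple numeric check, for example a rank correlation between income and hours. This is one possible check and not part of the analysis above. It runs on the first ten rows from the str() output only, so its value says nothing about the full dataset:

```r
# First ten observations, copied from the str() output above.
hours  <- c(2000, 390, 1900, 0, 3177, 0, 0, 1040, 2040, 0)
income <- c(350, 241, 160, 80, 456, 390, 181, 726, -5, 78)

# Spearman rank correlation; ties (the repeated zeros) get average ranks.
rho <- cor(income, hours, method = "spearman")
rho
```

On the full data the call would be cor(dat$income, dat$hours, method = "spearman"), and a negative value would be consistent with the plot.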




Documentation:

R: Wife Working Hours
Description
a cross-section from 1987
number of observations : 3382
observation : individuals
country : United States

Usage

data(Workinghours)

Format
A dataframe containing:

hours
wife working hours per year

income
the other household income in hundreds of dollars

age
age of the wife

education
education years of the wife

child5
number of children for ages 0 to 5

child13
number of children for ages 6 to 13

child17
number of children for ages 14 to 17

nonwhite
non-white?

owned
is the home owned by the household?

mortgage
is the home on mortgage?

occupation
occupation of the husband, one of mp (manager or

unemp
local unemployment rate in %

Source
Lee, Myoung-Jae (1995) “Semi-parametric estimation of simultaneous equations with limited dependent variables: a case study of female labour supply”, Journal of Applied Econometrics, 10(2), April–June, 187–200.

References
Journal of Applied Econometrics data archive : http://qed.econ.queensu.ca/jae/