Content
The Vocab dataset has 21,638 rows and 5 columns.
- year: Year of the survey.
- sex: Sex of the respondent, Female or Male.
- education: Education in years.
- vocabulary: Vocabulary test score: number correct on a 10-word test.
The Vocabulary and Education dataset comes from the U.S. General Social Surveys, 1972-2004. It contains the vocabulary test scores of U.S. respondents, together with their survey year, sex, and years of education, for surveys run between 1974 and 2004.
I try to find a correlation between vocabulary test score and education. Additionally, I examine the difference between the two sexes across the survey years.
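# Quick look at the dataset (dsin is the data frame loaded earlier)
library(dplyr)    # glimpse(), group_by(), summarise(), filter(), %>%
library(ggplot2)  # plots further down
glimpse(dsin)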
## Observations: 21,638
## Variables: 5
## $ X <int> 20040001, 20040002, 20040003, 20040005, 20040008, 2...
## $ year <int> 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 200...
## $ sex <fctr> Female, Female, Male, Female, Male, Male, Female, ...
## $ education <int> 9, 14, 14, 17, 14, 14, 12, 10, 11, 9, 16, 11, 14, 1...
## $ vocabulary <int> 3, 6, 9, 8, 1, 7, 6, 6, 5, 1, 4, 6, 9, 0, 6, 9, 10,...
# Which years are included in the survey, with the average years of education and the average vocabulary test score for each year.
g3 <- dsin %>%
  group_by(year) %>%
  summarise(Avg_Educ_In_Year = round(mean(education), 2),
            Avg_Vocab_Test = round(mean(vocabulary), 2))
## Warning: package 'bindrcpp' was built under R version 3.4.2
g3
## # A tibble: 16 x 3
## year Avg_Educ_In_Year Avg_Vocab_Test
## <int> <dbl> <dbl>
## 1 1974 11.87 6.02
## 2 1976 11.83 6.04
## 3 1978 12.04 5.96
## 4 1982 12.21 5.74
## 5 1984 12.48 5.99
## 6 1987 12.57 5.69
## 7 1988 12.75 5.77
## 8 1989 12.84 5.94
## 9 1990 13.12 6.14
## 10 1991 12.90 6.09
## 11 1993 13.10 6.03
## 12 1994 13.31 6.17
## 13 1996 13.44 6.04
## 14 1998 13.39 6.13
## 15 2000 13.31 6.01
## 16 2004 13.75 6.21
ggplot(g3, aes(year)) +
  geom_point(aes(y = Avg_Vocab_Test, colour = "Average Vocabulary Test Score")) +
  geom_point(aes(y = Avg_Educ_In_Year, colour = "Average Education in Year")) +
  scale_colour_manual(values = c("red", "blue")) +
  theme_minimal() +
  xlab("Year") +
  ylab("Average") +
  ggtitle("Is there any correlation?")
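To put a number on the question in the plot title, here is a minimal sketch that computes the Pearson correlation between the two yearly averages. The values are copied from the g3 table above; the names yearly and r_yearly are introduced here only for illustration. Keep in mind that this is a correlation across 16 yearly averages, not across individual respondents, so it can differ from the respondent-level figure.

```r
# Yearly averages copied from the g3 summary table above.
yearly <- data.frame(
  year = c(1974, 1976, 1978, 1982, 1984, 1987, 1988, 1989,
           1990, 1991, 1993, 1994, 1996, 1998, 2000, 2004),
  avg_edu = c(11.87, 11.83, 12.04, 12.21, 12.48, 12.57, 12.75, 12.84,
              13.12, 12.90, 13.10, 13.31, 13.44, 13.39, 13.31, 13.75),
  avg_vocab = c(6.02, 6.04, 5.96, 5.74, 5.99, 5.69, 5.77, 5.94,
                6.14, 6.09, 6.03, 6.17, 6.04, 6.13, 6.01, 6.21)
)

# Pearson correlation between average education and average vocabulary score,
# taken over the 16 survey years (not over individual respondents).
r_yearly <- cor(yearly$avg_edu, yearly$avg_vocab)
print(round(r_yearly, 2))

# With the full data frame loaded, the respondent-level version would be:
# cor(dsin$education, dsin$vocabulary)
```

The respondent-level call is left commented out because it needs the full dsin data frame, which this sketch does not load.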
## Break the output down by sex.
a <- dsin %>%
  group_by(year, sex) %>%
  summarise(cnt = n(),
            edu_sum = sum(education),
            voc_sum = sum(vocabulary),
            avg_edu = round(edu_sum / cnt, 2),
            avg_vocab = round(voc_sum / cnt, 2))
##Female
a%>%filter(sex=="Female")
## # A tibble: 16 x 7
## # Groups: year [16]
## year sex cnt edu_sum voc_sum avg_edu avg_vocab
## <int> <fctr> <int> <int> <int> <dbl> <dbl>
## 1 1974 Female 774 9152 4704 11.82 6.08
## 2 1976 Female 791 9174 4845 11.60 6.13
## 3 1978 Female 861 10151 5183 11.79 6.02
## 4 1982 Female 999 12097 5760 12.11 5.77
## 5 1984 Female 828 10213 4990 12.33 6.03
## 6 1987 Female 955 11852 5459 12.41 5.72
## 7 1988 Female 513 6441 2986 12.56 5.82
## 8 1989 Female 559 7006 3304 12.53 5.91
## 9 1990 Female 481 6232 2930 12.96 6.09
## 10 1991 Female 565 7185 3437 12.72 6.08
## 11 1993 Female 560 7252 3385 12.95 6.04
## 12 1994 Female 1085 14410 6743 13.28 6.21
## 13 1996 Female 1050 13888 6382 13.23 6.08
## 14 1998 Female 760 10073 4719 13.25 6.21
## 15 2000 Female 730 9551 4421 13.08 6.06
## 16 2004 Female 801 10990 5027 13.72 6.28
##Male
a%>%filter(sex=="Male")
## # A tibble: 16 x 7
## # Groups: year [16]
## year sex cnt edu_sum voc_sum avg_edu avg_vocab
## <int> <fctr> <int> <int> <int> <dbl> <dbl>
## 1 1974 Male 672 8019 4007 11.93 5.96
## 2 1976 Male 643 7797 3823 12.13 5.95
## 3 1978 Male 623 7722 3669 12.39 5.89
## 4 1982 Male 724 8943 4132 12.35 5.71
## 5 1984 Male 574 7278 3414 12.68 5.95
## 6 1987 Male 724 9256 4102 12.78 5.67
## 7 1988 Male 407 5292 2319 13.00 5.70
## 8 1989 Male 409 5427 2446 13.27 5.98
## 9 1990 Male 371 4943 2300 13.32 6.20
## 10 1991 Male 396 5214 2416 13.17 6.10
## 11 1993 Male 453 6022 2727 13.29 6.02
## 12 1994 Male 755 10072 4605 13.34 6.10
## 13 1996 Male 816 11185 4888 13.71 5.99
## 14 1998 Male 541 7348 3258 13.58 6.02
## 15 2000 Male 581 7893 3460 13.59 5.96
## 16 2004 Male 637 8782 3904 13.79 6.13
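## Draw a graph that shows the difference in average vocabulary score between the sexes by year.
## size is set outside aes() so that it is a fixed point size and not a mapped variable with its own legend.
ggplot(data = a, aes(x = sex, y = year, color = avg_vocab)) +
  geom_point(size = 5) +
  coord_flip() +
  scale_color_continuous(low = "red", high = "black", guide = "colourbar") +
  ggtitle("Who failed?") +
  xlab("Gender") +
  ylab("Years")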
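Since colour shades are hard to compare by eye, a small sketch of the same comparison in numbers may help. The avg_vocab values below are copied from the Female and Male tables above; the names gap and female_higher are made up for this example and are not part of the original analysis.

```r
# Average vocabulary score per year and sex, copied from the tables above.
gap <- data.frame(
  year = c(1974, 1976, 1978, 1982, 1984, 1987, 1988, 1989,
           1990, 1991, 1993, 1994, 1996, 1998, 2000, 2004),
  female = c(6.08, 6.13, 6.02, 5.77, 6.03, 5.72, 5.82, 5.91,
             6.09, 6.08, 6.04, 6.21, 6.08, 6.21, 6.06, 6.28),
  male = c(5.96, 5.95, 5.89, 5.71, 5.95, 5.67, 5.70, 5.98,
           6.20, 6.10, 6.02, 6.10, 5.99, 6.02, 5.96, 6.13)
)

# Female minus male average score; positive means women scored higher that year.
gap$diff <- round(gap$female - gap$male, 2)

# Number of survey years in which the female average is higher.
female_higher <- sum(gap$diff > 0)

print(gap)
print(female_higher)
```

On these rounded averages, the female score is higher in 13 of the 16 survey years; the exceptions are 1989, 1990 and 1991. The gaps are small (at most 0.19 points on a 10-word test), so this sketch alone says nothing about whether they are statistically meaningful.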