Vocabulary and Education Dataset

The Vocabulary and Education dataset comes from the U.S. General Social Surveys, 1972-2004. It contains respondents' vocabulary test scores together with the survey year, sex, and years of education; the records used here span 1974-2004.

Content

The Vocab dataset has 21,638 rows and 5 columns.


Initial Exploratory Analysis

This dataset includes 21,638 observations and 5 variables, as listed below. I look for a correlation between vocabulary test score and education. Additionally, I examine how the two genders differ across the survey years.
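
The code in this section works on a data frame called `dsin`, but the step that creates it is not shown. The following is a minimal sketch of how it might be loaded, assuming it was read from the Vocab.csv file listed under References. The helper name `read_vocab` is hypothetical, and the demo file contains only the first five rows visible in the `glimpse()` output.

```r
# Hypothetical helper (not part of the original analysis): read a Vocab-style CSV.
# stringsAsFactors = TRUE keeps `sex` as a factor, matching the <fctr> column
# reported by glimpse().
read_vocab <- function(path) {
  read.csv(path, stringsAsFactors = TRUE)
}

# Demo on a temporary file holding the first five rows shown by glimpse()
sample_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "X,year,sex,education,vocabulary",
  "20040001,2004,Female,9,3",
  "20040002,2004,Female,14,6",
  "20040003,2004,Male,14,9",
  "20040005,2004,Female,17,8",
  "20040008,2004,Male,14,1"
), sample_csv)

dsin_sample <- read_vocab(sample_csv)
str(dsin_sample)
```

For the full data, the same helper could be pointed at the dataset URL from the References section, since `read.csv()` also accepts a URL. The remaining code additionally needs `library(dplyr)` and `library(ggplot2)` to be attached.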

# Quick look at the dataset
glimpse(dsin)
## Observations: 21,638
## Variables: 5
## $ X          <int> 20040001, 20040002, 20040003, 20040005, 20040008, 2...
## $ year       <int> 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 200...
## $ sex        <fctr> Female, Female, Male, Female, Male, Male, Female, ...
## $ education  <int> 9, 14, 14, 17, 14, 14, 12, 10, 11, 9, 16, 11, 14, 1...
## $ vocabulary <int> 3, 6, 9, 8, 1, 7, 6, 6, 5, 1, 4, 6, 9, 0, 6, 9, 10,...
# Survey years covered, with the average years of education and the average vocabulary test score in each year.

g3 <- dsin %>%
  group_by(year) %>%
  summarise(Avg_Educ_In_Year = round(sum(education) / n(), 2),
            Avg_Vocab_Test = round(sum(vocabulary) / n(), 2))
g3
## # A tibble: 16 x 3
##     year Avg_Educ_In_Year Avg_Vocab_Test
##    <int>            <dbl>          <dbl>
##  1  1974            11.87           6.02
##  2  1976            11.83           6.04
##  3  1978            12.04           5.96
##  4  1982            12.21           5.74
##  5  1984            12.48           5.99
##  6  1987            12.57           5.69
##  7  1988            12.75           5.77
##  8  1989            12.84           5.94
##  9  1990            13.12           6.14
## 10  1991            12.90           6.09
## 11  1993            13.10           6.03
## 12  1994            13.31           6.17
## 13  1996            13.44           6.04
## 14  1998            13.39           6.13
## 15  2000            13.31           6.01
## 16  2004            13.75           6.21
ggplot(g3, aes(year)) +
  geom_point(aes(y = Avg_Vocab_Test, colour = "Average Vocabulary Test Score")) +
  geom_point(aes(y = Avg_Educ_In_Year, colour = "Average Education in Year")) +
  scale_colour_manual(values = c("red", "blue")) +
  theme_minimal() +
  xlab("Year") + ylab("Average") +
  ggtitle("Is there any correlation?")
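
The plot puts two series with different ranges on one axis, which makes the relationship hard to judge by eye. As a rough numeric complement, the sketch below computes the Pearson correlation between the two yearly averages, using the values copied from the `g3` table. The name `yearly` is introduced here for illustration only. This is a correlation across 16 yearly means, not across individual respondents, so it should be read with caution.

```r
# Yearly averages copied from the g3 table printed above
yearly <- data.frame(
  year = c(1974, 1976, 1978, 1982, 1984, 1987, 1988, 1989,
           1990, 1991, 1993, 1994, 1996, 1998, 2000, 2004),
  Avg_Educ_In_Year = c(11.87, 11.83, 12.04, 12.21, 12.48, 12.57, 12.75, 12.84,
                       13.12, 12.90, 13.10, 13.31, 13.44, 13.39, 13.31, 13.75),
  Avg_Vocab_Test = c(6.02, 6.04, 5.96, 5.74, 5.99, 5.69, 5.77, 5.94,
                     6.14, 6.09, 6.03, 6.17, 6.04, 6.13, 6.01, 6.21)
)

# Pearson correlation between average education and average vocabulary score
r_yearly <- cor(yearly$Avg_Educ_In_Year, yearly$Avg_Vocab_Test)
print(round(r_yearly, 2))

# Assumption: with the full data frame loaded as `dsin`, the respondent-level
# analogue would be cor(dsin$education, dsin$vocabulary).
```

A respondent-level correlation on the full data would answer the question more directly, because averaging by year hides the variation between individuals.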

##Break the yearly summary down by gender.
a <- dsin %>%
  group_by(year, sex) %>%
  summarise(cnt = n(),
            edu_sum = sum(education),
            voc_sum = sum(vocabulary),
            avg_edu = round(edu_sum / cnt, 2),
            avg_vocab = round(voc_sum / cnt, 2))
##Female
a%>%filter(sex=="Female")
## # A tibble: 16 x 7
## # Groups:   year [16]
##     year    sex   cnt edu_sum voc_sum avg_edu avg_vocab
##    <int> <fctr> <int>   <int>   <int>   <dbl>     <dbl>
##  1  1974 Female   774    9152    4704   11.82      6.08
##  2  1976 Female   791    9174    4845   11.60      6.13
##  3  1978 Female   861   10151    5183   11.79      6.02
##  4  1982 Female   999   12097    5760   12.11      5.77
##  5  1984 Female   828   10213    4990   12.33      6.03
##  6  1987 Female   955   11852    5459   12.41      5.72
##  7  1988 Female   513    6441    2986   12.56      5.82
##  8  1989 Female   559    7006    3304   12.53      5.91
##  9  1990 Female   481    6232    2930   12.96      6.09
## 10  1991 Female   565    7185    3437   12.72      6.08
## 11  1993 Female   560    7252    3385   12.95      6.04
## 12  1994 Female  1085   14410    6743   13.28      6.21
## 13  1996 Female  1050   13888    6382   13.23      6.08
## 14  1998 Female   760   10073    4719   13.25      6.21
## 15  2000 Female   730    9551    4421   13.08      6.06
## 16  2004 Female   801   10990    5027   13.72      6.28
##Male
a%>%filter(sex=="Male")
## # A tibble: 16 x 7
## # Groups:   year [16]
##     year    sex   cnt edu_sum voc_sum avg_edu avg_vocab
##    <int> <fctr> <int>   <int>   <int>   <dbl>     <dbl>
##  1  1974   Male   672    8019    4007   11.93      5.96
##  2  1976   Male   643    7797    3823   12.13      5.95
##  3  1978   Male   623    7722    3669   12.39      5.89
##  4  1982   Male   724    8943    4132   12.35      5.71
##  5  1984   Male   574    7278    3414   12.68      5.95
##  6  1987   Male   724    9256    4102   12.78      5.67
##  7  1988   Male   407    5292    2319   13.00      5.70
##  8  1989   Male   409    5427    2446   13.27      5.98
##  9  1990   Male   371    4943    2300   13.32      6.20
## 10  1991   Male   396    5214    2416   13.17      6.10
## 11  1993   Male   453    6022    2727   13.29      6.02
## 12  1994   Male   755   10072    4605   13.34      6.10
## 13  1996   Male   816   11185    4888   13.71      5.99
## 14  1998   Male   541    7348    3258   13.58      6.02
## 15  2000   Male   581    7893    3460   13.59      5.96
## 16  2004   Male   637    8782    3904   13.79      6.13
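
The two tables are easier to compare side by side. The sketch below copies the `avg_vocab` columns from the Female and Male tables into one data frame and computes the yearly difference. The names `gap` and `diff` are introduced here for illustration and do not appear in the original analysis.

```r
# avg_vocab values copied from the Female and Male tables above
gap <- data.frame(
  year = c(1974, 1976, 1978, 1982, 1984, 1987, 1988, 1989,
           1990, 1991, 1993, 1994, 1996, 1998, 2000, 2004),
  female = c(6.08, 6.13, 6.02, 5.77, 6.03, 5.72, 5.82, 5.91,
             6.09, 6.08, 6.04, 6.21, 6.08, 6.21, 6.06, 6.28),
  male = c(5.96, 5.95, 5.89, 5.71, 5.95, 5.67, 5.70, 5.98,
           6.20, 6.10, 6.02, 6.10, 5.99, 6.02, 5.96, 6.13)
)

# Positive values mean the female average is higher in that year
gap$diff <- round(gap$female - gap$male, 2)
print(gap)

# Years in which the male average is higher
print(gap$year[gap$diff < 0])

# Number of survey years in which the female average is higher
print(sum(gap$diff > 0))
```

On these rounded averages, women score higher in 13 of the 16 survey years, and men score higher in 1989, 1990, and 1991. The differences are small (at most 0.19 points), so this comparison alone does not show whether they are statistically meaningful.
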
##Plot the average vocabulary score of each gender for every survey year.
ggplot(data = a, aes(x = sex, y = year, color = avg_vocab)) +
  geom_point(size = 5) +  # constant size is set outside aes() so it does not add a legend entry
  coord_flip() +
  scale_color_continuous(low = "red", high = "black", guide = "colourbar") +
  ggtitle("Who failed?") +
  xlab("Gender") + ylab("Years")


References

Dataset

https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/car/Vocab.csv

Analysis

https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html

https://www.rstudio.com/resources/cheatsheets/

https://mef-bda503.github.io/files/02_Tidyverse.html#1

Graphs

https://plot.ly/ggplot2/

http://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html

http://ggplot2.tidyverse.org/

https://stackoverflow.com/questions/3777174/plotting-two-variables-as-lines-using-ggplot2-on-the-same-graph