General Instructions

Your take-home final consists of three parts. The first part contains short questions whose answers may include code, brief comments, or direct responses. The second part concerns your group projects: you are asked to contribute to your project report with an additional analysis supported by two or three visualizations. The third part is about gathering real-life data and analyzing it.

Here are significant points that you should read carefully.

Part I: Short and Simple (20 pts)

The purpose of this part is to gauge your understanding of data manipulation, visualization, and the data science workflow in general. Most questions have no single correct answer, and some have no good answer at all. It is possible to write many pages on the questions below, but please keep it short. Constrain your answers to one or two paragraphs (7–8 lines at most).

  1. What is your opinion about two y-axis graphs? Do you use it at work? Is it a good practice, a necessary evil, or plain horrible? See Hadley Wickham’s point (and other discussion in the topic) before making your argument (https://stackoverflow.com/a/3101876/3608936). See an example of two y-axis graph on https://mef-bda503.github.io/gpj-rjunkies/files/project/index.html#comparing__of_accidents___of_departures

Before going into details, I would like to point out that graphs with dual-scaled axes can summarize a lot of information in one figure, but they can also confuse the audience. I have not used them at work; I would rather use two separate graphs.

According to the discussion on Stack Overflow, dual-axis graphs have some fundamental problems. The first is that the two series usually do not share the same unit of measure, which invites misleading comparisons between the two categories. The second concerns bar graphs: bars are good for comparisons, but two sets of bars in different units make the comparison meaningless. The final problem is that the two scales (and bin sizes) can be chosen almost arbitrarily, which can also mislead the audience.
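A common alternative to a dual-axis chart is to stack the two series as separate facets that share the x-axis but keep their own y-scales. The sketch below uses made-up monthly data (the column names `accidents` and `departures` are only illustrative, echoing the linked project example):

```r
library(ggplot2)
library(tidyr)

# Hypothetical monthly data: two series with different units of measure
df <- data.frame(
  month      = 1:12,
  accidents  = c(30, 28, 35, 40, 38, 42, 45, 44, 39, 37, 33, 31),
  departures = c(1200, 1150, 1300, 1400, 1380, 1500,
                 1550, 1520, 1450, 1400, 1320, 1250)
)

# Long format, then one panel per measure with free y-scales:
# each series keeps its own axis instead of sharing one plot area
df_long <- pivot_longer(df, c(accidents, departures),
                        names_to = "measure", values_to = "value")

ggplot(df_long, aes(month, value)) +
  geom_line() +
  facet_wrap(~ measure, ncol = 1, scales = "free_y")
```

The shared x-axis keeps the time alignment readable while avoiding the arbitrary rescaling that dual axes require.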

  2. What is your exploratory data analysis workflow? Suppose you are given a data set and a research question. Where do you start? How do you proceed? For instance, you are given the task to distribute funds from donations to public welfare projects in a wide range of subjects (e.g. education, gender equality, poverty, job creation, healthcare) with the objective of maximum positive impact on the society in general. Assume you have almost all the data you require. How do you measure impact? How do you form performance measures? What makes you think you find an interesting angle?

Would you present an argument for a policy that you are more inclined to (e.g. suppose you are more inclined to allocate budget to fix gender inequality than affordable healthcare) or would you just present what data says? In other words, would the (honest) title of your presentation be “Gender Inequality - The Most Important Social Problem Backed by Data” or “Pain Points in Our Society and Optimal Budget Allocation”?

In my opinion, an exploratory data analysis workflow should be as follows.

If I had all the necessary data for the welfare projects, I would start by analyzing the data to gain insight into each subject. I would try to estimate each subject's weighted importance in society, its needs, and its current and previous budgets, and finally I would try to answer this question: if a particular subject received more funds, how would that affect society? After the analysis, I would try to come up with the best budget allocation, in which each subject benefits according to its needs and its importance to society.
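The allocation idea above can be sketched with a toy scoring rule. All numbers here are assumed weights invented for illustration, not real measurements, and the product-of-weights score is just one possible choice:

```r
# Hypothetical subjects with assumed impact and unmet-need weights
projects <- data.frame(
  subject    = c("education", "gender_equality", "poverty",
                 "job_creation", "healthcare"),
  importance = c(0.9, 0.8, 0.85, 0.7, 0.95),  # assumed societal importance
  need       = c(0.6, 0.9, 0.8, 0.5, 0.7)     # assumed unmet need
)

budget <- 1e6  # total donation budget (hypothetical)

# Score each subject and allocate the budget proportionally to the scores
projects$score      <- projects$importance * projects$need
projects$allocation <- budget * projects$score / sum(projects$score)
```

A real analysis would replace the assumed weights with measured performance indicators, but the proportional-allocation step would look much the same.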

  3. What are the differences between time series and non time series data in terms of analysis, modeling and validation? In other words what makes Bitcoin price movements analysis different from diamonds (or carat) data set?

Time series data comes as a sequence over particular time periods or intervals and lets us see trends over time. Unlike non-time-series data, the next value in a time series is affected by the earlier values in the data set.

In time series analysis, the main goal is to predict the future from past data. In non-time-series data, however, the entire data set is available at once; there is no notion of past and future, and the concept of time is irrelevant.
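The validation difference can be made concrete with a toy series. With cross-sectional data (like diamonds), a random train/test split is fine; with a serially dependent series (like Bitcoin prices), the split must respect time order, as in this sketch:

```r
set.seed(1)
y <- cumsum(rnorm(100))  # toy "price" series with serial dependence

# Non-time-series data can be held out at random...
random_test <- sample(seq_along(y), 20)

# ...but a time series must be split in time order; otherwise the
# model would be trained on the future and validated on the past
train <- y[1:80]
test  <- y[81:100]
```

The same logic extends to cross-validation: time series use rolling or expanding windows rather than random folds.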

  4. If you had to plot a single graph using the data below what would it be? Why? Make your argument, actually code the plot and provide the output. (You can find detailed info about the movies data set in its help file. Use ?movies, after you load ggplot2movies package.)

After checking the summary of the movies data set, the first question in my mind was the number of movies per year between 1893 and 2005. To visualize this, I used the ggplot2 library and wrote the simple bar-chart code below:

library(ggplot2movies)
library(tidyverse)  # loads ggplot2, dplyr, and friends

# ?movies  # see the help file for details on the data set

# Number of movies released per year, 1893-2005
ggplot(movies, aes(year)) +
  geom_bar() +
  labs(x = "Year", y = "Number of movies")

Part II: Extending Your Group Project (30 pts)

In this part you are going to extend your group project with an additional analysis supported by some visualizations. You are tasked with finding the best improvement on top of your group project. About one page is enough, two pages at most.

In our project we used four tables for our data analysis. One of them, called "The International Dataset", shows infant mortality and life expectancy. It also contains information on mortality rates for children under five years old. I will analyze those rates and compare them with our previous findings.

The graphs below show that these mortality rates are higher than the infant-only rates, since they cover both infants and children up to five years old.

Mortality rates for children under five are higher for boys than for girls.

library(ggplot2)  # qplot() comes from ggplot2
mortality <- read.csv(file = "mortality_life_expectancy.csv", header = TRUE)

# Under-5 mortality: male vs. female rates
qplot(x = mortality_rate_under5_male, y = mortality_rate_under5_female,
      data = mortality)

The graph below shows mortality rates over the years. It has a trend similar to the infant mortality graph. The notable difference is that in the late 1950s the under-5 mortality rate was nearly twice the infant mortality rate. However, the predictions show that after 2020 the under-5 rates come closer to the infant mortality rates. Both genders' rates decrease sharply after 1980.

library(dplyr)
library(dygraphs)
meanle <- mortality %>%
  group_by(year) %>%
  summarise(le = sum(mortality_rate_under5),
            lem = sum(mortality_rate_under5_male),
            lef = sum(mortality_rate_under5_female),
            n = n()) %>%
  mutate(mean_mortality_rate_under5 = le / n,
         mean_mortality_rate_under5_male = lem / n,
         mean_mortality_rate_under5_female = lef / n) %>%
  arrange(year) %>%
  select(year, mean_mortality_rate_under5, mean_mortality_rate_under5_male,
         mean_mortality_rate_under5_female)
meanle$year<-as.numeric(meanle$year)

lifem <- dygraph(meanle, main = "Mean under-5 mortality rate",
                 xlab = "Year",
                 ylab = "Mortality rate under 5 (per 1,000 live births)") %>%
  dyRangeSelector()
lifem

In the graph below, we can see Turkey's data, starting from 1980. The average under-5 mortality rate in Turkey is smaller than the world average. The predictions show that by 2040 the average under-5 mortality rate will be 15.47 (per 1,000) worldwide, but 9.54 for Turkey.

mortality_turkey <- filter(mortality, country_name == "Turkey")
meanle <- mortality_turkey %>%
  group_by(year) %>%
  summarise(le = sum(mortality_rate_under5),
            lem = sum(mortality_rate_under5_male),
            lef = sum(mortality_rate_under5_female),
            n = n()) %>%
  mutate(mean_mortality_rate_under5 = le / n,
         mean_mortality_rate_under5_male = lem / n,
         mean_mortality_rate_under5_female = lef / n) %>%
  arrange(year) %>%
  select(year, mean_mortality_rate_under5, mean_mortality_rate_under5_male,
         mean_mortality_rate_under5_female)

meanle$year<-as.numeric(meanle$year)

lifem <- dygraph(meanle, main = "Mean under-5 mortality rate in Turkey",
                 xlab = "Year",
                 ylab = "Mortality rate under 5 (per 1,000 live births)") %>%
  dyRangeSelector()
lifem

Part III: Welcome to Real Life (50 pts)

As all of you know well enough, real-life data is not readily available and it is messy. In this part, you are going to gather data from the Higher Education Council's (YÖK) data service. You can use all the data provided on https://istatistik.yok.gov.tr/ . Take some time to see what is offered in the data sets. Choose an interesting theme that can be analyzed with the given data and collect the relevant data from the service. Some example themes can be as follows.

  1. Gather the data, bind them together and save in an .RData file. Make .RData file available online for everybody. Provide the data link in your analysis. You can work together with your friends to provide one comprehensive .RData file if it is more convenient to you. (You don’t need to report any code in this part.)
  2. Perform EDA on the data you collected based on the theme you decided on. Keep it short. One to two pages is enough, three pages tops. If you are interested and want to keep going, write a data blog post about it. I will not grade it but I can share it on social media.

The data set below was retrieved from https://istatistik.yok.gov.tr/ . It contains the number of students at each university in Turkey in the 2016-2017 period, categorized into four groups: undergraduate, graduate, master and phd.

library(tidyverse)
library(ggplot2)
load("yok.Rdata")

Here, you can see total number of students at each university according to their degree-level for each city in Turkey.

yok %>%
  group_by(city) %>%
  summarise(count = n(),
            undergraduate_sum = sum(undergraduate_sum),
            graduate_sum = sum(graduate_sum),
            master_sum = sum(master_sum),
            phd_sum = sum(phd_sum))
## # A tibble: 81 x 6
##        city count undergraduate_sum graduate_sum master_sum phd_sum
##       <chr> <int>             <dbl>        <dbl>      <dbl>   <dbl>
##  1    ADANA     4             15067        32304       6463    1563
##  2 ADIYAMAN     2             10359         9445        597      61
##  3    AFYON     3             17042        24613       3464     310
##  4     AĞRI     2              3056         7105        727       0
##  5  AKSARAY     2              5963        13941       2501     130
##  6   AMASYA     3             11020         5217        684      14
##  7   ANKARA    27             24905       196338      60729   20973
##  8  ANTALYA     7             31421        38844       5385    1026
##  9  ARDAHAN     2              1204         3606        189      30
## 10   ARTVİN     2              5369         3525        395      18
## # ... with 71 more rows

Here you can see the total number of students (above 10,000) at each university in Ankara.

yok_ankara <- yok %>% filter(city == "ANKARA")
ggplot(data = yok_ankara %>% filter(overall_sum > 10000),
       aes(x = university_name, y = overall_sum)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90))

Below is the distribution of students of all degrees, categorized by university in Ankara:

ggplot(data = yok_ankara, aes(x = "", y = overall_sum, fill = university_name)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", direction = -1) +
  theme_void()

library(ggplot2)
theme_set(theme_bw())

ggplot(yok_ankara, aes(x=university_name, y=graduate_sum)) + 
  geom_bar(stat="identity", width=0.5, fill="tomato3") + 
  labs(title="Number of graduate degree students in Ankara") + 
  theme(axis.text.x = element_text(angle=90, vjust=0.6))

library(ggplot2)
theme_set(theme_classic())

g <- ggplot(yok_ankara %>% select(university_name,type2), aes(university_name))
g + geom_bar(aes(fill=type2), width = 0.5) + 
  theme(axis.text.x = element_text(angle=90, vjust=0.6)) + 
  labs(title="Education types in Ankara universities")

The data set (yok.RData) can be found here: https://github.com/MEF-BDA503/pj-yigithakan/tree/master/files

The HTML output of the final exam can be found here (as of 10.01.2018): https://mef-bda503.github.io/pj-yigithakan/