Your take-home final consists of three parts. The first part consists of some simple questions, whose answers might involve coding, brief comments or direct answers. The second part is about your group projects: you are asked to contribute an additional analysis, with two or three visualizations, to your project report. The third part is about gathering real-life data and conducting an analysis on it.
Here are significant points that you should read carefully.
The purpose of this part is to gauge your understanding of data manipulation, visualization and the data science workflow in general. Most questions have no single correct answer, and some don’t have good answers at all. It is possible to write many pages on the questions below, but please keep it short. Limit your answers to one or two paragraphs (7-8 lines at most).
What is your opinion about graphs with two y-axes? Do you use them at work? Are they good practice, a necessary evil, or plain horrible? See Hadley Wickham’s point (and the other discussion in the thread) before making your argument (https://stackoverflow.com/a/3101876/3608936). See an example of a two y-axis graph at https://mef-bda503.github.io/gpj-rjunkies/files/project/index.html#comparing__of_accidents___of_departures
What is your exploratory data analysis workflow? Suppose you are given a data set and a research question. Where do you start? How do you proceed? For instance, you are given the task of distributing funds from donations to public welfare projects in a wide range of subjects (e.g. education, gender equality, poverty, job creation, healthcare) with the objective of maximum positive impact on society in general. Assume you have almost all the data you require. How do you measure impact? How do you form performance measures? What makes you think you have found an interesting angle?
Would you present an argument for a policy that you are more inclined to (e.g. suppose you are more inclined to allocate budget to fix gender inequality than affordable healthcare) or would you just present what data says? In other words, would the (honest) title of your presentation be “Gender Inequality - The Most Important Social Problem Backed by Data” or “Pain Points in Our Society and Optimal Budget Allocation”?
What are the differences between time series and non-time-series data in terms of analysis, modeling and validation? In other words, what makes an analysis of Bitcoin price movements different from an analysis of the diamonds (or carat) data set?
If you had to plot a single graph using the data below, what would it be? Why? Make your argument, actually code the plot and provide the output. (You can find detailed info about the movies data set in its help file. Use ?movies after you load the ggplot2movies package.)
library(ggplot2movies)
movies
## # A tibble: 58,788 x 24
##    title  year leng… budg… rati… votes    r1    r2    r3    r4    r5    r6
##    <chr> <int> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 $      1971   121    NA  6.40   348  4.50  4.50  4.50  4.50  14.5  24.5
##  2 $100…  1939    71    NA  6.00    20     0  14.5  4.50  24.5  14.5  14.5
##  3 $21 …  1941     7    NA  8.20     5     0     0     0     0     0  24.5
##  4 $40,…  1996    70    NA  8.20     6  14.5     0     0     0     0     0
##  5 $50,…  1975    71    NA  3.40    17  24.5  4.50     0  14.5  14.5  4.50
##  6 $pent  2000    91    NA  4.30    45  4.50  4.50  4.50  14.5  14.5  14.5
##  7 $win…  2002    93    NA  5.30   200  4.50     0  4.50  4.50  24.5  24.5
##  8 '15'   2002    25    NA  6.70    24  4.50  4.50  4.50  4.50  4.50  14.5
##  9 '38    1987    97    NA  6.60    18  4.50  4.50  4.50     0     0     0
## 10 '49-…  1917    61    NA  6.00    51  4.50     0  4.50  4.50  4.50  44.5
## # ... with 58,778 more rows, and 12 more variables: r7 <dbl>, r8 <dbl>, r9
## # <dbl>, r10 <dbl>, mpaa <chr>, Action <int>, Animation <int>, Comedy
## # <int>, Drama <int>, Documentary <int>, Romance <int>, Short <int>
In this part you are going to extend your group project with an additional analysis supported by some visualizations. You are tasked with finding the best improvement on top of your group project. About one page is enough, two pages at most.
As all of you know well enough, real-life data is not readily available and it is messy. In this part, you are going to gather data from the Higher Education Council’s (YÖK) data service. You can use all the data provided on https://istatistik.yok.gov.tr/ . Take some time to see what is offered in the data sets. Choose an interesting theme that can be analyzed with the given data and collect the relevant data from the service. Some example themes are as follows.