General Instructions

Your take-home final consists of 3 parts. The first part contains short questions; these may require coding, brief comments, or direct answers. The second part asks you to contribute to your group project report with an additional analysis supported by two or three visualizations. The third part is about gathering real-life data and conducting an analysis on it.

Here are significant points that you should read carefully.

Your final starts on January 6, 2018, 11:00 and ends on January 9, 2018, 11:00. Late submissions are accepted until January 9, 2018, 23:59 (penalty: -25 points). Your main submission will be through Blackboard, not email. Please refrain from posting to your progress journals until January 10, 2018; after that, it is appreciated. You will submit an RMarkdown-generated pdf file: a single pdf containing all 3 parts. All work should be individual and original. (Single exception: for the data gathering in Part 3, you can work together and refer to the same RData file.) Instructor support will be minimal. I will try to answer technically ambiguous points, but I will generally not respond to consulting questions (e.g. “Am I doing it ok?” You probably are, given your overall performance.). Questions are designed to measure your opinions, and I don’t want to color your perspective.

Questions

Part I: Short and Simple (20 pts)

The purpose of this part is to gauge your grasp of data manipulation, visualization, and the data science workflow in general. Most questions have no single correct answer; some don’t have good answers at all. It is possible to write many pages on the questions below, but please keep it short. Constrain your answers to one or two paragraphs (7-8 lines tops).

1 - What is your opinion about two y-axis graphs? Do you use them at work? Are they good practice, a necessary evil, or plain horrible? See Hadley Wickham’s point (and other discussion on the topic) before making your argument (https://stackoverflow.com/a/3101876/3608936). See an example of a two y-axis graph at https://mef-bda503.github.io/gpj-rjunkies/files/project/index.html#comparing__of_accidents___of_departures

2 - What is your exploratory data analysis workflow? Suppose you are given a data set and a research question. Where do you start? How do you proceed? For instance, you are given the task of distributing funds from donations to public welfare projects in a wide range of subjects (e.g. education, gender equality, poverty, job creation, healthcare) with the objective of maximum positive impact on society in general. Assume you have almost all the data you require. How do you measure impact? How do you form performance measures? What makes you think you have found an interesting angle?

Would you present an argument for a policy that you are more inclined to (e.g. suppose you are more inclined to allocate budget to fixing gender inequality than to affordable healthcare), or would you just present what the data says? In other words, would the (honest) title of your presentation be “Gender Inequality - The Most Important Social Problem, Backed by Data” or “Pain Points in Our Society and Optimal Budget Allocation”?

3 - What are the differences between time series and non-time-series data in terms of analysis, modeling, and validation? In other words, what makes analyzing Bitcoin price movements different from analyzing the diamonds (carat) data set?

4 - If you had to plot a single graph using the data below, what would it be? Why? Make your argument, actually code the plot, and provide the output. (You can find detailed info about the movies data set in its help file; use ?movies after you load the ggplot2movies package.)

Part II: Extending Your Group Project (30 pts)

In this part you are going to extend your group project with an additional analysis supported by some visualizations. You are tasked with finding the best improvement on top of your group project. About one page is enough, two pages tops.

Part III: Welcome to Real Life (50 pts)

As all of you know well enough, real-life data is not readily available and it is messy. In this part, you are going to gather data from the Higher Education Council’s (YÖK) data service. You can use all the data provided on https://istatistik.yok.gov.tr/ . Take some time to see what is offered in the data sets. Choose an interesting theme that can be analyzed with the given data and collect the relevant data from the service. Some example themes are as follows.

Gender disparity in the academic faculty.
Change in the number of people in different academic positions over the years.
Professor/student ratios.
Capacities of different departments.
Comparative undergraduate / graduate student populations.
Number of foreign students/professors and where they come from.

a - Gather the data, bind it together, and save it in an .RData file. Make the .RData file available online for everybody. Provide the data link in your analysis. You can work together with your friends to provide one comprehensive .RData file if that is more convenient for you. (You don’t need to report any code in this part.)

b - Perform EDA on the data you collected based on the theme you decided on. Keep it short. One to two pages is enough, three pages tops. If you are interested and want to keep going, write a data blog post about it. I will not grade it but I can share it on social media.

Part I: Short and Simple

Answer-1

A two y-axis graph can be useful in some cases, but when drawn inappropriately it can easily mislead. First, it should only be used for data sets with different units; otherwise, for a bar graph for example, the relative heights of the bars can no longer be used to compare what the axes describe. Second, line graphs suit two y-axis plots better than bar graphs, because bars are designed for magnitude comparisons while lines compare overall patterns rather than magnitudes. Finally, it is suitable for interval measurement scales. At work we use a two y-axis graph to emphasize the relation between the number of feedback messages we receive and our service-level performance.
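For context, ggplot2 only allows a secondary axis that is a one-to-one transformation of the primary one, which is Wickham’s point in the linked discussion. A minimal sketch on the built-in economics data (the factor of 1000 is an arbitrary rescaling chosen for illustration):

library(ggplot2)

# unemploy is in thousands of persons; psavert is a percentage, rescaled
# by an arbitrary factor of 1000 so both series share the primary axis.
ggplot(economics, aes(x = date)) +
  geom_line(aes(y = unemploy)) +
  geom_line(aes(y = psavert * 1000), colour = "blue") +
  scale_y_continuous(
    name = "Unemployed (thousands)",
    sec.axis = sec_axis(~ . / 1000, name = "Personal savings rate (%)")
  )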

Answer-2

My EDA workflow:

1. Investigate general information about the data.
2. Learn more about the notions in the data, for instance poverty, unemployment, etc.
3. Generate questions about the data, e.g. which values are the most common or the rarest.
4. Answer those questions by visualising, transforming, and modelling.
5. Refine the questions according to the answers.

For example, for gender inequality, indicators such as school attendance, labour force participation rate, and bachelor’s degree attainment can be analysed by gender. With suitable key performance indicators formed for all subjects, I would decide according to what the data says.
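As a minimal sketch of that first pass (df and category are hypothetical placeholders for the data set and one of its categorical columns):

library(dplyr)

glimpse(df)                           # dimensions and column types
summary(df)                           # ranges, obvious anomalies, NAs
df %>% count(category, sort = TRUE)   # most common and rarest values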

Answer-3

The main difference is that Bitcoin prices are indexed by time: each observation depends on the ones before it (autocorrelation, trends, changing volatility), whereas a diamond’s price is determined mostly by its features (carat, cut, clarity), and the observations can be treated as independent. This changes modeling (time series models versus ordinary regression) and, above all, validation: with time series you cannot randomly shuffle observations into train and test sets, because that would leak future information into training; you must train on the past and test on the future.
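One concrete consequence for validation, sketched below: cross-sectional data such as diamonds can be split randomly, while time-ordered data must be trained on the past and tested on the future (prices is a hypothetical data frame of Bitcoin prices sorted by date):

library(ggplot2)  # for the diamonds data set

# Cross-sectional data: a random train/test split is valid.
set.seed(1)
idx <- sample(nrow(diamonds), size = floor(0.8 * nrow(diamonds)))
train_diamonds <- diamonds[idx, ]
test_diamonds  <- diamonds[-idx, ]

# Time series: shuffling would leak future information into training.
# train_prices <- prices[1:800, ]    # the past
# test_prices  <- prices[801:1000, ] # the future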

Answer-4

I would draw a box plot with genre on the x-axis and rating on the y-axis. The reason I prefer this chart is that it conveniently summarises the ratings of each genre with the median, the quartiles, and the minimum and maximum in a single view.
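A minimal sketch of that plot, reshaping the 0/1 genre indicator columns of movies into long format first:

library(ggplot2)
library(ggplot2movies)
library(dplyr)
library(tidyr)

# Gather the genre flags (Action ... Short) and keep flagged rows only;
# a movie with several genres contributes to several boxes.
movies_long <- movies %>%
  gather(key = "genre", value = "flag", Action:Short) %>%
  filter(flag == 1)

ggplot(movies_long, aes(x = genre, y = rating)) +
  geom_boxplot() +
  labs(x = "Genre", y = "IMDB rating",
       title = "Distribution of ratings by genre")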

Part II: Extending Your Group Project

In our project we performed basket analysis aisle by aisle. Here I extend the group project by conducting a basket analysis across the yogurt and fresh fruit aisles together. The reason for choosing these aisles is that they had the third highest confidence value, and since the first two pairs are fresh fruits with fresh vegetables and with packaged fruits and vegetables, such a high confidence level between yogurt and fresh fruits is surprising.

Number of Ordered Products from the Yogurt Aisle

As is clear below, the 15 most ordered products from the yogurt aisle are different kinds of milk, with the exception of butter. There may be a strong confidence relation between kinds of milk and the bestselling fresh fruits.
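A hypothetical sketch of how such a top-15 count can be produced (orders, aisle, and product_name are assumed names; the actual objects come from the group project and are not shown here):

library(dplyr)

top_yogurt <- orders %>%
  filter(aisle == "yogurt") %>%        # hypothetical aisle label
  count(product_name, sort = TRUE) %>%
  head(15)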

Number of Ordered Products from the Fresh Fruit Aisle

Banana is the most ordered product not only in the fresh fruit aisle but also across all orders. This is likely to result in high confidence levels between kinds of milk, bananas, and strawberries.

I conduct a market basket analysis to look for combinations of products from the yogurt and fresh fruit aisles that occur together frequently in transactions. I set the support threshold to 0.0005 because I want to keep only product pairs sold together at least 787 times out of the 1,575,417 transactions (the absolute minimum support count reported below). In addition, I set the confidence threshold to 0.05, since I want to show products bought together with more than a 5% chance.
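For reference, a sketch of the apriori call reconstructed from the parameter specification printed below (the order_product_pivot transactions object is built earlier in the group project):

library(arules)

rules <- apriori(order_product_pivot,
                 parameter = list(supp = 5e-04, conf = 0.05,
                                  minlen = 2, maxlen = 2))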

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.05    0.1    1 none FALSE            TRUE       5   5e-04      2
##  maxlen target   ext
##       2  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 787 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[1405 item(s), 1575417 transaction(s)] done [0.86s].
## sorting and recoding items ... [337 item(s)] done [0.03s].
## creating transaction tree ... done [1.45s].
## checking subsets of size 1 2
## Warning in apriori(order_product_pivot, parameter = list(supp = 5e-04, conf
## = 0.05, : Mining stopped (maxlen reached). Only patterns up to a length of
## 2 returned!
##  done [0.06s].
## writing ... [176 rule(s)] done [0.00s].
## creating S4 object  ... done [0.27s].
Apriori Basket Analysis

As seen in the table, there are 176 product relations with a confidence level greater than 0.05 and a support level greater than 0.0005. Under the rhs column, not surprisingly, banana (24852) is the product most often bought together with others (appearing in 53 rules), because it is the product most added to carts. The strongest relations of banana (items added to the cart together with it) are, under lhs, 41787, 20842, 18253, 33754, and 40571: respectively Bartlett Pears, Greek Yogurt, Fruit Punch Roarin’ Waters, 2% with Strawberry Lowfat Greek Strained Yogurt, and Total 2% Greek Strained Yogurt with Cherry. These item sets have confidence levels above 15%, meaning that given the lhs item, banana is added to the cart more than 15% of the time, and lift values above 1, meaning the co-occurrence is more frequent than chance.

What is the most popular item set?

{21137 Organic Strawberries} -> {13176 Bag of Organic Bananas}
{13176 Bag of Organic Bananas} -> {21137 Organic Strawberries}
{47209 Organic Hass Avocado} -> {13176 Bag of Organic Bananas}
{13176 Bag of Organic Bananas} -> {47209 Organic Hass Avocado}

With support values between 0.009 and 0.008, these are the most popular item sets, and all of their items come from the fresh fruit aisle. Item sets spanning different aisles are not as popular as those from a single aisle; support is mostly higher for item sets whose two items are both from the fruit aisle.

According to lift values, the top 3 item sets are as follows:

lhs                                rhs                                  lift
{8309} Strawberry Yogurt           {28465} Blueberry Non-fat Yogurt     25.049
{28465} Blueberry Non-fat Yogurt   {36865} Raspberry Yogurt             21.223
{36865} Raspberry Yogurt           {24799} Vanilla Skyr Nonfat Yogurt   19.004
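A sketch of how these item sets can be read off and visualized, assuming the rules object mined above:

library(arules)
library(arulesViz)

inspect(sort(rules, by = "support")[1:4])  # most popular item sets
inspect(sort(rules, by = "lift")[1:3])     # top 3 rules by lift

# Graph of the 10 rules with the highest lift values.
plot(sort(rules, by = "lift")[1:10], method = "graph")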


The graph shows the 10 rules with the highest lift values. Color corresponds to lift and size corresponds to support: darker colors indicate higher lift values and bigger nodes indicate higher support values.

Part III: Welcome to Real Life

Answer-a

In this part I tidy the downloaded data. Because the original headers are not in a convenient format, I skip the first three rows and assign my own header names. Some numeric columns were read as character, so I convert them to numeric in order to compute proportions and, finally, show the proportion of female academicians and students by degree.

library(readxl)
library(dplyr)
library(tidyr)
library(magrittr)
library(ggplot2)

# Column names for the academician file: male (_m), female (_f), total (_t) per position.
header_a <- c("univercity", "prof_m", "prof_f", "prof_t", "docent_m" , "docent_f" , "docent_t", "assistant_prof_m" , "assistant_prof_f" , "assistant_prof_t" , "teaching_assistant_m" , "teaching_assistant_f" , "teaching_assistant_t" , "lecturer_m" , "lecturer_f" , "lecturer_t" , "specialist_m" , "specialist_f" , "specialist_t" , "research_fellow_m" , "research_fellow_f" , "research_fellow_t" , "translator_m" , "translator_f" , "translator_t" , "academic_planner_m" , "academic_planner_f" , "academic_planner_t" , "grandtotal_academician_m" , "grandtotal_academician_f1","grandtotal_academician_f" , "grandtotal_academician")

# The first three rows hold the original headers; skip them and use ours.
academician <- read_excel('academician.xls', skip = 3, col_names = header_a)
academician$grandtotal_academician_f1 <- NULL  # drop duplicated total column

# All count columns (2:31) were read as character; convert them to numeric.
academician[2:31] <- lapply(academician[2:31], as.numeric)

head(academician)
saveRDS(academician, "Academician_stat_2016")

# Same procedure for the student file.
header_s <- c("univercity", "type", "city", "education_time", "two_year_degree_m", "two_year_degree_f", "two_year_degree_t","undergraduate_m","undergraduate_f","undergraduate_t","postgraduate_m", "postgraduate_f", "postgraduate_t", "doctoral_m","doctoral_f","doctoral_t","grandtotal_student_m","grandtotal_student_f","grandtotal_student")
student <- read_excel('student.xls', skip = 3, col_names = header_s)

# Count columns (5:19) were read as character; convert them to numeric.
student[5:19] <- lapply(student[5:19], as.numeric)

head(student)
saveRDS(student, "student_number_2016_2017")

Data Join Process

I join the two data sets on the univercity column so that the wider, combined information can be used in the analysis.

academician_student <- left_join(academician, student, by = "univercity")
# Convert the remaining character columns (type, city, etc.) to factors.
academician_student <- academician_student %>% mutate_if(is.character, as.factor)
saveRDS(academician_student, "Academician_and_student_stat_2016")

Answer-b

General Information

The academic staff data belongs to 2016, whereas the student data belongs to the 2016-2017 academic year. I divide the number of female academicians by the total in each position to obtain the female ratio per academic degree.

#glimpse(academician)
f_academician_ratio <- academician_student %>%
  transmute(univercity, type, city,
            female_prof_ratio = round(prof_f / prof_t, 2),
            female_docent_ratio = round(docent_f / docent_t, 2),
            female_assistant_prof_ratio = round(assistant_prof_f / assistant_prof_t, 2),
            female_teaching_assistant_ratio = round(teaching_assistant_f / teaching_assistant_t, 2),
            female_lecturer_ratio = round(lecturer_f / lecturer_t, 2),
            female_specialist_ratio = round(specialist_f / specialist_t, 2),
            female_research_fellow_ratio = round(research_fellow_f / research_fellow_t, 2),
            female_ratio_total = round(grandtotal_academician_f / grandtotal_academician, 2))

# The "TOPLAM" (grand total) row holds the country-wide counts.
ratio_of_female_academician <- f_academician_ratio %>% filter(univercity == "TOPLAM")

ratio_of_female_academician <- ratio_of_female_academician %>%
  gather("female_prof_ratio":"female_ratio_total", key = "degree", value = "female_proportion")

ratio_of_female_academician$degree %<>% factor

# Reorder the factor so the bars appear sorted by proportion.
ggplot(data = ratio_of_female_academician,
       aes(x = reorder(degree, female_proportion), y = female_proportion)) +
  geom_bar(fill = "#00008B", stat = "identity") +
  scale_x_discrete(name = "type_of_academician") +
  scale_y_continuous(name = "proportion") +
  theme(axis.text.x = element_text(angle = 90, size = 8),
        axis.title = element_text(size = 10),
        plot.title = element_text(size = 14),
        axis.text.y = element_text(size = 12)) +
  ggtitle('Proportion of female academician')

As the graph makes clear, even though the female professor ratio is below 40% and the female assistant professor ratio is below 45%, the overall female ratio among academicians is almost 50%. In my opinion there is not sufficient evidence to claim a gender disparity.

f_student_ratio <- student %>%
  transmute(univercity, type, city,
            f_two_year_degree_ratio = round(two_year_degree_f / two_year_degree_t, 2),
            f_undergraduate_ratio = round(undergraduate_f / undergraduate_t, 2),
            f_postgraduate_ratio = round(postgraduate_f / postgraduate_t, 2),
            f_doctoral_ratio = round(doctoral_f / doctoral_t, 2),
            f_total_ratio = round(grandtotal_student_f / grandtotal_student, 2))

ratio_of_female_student <- f_student_ratio %>% filter(univercity == "TOPLAM")

ratio_of_female_student <- ratio_of_female_student %>%
  gather("f_two_year_degree_ratio":"f_total_ratio", key = "degree", value = "female_proportion")

ratio_of_female_student$degree %<>% factor

# Reorder the factor so the bars appear sorted by proportion.
ggplot(data = ratio_of_female_student,
       aes(x = reorder(degree, female_proportion), y = female_proportion)) +
  geom_bar(fill = "#00008B", stat = "identity") +
  scale_x_discrete(name = "degree_of_student") +
  scale_y_continuous(name = "proportion") +
  theme(axis.text.x = element_text(angle = 90, size = 8),
        axis.title = element_text(size = 10),
        plot.title = element_text(size = 14),
        axis.text.y = element_text(size = 12)) +
  ggtitle('Proportion of female student')

As with the academician ratios, according to the graph there is not sufficient evidence to claim a gender disparity among students either.