General Instructions

Your take-home final consists of three parts. The first part consists of some simple questions and their answers. These questions might require coding, brief comments or direct answers. The second part is about your group projects: you are asked to contribute an additional analysis with two or three visualizations to your project report. The third part is about gathering real-life data and analyzing it.

Here are significant points that you should read carefully.

Part I: Short and Simple (20 pts)

The purpose of this part is to gauge your understanding of data manipulation, visualization and the data science workflow in general. Most questions have no single correct answer; some don’t have good answers at all. It is possible to write many pages on the questions below, but please keep it short. Constrain your answers to one or two paragraphs (7-8 lines tops).

  1. What is your opinion about two y-axis graphs? Do you use them at work? Are they a good practice, a necessary evil, or plain horrible? See Hadley Wickham’s point (and the other discussion in the topic) before making your argument (https://stackoverflow.com/a/3101876/3608936). See an example of a two y-axis graph on https://mef-bda503.github.io/gpj-rjunkies/files/project/index.html#comparing__of_accidents___of_departures

Answer to Question 1

As someone who creates graphs from market research results for internal customers of my company's market research department, I am not a fan of two y-axis graphs; I already have to explain even the basics of the graphs in our market research reports. If I used two y-axis graphs, readers would find them very hard to understand. Not everybody can read a two y-axis graph or is comfortable with numbers. In addition, the main aim of a graph is to tell the story at a single glance; if it leaves the reader confused, it has failed. On the other hand, if I have two different measures that represent two groups, I may want to show them in the same graph for comparison, since graphs help us understand the story through comparison. However, two y-axis graphs do not help here either, as they may lead to misunderstanding. There may be a need for them in an academic paper where space is limited, and in that case the readers are likely to be able to read them. Still, I do not agree that a two y-axis graph is good practice. Simplicity is best.

  2. What is your exploratory data analysis workflow? Suppose you are given a data set and a research question. Where do you start? How do you proceed? For instance, you are given the task to distribute funds from donations to public welfare projects in a wide range of subjects (e.g. education, gender equality, poverty, job creation, healthcare) with the objective of maximum positive impact on the society in general. Assume you have almost all the data you require. How do you measure impact? How do you form performance measures? What makes you think you find an interesting angle?

    Would you present an argument for a policy that you are more inclined to (e.g. suppose you are more inclined to allocate budget to fix gender inequality than affordable healthcare) or would you just present what the data says? In other words, would the (honest) title of your presentation be “Gender Inequality - The Most Important Social Problem Backed by Data” or “Pain Points in Our Society and Optimal Budget Allocation”?

Answer to Question 2

When I have a data set and a research question, the first thing I do is acquire some domain knowledge. I do desk research as quickly as I can, then check previous research results, indicators and patterns. Next I look at the data set itself: its size, the column names and contents, the data types and the null values. I clean the data, then manipulate and organize it for the analyses and models I plan to build. I do basic exploratory analysis (five-number summaries and exploratory plotting: boxplots, scatter plots, outlier detection, comparisons between groups, pairwise plots, correlations), and then move on to more advanced analysis. For the funding task, I would analyze the current situation of the country and try to quantify the social data by finding metrics for welfare. I might look at best practices from the countries with the highest welfare standards, or use the Social Progress Index as Michael Porter defines it. I would also determine minimum living standards. After making all the comparisons, I would identify the weak points in these areas and distribute the funds to projects that address them. After the investment, I would track the metrics and make before-and-after comparisons. I would just present what the data says, without adding my personal comments, and I would title the presentation “Pain Points in Our Society and Optimal Budget Allocation”. In addition to this research, I would provide extra information from previous and secondary sources.
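The first-look steps described above (size, column types, null values, summaries, outliers) might look like the following in base R. This is only a minimal sketch: the built-in `airquality` data set is a stand-in for whatever data is actually given, and the 1.5 × IQR rule is one common outlier convention, not the only one.

```r
# First-look checks on a data frame; airquality is only a stand-in data set.
df <- airquality

dim(df)                                    # number of rows and columns
str(df)                                    # column names and data types
missing_per_column <- colSums(is.na(df))   # null values per column
print(missing_per_column)
summary(df$Ozone)                          # five-number summary, mean and NA count

# Flag potential outliers with the 1.5 * IQR rule
ozone <- df$Ozone[!is.na(df$Ozone)]
q <- quantile(ozone, c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])
outliers <- ozone[ozone < q[1] - fence | ozone > q[2] + fence]
print(outliers)
```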

  3. What are the differences between time series and non time series data in terms of analysis, modeling and validation? In other words what makes Bitcoin price movements analysis different from diamonds (or carat) data set?

Answer to Question 3

Time series data is a flow of information that shows changes across periods. Empty values may not be a big problem in time series data, because we can infer the trend from neighbouring periods; this is one difference from non time series data. In time series analysis we try to forecast future observations, whereas non time series data has no past or future. If we want to plot the relationship between two variables, either variable can go on the x axis or the y axis; in a time series plot, however, time should go on the x axis for readability. In modeling and analysis, we often split the data into a training set, used to fit the model, and a test set, used to evaluate it. For non time series data such as diamonds, we can do this by random sampling, because we assume that the observations are independent of each other. This is not true for time series data: the observations are linked by their order, so we cannot validate a model for Bitcoin prices by randomly selecting observations for validation or testing. Instead, we must split the data in a way that respects the temporal order in which the values were observed, selecting consecutive subgroups (windows). For example, with daily data we fit the model on the earlier days and validate its predictions on the days that follow, then extend the window and repeat.
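The window-based validation described above can be sketched as a rolling-origin (expanding window) split. The helper `rolling_origin_splits` is a hypothetical function written for this illustration, the series is a simulated random walk rather than real Bitcoin prices, and the "model" is just a naive last-value forecast.

```r
# Rolling-origin splits: training indices always precede test indices.
# `rolling_origin_splits` is a hypothetical helper, not a library function.
rolling_origin_splits <- function(n, initial, horizon) {
  starts <- seq(initial, n - horizon, by = horizon)
  lapply(starts, function(end_train) {
    list(train = seq_len(end_train),
         test  = (end_train + 1):(end_train + horizon))
  })
}

set.seed(1)
prices <- cumsum(rnorm(20))   # toy random-walk series, not real price data
splits <- rolling_origin_splits(length(prices), initial = 10, horizon = 5)

for (s in splits) {
  # Naive forecast: carry the last training value forward
  forecast <- rep(prices[max(s$train)], length(s$test))
  mae <- mean(abs(prices[s$test] - forecast))
  cat("train 1..", max(s$train), ", test ", min(s$test), "..", max(s$test),
      ", MAE = ", round(mae, 3), "\n", sep = "")
}
```

Each fold only ever evaluates on observations that come after its training window, which is the property a random split would violate.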

  4. If you had to plot a single graph using the data below what would it be? Why? Make your argument, actually code the plot and provide the output. (You can find detailed info about the movies data set in its help file. Use ?movies, after you load ggplot2movies package.)

Answer to Question 4

library(dplyr)
library(ggplot2movies)   #provides the movies data set
library(corrplot)

movie4 <- movies %>%
  filter(votes > 30 & length > 20) %>%                              #dropping rarely voted and very short movies
  mutate(normyear = (year - min(year)) / (max(year) - min(year)),   #normalizing the year
         popularity = log(votes))                                   #log of votes as the popularity measure

corrofdimensions <- select(movie4, normyear, length, popularity, rating)
corrplot(cor(corrofdimensions))

We evaluate movies based on their IMDb ratings. Therefore, the first thing I wondered about in this data set was why and how a movie becomes popular, and which features of a movie affect its popularity. Thus, I wanted to do a correlation analysis of the features. First, after looking at the summaries of these columns, I kept only movies with more than 30 votes and a length above 20 minutes. I normalized the year feature because its values have a much larger magnitude than the other features (even the minimum year is large). I took the logarithm of the votes because in such systems the voting frequency for movies tends to grow exponentially; the logarithm of the votes is labeled popularity. No strong correlation was found between the features.
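One point worth checking about the two transformations: min-max normalization is a linear rescaling, so it leaves a Pearson correlation unchanged, whereas the logarithm is non-linear and does change it. The sketch below illustrates this on simulated values; the vectors are made up and only mimic the shape of the `year`, `votes` and `rating` columns.

```r
set.seed(42)
# Simulated stand-ins for the movies columns (not the real data)
year   <- sample(1900:2005, 200, replace = TRUE)
votes  <- round(exp(rnorm(200, mean = 5, sd = 1.5))) + 1
rating <- runif(200, 1, 10)

normyear <- (year - min(year)) / (max(year) - min(year))

# Linear rescaling: the correlation with rating is the same before and after
same_after_scaling <- isTRUE(all.equal(cor(year, rating), cor(normyear, rating)))
print(same_after_scaling)

# Non-linear transform: these two correlations generally differ
print(cor(votes, rating))
print(cor(log(votes), rating))
```

So the normalization mainly helps readability and comparability of scales, while the log transform is the step that actually affects the correlation plot.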

Part II: Extending Your Group Project (30 pts)

In this part you are going to extend your group project with an additional analysis supported by some visualizations. You are tasked with finding the best improvement on the top of your group project. About one page is enough, two pages tops.

Answer to Part II

In the group project, we tried to make use of every feature in the data set. We did our best to analyze the data set and explain the story behind it. For this question, I took another look at the data set and tried to find a different angle on it. My aim was to improve and change the visuals.

ggplot(SalesStore, aes(x = IsHoliday, y = Weekly_Sales)) +
  geom_boxplot()+
  scale_x_discrete(name = "Holiday or not ") +
  scale_y_continuous(name = "Weekly Sales",
                     limits=c(-5000, 700000))+ 
  ggtitle("Weekly Sales based on Holidays")

I drew a boxplot to see whether there is a large difference in sales between holiday weeks and other weeks. The boxes are narrow, which means the weekly sales of the stores are close to each other, and the two boxes look similar, so typical sales do not differ much between holiday and non-holiday weeks. The highest sales peaks, however, occur in holiday weeks.

features$Fuel_Price<-as.numeric(as.vector(features$Fuel_Price))

features_temp_m <- features %>% group_by(Store) %>% 
  summarise(meanfuel=mean(Fuel_Price))%>%arrange(desc(meanfuel))  
topstores<-SalesStore%>% group_by(Store) %>% 
  summarise(meansales=mean(Weekly_Sales))%>%arrange(desc(meansales))  

u<-left_join(features_temp_m, topstores, by = "Store")

ggplot(u, aes(x=meanfuel, y=meansales)) +
  geom_point()+ geom_smooth(method='lm')+
  scale_x_continuous(name = "Fuel Prices") +   #meanfuel is numeric, so the x scale must be continuous
  scale_y_continuous(name = "Weekly Sales")+ 
  ggtitle("Relationship between Fuel Prices and Sales")

As with the other features, there was no strong relationship between fuel prices and sales. It would have been better to draw scatter plots of the relationship between each feature and sales first; then we would not have needed the detailed analysis of the features.
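A quick screening of every feature against sales, as suggested above, could be done in one pass before any detailed analysis. The sketch below uses simulated per-store summaries; `feature_b` and `feature_c` are hypothetical placeholder columns, and only `meanfuel` and `meansales` correspond to names used in the code above.

```r
set.seed(7)
# Simulated per-store summary table (toy values, not the project data)
store_summary <- data.frame(
  meansales = rnorm(40, mean = 15000, sd = 5000),
  meanfuel  = rnorm(40, mean = 3.3, sd = 0.1),
  feature_b = rnorm(40),   # hypothetical placeholder feature
  feature_c = rnorm(40)    # hypothetical placeholder feature
)

# Correlation of each feature with mean sales
feature_cols <- setdiff(names(store_summary), "meansales")
cors <- sapply(store_summary[feature_cols],
               function(x) cor(x, store_summary$meansales))

# Rank features by absolute correlation to see which deserve a closer look
print(round(sort(abs(cors), decreasing = TRUE), 2))
```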

Eventually, unlabelled departments and insufficient features prevented us from looking at the data from different perspectives and from drawing inferences. I came to a conclusion for upcoming projects: if I have to select a data set, I will examine its features in detail. In addition, if I have real-life data that I collect myself, I will try to add internal and external conditions to my data set and do feature engineering.

Part III: Welcome to Real Life (50 pts)

As all of you know well enough, real-life data is not readily available and it is messy. In this part, you are going to gather data from the Higher Education Council’s data service. You can use all the data provided on https://istatistik.yok.gov.tr/ . Take some time to see what is offered in the data sets. Choose an interesting theme which can be analyzed with the given data and collect relevant data from the service. Some example themes can be as follows.

  1. Gather the data, bind them together and save in an .RData file. Make .RData file available online for everybody. Provide the data link in your analysis. You can work together with your friends to provide one comprehensive .RData file if it is more convenient to you. (You don’t need to report any code in this part.)
  2. Perform EDA on the data you collected based on the theme you decided on. Keep it short. One to two pages is enough, three pages tops. If you are interested and want to keep going, write a data blog post about it. I will not grade it but I can share it on social media.

Answer to Part III

I chose the data set of the number of foreign students by nationality in order to look into their distribution across universities. Here is the data set I used.

data16 <- read.csv2("Uyrugagore201516.csv", header = TRUE) 
#labeling the education period of 2015-2016 as the year of 2016
data17 <- read.csv2("Uyrugagore201617.csv", header = TRUE) 
#labeling the education period of 2016-2017 as the year of 2017
summary(data16)  #looking at the data
##                 UNIVERSITE.ADI UNIVERSITE.TURU       IL.ADI    
##  ISTANBUL UNIVERSITESI : 136         :   1     ISTANBUL :1773  
##  ANKARA UNIVERSITESI   : 125   DEVLET:4292     ANKARA   : 683  
##  MARMARA UNIVERSITESI  : 119   VAKIF :1568     IZMIR    : 407  
##  HACETTEPE UNIVERSITESI: 111                   ESKISEHIR: 192  
##  ANADOLU UNIVERSITESI  : 110                   KONYA    : 160  
##  GAZI UNIVERSITESI     : 107                   KOCAELI  : 117  
##  (Other)               :5153                   (Other)  :2529  
##                           UYRUK          ERKEK         
##  AZERBAYCAN CUMHURIYETI      : 133   Min.   :    0.00  
##  SURIYE ARAP CUMHURIYETI     : 128   1st Qu.:    1.00  
##  TURKMENISTAN                : 124   Median :    2.00  
##  IRAN ISLAM CUMHURIYETI      : 110   Mean   :   18.72  
##  IRAK CUMHURIYETI            : 109   3rd Qu.:    5.00  
##  AFGANISTAN ISLAM CUMHURIYETI: 107   Max.   :54855.00  
##  (Other)                     :5150                     
##      KADIN               TOPLAM       
##  Min.   :    0.000   Min.   :    1.0  
##  1st Qu.:    0.000   1st Qu.:    1.0  
##  Median :    1.000   Median :    3.0  
##  Mean   :    8.683   Mean   :   27.4  
##  3rd Qu.:    3.000   3rd Qu.:    8.0  
##  Max.   :25445.000   Max.   :80300.0  
## 
summary(data17) #checking if the two files have the same column names to join them.
#deleting useless columns
data17<-data17 %>% select (-c(X,X.1,X.2,X.3, X.4, X.5, X.6, X.7, X.8, X.9 ))
data17<-data17 %>% mutate (YEAR=2017)   #adding the year column to the data set
data16<-data16 %>% mutate (YEAR=2016)
data17<-data17 %>% as_tibble()   #tbl_df() is deprecated in current dplyr
data16<-data16 %>% as_tibble()
d_All<-rbind(data16,data17) #merging the two files
#changing the Turkish column names to English
d_All<-d_All %>% rename("NAME"="UNIVERSITE.ADI","TYPE"="UNIVERSITE.TURU",
                        "CITY"="IL.ADI","NATIONALITY"="UYRUK","FEMALE"="KADIN", "MALE"="ERKEK", "TOTAL"="TOPLAM")
#looking at the data
glimpse(d_All)
## Observations: 13,113
## Variables: 8
## $ NAME        <fctr> ABANT IZZET BAYSAL UNIVERSITESI, ABANT IZZET BAYS...
## $ TYPE        <fctr> DEVLET, DEVLET, DEVLET, DEVLET, DEVLET, DEVLET, D...
## $ CITY        <fctr> BOLU, BOLU, BOLU, BOLU, BOLU, BOLU, BOLU, BOLU, B...
## $ NATIONALITY <fctr> AFGANISTAN ISLAM CUMHURIYETI, ALMANYA FEDERAL CUM...
## $ MALE        <int> 21, 1, 0, 1, 5, 2, 1, 0, 1, 1, 4, 1, 0, 2, 0, 4, 1...
## $ FEMALE      <int> 14, 3, 2, 4, 4, 0, 0, 1, 1, 0, 4, 1, 1, 0, 1, 0, 1...
## $ TOTAL       <int> 35, 4, 2, 5, 9, 2, 1, 1, 2, 1, 8, 2, 1, 2, 1, 4, 2...
## $ YEAR        <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 20...

There are 13,113 observations and 8 variables.

which(is.na.data.frame(d_All)) #checking null values
## integer(0)
d_All<-d_All %>% filter(NAME!="TOPLAM") #removing the TOPLAM (total) summary rows
type<- d_All %>%
  group_by(TYPE)%>% summarise(TOTALNUM=sum(TOTAL))

typec<-d_All %>%count(TYPE)
typeL<-left_join(type, typec, by = "TYPE")
typeL<-typeL %>% mutate(Students_per_University=TOTALNUM/n)
typeL<-typeL %>% rename("Number of Universities"="n") %>% kable()
print (typeL)
## 
## 
## TYPE      TOTALNUM   Number of Universities   Students_per_University
## -------  ---------  -----------------------  ------------------------
## DEVLET      157025                     9514                 16.504625
## VAKIF        29745                     3597                  8.269391

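There are 157,025 foreign students in state (DEVLET) universities, about five times as many as the 29,745 in foundation (VAKIF, i.e. private) universities. Note that count(TYPE) counts rows, that is, university-nationality records, rather than distinct universities, so the column labelled Number of Universities is really a number of records. Per record, state universities have on average about twice as many foreign students as foundation universities (16.5 versus 8.3).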
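To get students per actual university rather than per record, the universities can be counted with `n_distinct(NAME)`. The sketch below shows the difference on a small made-up table in the same shape as `d_All`; the university names and counts are invented for illustration.

```r
library(dplyr)

# Toy rows shaped like d_All: one row per university-nationality pair
toy <- tibble(
  NAME  = c("UNI A", "UNI A", "UNI A", "UNI B", "UNI C", "UNI C"),
  TYPE  = c("DEVLET", "DEVLET", "DEVLET", "DEVLET", "VAKIF", "VAKIF"),
  TOTAL = c(10, 5, 3, 6, 4, 2)
)

per_type <- toy %>%
  group_by(TYPE) %>%
  summarise(
    students     = sum(TOTAL),
    records      = n(),                # what count(TYPE) returns
    universities = n_distinct(NAME),   # distinct universities per type
    students_per_university = students / universities
  )
print(per_type)
```

On `d_All` itself, grouping by `YEAR` as well as `TYPE` would keep the two academic years from being summed together.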

per<- d_All %>%
  group_by(YEAR) %>% 
  summarise(NumberofStudents=sum(TOTAL),NumberofFemaleStudents=sum(FEMALE),NumberofMaleStudents=sum(MALE))%>%
  mutate(Percantageofmalestudents=(NumberofMaleStudents/NumberofStudents)*100) %>% kable()
print(per)
## 
## 
##  YEAR   NumberofStudents   NumberofFemaleStudents   NumberofMaleStudents   Percantageofmalestudents
## -----  -----------------  -----------------------  ---------------------  -------------------------
##  2016              80300                    25445                  54855                   68.31258
##  2017             106470                    35625                  70845                   66.53987

There are more foreign students in 2017 than in 2016 (106,470 versus 80,300, an increase of about 33%). About 68% of the students are male in 2016, and the share is almost the same in 2017 (about 67%).

d_MALE<-d_All %>% select(NAME,TYPE,CITY,NATIONALITY,MALE,YEAR)
d_FEMALE<-d_All %>%select(NAME,TYPE,CITY,NATIONALITY,FEMALE,YEAR)
#creating a new data set to plot the analyses based on gender
d_MALE<-d_MALE %>% mutate (GENDER="MALE")
d_FEMALE<-d_FEMALE %>% mutate (GENDER="FEMALE")

d_MALE<-d_MALE %>% rename("NUMBEROFSTUDENTS"="MALE")
d_FEMALE<-d_FEMALE %>% rename("NUMBEROFSTUDENTS"="FEMALE")

d_All_GENDER<-rbind(d_MALE,d_FEMALE)
d_All_GENDER %>%
  group_by(YEAR, GENDER) %>%
  summarise(NUMBEROFSTUDENTS=sum(NUMBEROFSTUDENTS)) %>%   #summing first so that each bar shows the yearly total
  ggplot(aes(x=YEAR, y=NUMBEROFSTUDENTS, fill=GENDER)) +
  geom_bar(stat="identity", position=position_dodge())+
  scale_fill_brewer(palette="Paired")+
  scale_x_continuous(breaks = c(2016,2017)) +
  xlab("Years") + ylab("Number of Students") +
  ggtitle("Number of Students based on Gender") +
  theme_minimal()

This plot visualizes the table above. It is clear that the number of male students is higher than the number of female students in both years.
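The wide-to-long reshaping done above with two selects, two renames and an `rbind` can also be written in a single step. The sketch below assumes the tidyr package is available (it is not used elsewhere in this report) and runs on a small toy table whose values are illustrative only.

```r
library(dplyr)
library(tidyr)   # assumption: tidyr is installed; it provides pivot_longer()

# Toy rows shaped like d_All (values are illustrative only)
toy <- tibble(
  NAME   = c("UNI A", "UNI B"),
  YEAR   = c(2016, 2016),
  MALE   = c(21, 5),
  FEMALE = c(14, 4)
)

# One row per university, year and gender
toy_long <- toy %>%
  pivot_longer(cols = c(MALE, FEMALE),
               names_to = "GENDER",
               values_to = "NUMBEROFSTUDENTS")
print(toy_long)
```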

Universities with the Most Foreign Students

UNI16<-d_All %>% filter(YEAR == "2016")%>% select(-NATIONALITY)
TopUni16<- UNI16 %>% group_by(NAME) %>% summarise(TotalNumbersOfStudents2016=sum(TOTAL))%>%
  arrange(desc(TotalNumbersOfStudents2016))%>%slice(1:10)
print(TopUni16)
## # A tibble: 10 x 2
##    NAME                         TotalNumbersOfStudents2016
##    <fctr>                                            <int>
##  1 ISTANBUL UNIVERSITESI                              5553
##  2 ANADOLU UNIVERSITESI                               3286
##  3 SAKARYA UNIVERSITESI                               2900
##  4 GAZIANTEP UNIVERSITESI                             2382
##  5 ANKARA UNIVERSITESI                                2198
##  6 MARMARA UNIVERSITESI                               2179
##  7 USAK UNIVERSITESI                                  2146
##  8 ISTANBUL TEKNIK UNIVERSITESI                       2131
##  9 ULUDAG UNIVERSITESI                                2096
## 10 GAZI UNIVERSITESI                                  2034
UNI17<-d_All %>% filter(YEAR == "2017")%>% select(-NATIONALITY)
TopUni17<- UNI17 %>% group_by(NAME) %>% summarise(TotalNumbersOfStudents2017=sum(TOTAL))%>%
  arrange(desc(TotalNumbersOfStudents2017))%>%slice(1:10)

print(TopUni17)
## # A tibble: 10 x 2
##    NAME                        TotalNumbersOfStudents2017
##    <fctr>                                           <int>
##  1 ISTANBUL UNIVERSITESI                             7661
##  2 ANADOLU UNIVERSITESI                              4770
##  3 GAZIANTEP UNIVERSITESI                            3444
##  4 ULUDAG UNIVERSITESI                               3201
##  5 SAKARYA UNIVERSITESI                              3153
##  6 ISTANBUL AYDIN UNIVERSITESI                       2835
##  7 MARMARA UNIVERSITESI                              2638
##  8 ANKARA UNIVERSITESI                               2518
##  9 ONDOKUZ MAYIS UNIVERSITESI                        2462
## 10 SELCUK UNIVERSITESI                               2433

Istanbul University has the most foreign students in 2016, followed by Anadolu University and Sakarya University. The top two are unchanged in 2017, but the rest of the order changes: Gaziantep University moves up to third place and Uludag University to fourth, while Sakarya University drops to fifth.
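The two top-ten lists can also be compared programmatically instead of by eye. The sketch below simply retypes the university names from the two printed tables above; in the actual analysis they could be taken from `TopUni16$NAME` and `TopUni17$NAME`.

```r
# University names copied from the two printed top-ten tables
top16 <- c("ISTANBUL UNIVERSITESI", "ANADOLU UNIVERSITESI", "SAKARYA UNIVERSITESI",
           "GAZIANTEP UNIVERSITESI", "ANKARA UNIVERSITESI", "MARMARA UNIVERSITESI",
           "USAK UNIVERSITESI", "ISTANBUL TEKNIK UNIVERSITESI",
           "ULUDAG UNIVERSITESI", "GAZI UNIVERSITESI")
top17 <- c("ISTANBUL UNIVERSITESI", "ANADOLU UNIVERSITESI", "GAZIANTEP UNIVERSITESI",
           "ULUDAG UNIVERSITESI", "SAKARYA UNIVERSITESI", "ISTANBUL AYDIN UNIVERSITESI",
           "MARMARA UNIVERSITESI", "ANKARA UNIVERSITESI",
           "ONDOKUZ MAYIS UNIVERSITESI", "SELCUK UNIVERSITESI")

entered <- setdiff(top17, top16)   # new in the 2017 top ten
left    <- setdiff(top16, top17)   # dropped out after 2016

# Rank in 2017 next to rank in 2016 (NA means not in the 2016 top ten)
rank_change <- data.frame(NAME = top17,
                          rank17 = seq_along(top17),
                          rank16 = match(top17, top16))
print(entered)
print(left)
print(rank_change)
```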

Number of Foreign Students by Nationality in 2016 & 2017

NA16<-d_All %>% filter(YEAR == "2016")%>% group_by(NATIONALITY) %>%
  select(-NAME,-TYPE,-CITY,-MALE,-FEMALE)
NA_16<-NA16%>% summarise(TOTALNUMSOFSTUDENTSATUNI=sum(TOTAL))%>%
  arrange(desc(TOTALNUMSOFSTUDENTSATUNI))%>%slice(1:10)

print(NA_16)
## # A tibble: 10 x 2
##    NATIONALITY                  TOTALNUMSOFSTUDENTSATUNI
##    <fctr>                                          <int>
##  1 AZERBAYCAN CUMHURIYETI                          11967
##  2 TURKMENISTAN                                     9456
##  3 SURIYE ARAP CUMHURIYETI                          9170
##  4 IRAN ISLAM CUMHURIYETI                           5563
##  5 AFGANISTAN ISLAM CUMHURIYETI                     4080
##  6 IRAK CUMHURIYETI                                 3925
##  7 YUNANISTAN CUMHURIYETI                           1959
##  8 KIRGIZ CUMHURIYETI                               1750
##  9 KAZAKISTAN CUMHURIYETI                           1736
## 10 LIBYA DEVLETI                                    1357
NA17<-d_All %>% filter(YEAR == "2017")%>% group_by(NATIONALITY) %>%
  select(-NAME,-TYPE,-CITY,-MALE,-FEMALE)
NA_17<-NA17%>% summarise(TOTALNUMSOFSTUDENTSATUNI=sum(TOTAL))%>%
  arrange(desc(TOTALNUMSOFSTUDENTSATUNI))%>%slice(1:10)

print(NA_17)
## # A tibble: 10 x 2
##    NATIONALITY                  TOTALNUMSOFSTUDENTSATUNI
##    <fctr>                                          <int>
##  1 SURIYE ARAP CUMHURIYETI                         14870
##  2 AZERBAYCAN CUMHURIYETI                          14859
##  3 TURKMENISTAN                                    10409
##  4 IRAN ISLAM CUMHURIYETI                           6067
##  5 AFGANISTAN ISLAM CUMHURIYETI                     5237
##  6 IRAK CUMHURIYETI                                 4664
##  7 ALMANYA FEDERAL CUMHURIYETI                      3705
##  8 YUNANISTAN CUMHURIYETI                           2279
##  9 BULGARISTAN CUMHURIYETI                          2027
## 10 KIRGIZ CUMHURIYETI                               2020

In 2016 the biggest foreign student community is from the Republic of Azerbaijan, followed by Turkmenistan and Syria. The same three countries make up the top three in 2017, but Syria moves to first place: the number of Syrian students rises dramatically, from 9,170 to 14,870 (about 62%).
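The year-over-year change per nationality can be computed directly from the two tables. The sketch below retypes the counts of the six nationalities that lead both printed lists; the shortened vector names are abbreviations introduced here for readability.

```r
# Counts copied from the printed 2016 and 2017 tables (names abbreviated)
n16 <- c(AZERBAYCAN = 11967, TURKMENISTAN = 9456, SURIYE = 9170,
         IRAN = 5563, AFGANISTAN = 4080, IRAK = 3925)
n17 <- c(AZERBAYCAN = 14859, TURKMENISTAN = 10409, SURIYE = 14870,
         IRAN = 6067, AFGANISTAN = 5237, IRAK = 4664)

# Percentage change from 2016 to 2017, largest increase first
pct_change <- round((n17 - n16) / n16 * 100, 1)
print(sort(pct_change, decreasing = TRUE))
```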

Number of Foreign Students in Istanbul

ist<-d_All_GENDER %>% filter(CITY == "ISTANBUL")%>% group_by(YEAR)%>% 
  summarise(TOTAList=sum(NUMBEROFSTUDENTS))
tot<-d_All_GENDER %>% group_by(YEAR)%>% summarise(TOTAL=sum(NUMBEROFSTUDENTS))
ty<-inner_join(ist, tot, by = "YEAR")%>%mutate(RATIO=(TOTAList/TOTAL)*100) %>% kable() 
print(ty)
## 
## 
##  YEAR   TOTAList    TOTAL      RATIO
## -----  ---------  -------  ---------
##  2016      21559    80300   26.84807
##  2017      29207   106470   27.43214

Universities in Istanbul host about 27% of the foreign students in 2016. The share barely changes from 2016 to 2017 (26.8% versus 27.4%), even though the absolute number rises from 21,559 to 29,207. Let’s show this in a plot.

res<-d_All_GENDER %>% filter(CITY == "ISTANBUL")%>% group_by(YEAR)%>%  summarise(TOTAL=sum(NUMBEROFSTUDENTS))%>%mutate(KIND="ISTANBUL")

re<-d_All_GENDER %>% group_by(YEAR)%>% 
  summarise(TOTAL=sum(NUMBEROFSTUDENTS))%>%mutate(KIND="ALL")
ress <- union_all(res,re)

ress %>%
  ggplot(aes(x=YEAR,y=TOTAL,fill=KIND))+
  geom_bar(stat="identity", position=position_dodge())+
  scale_x_continuous(breaks = c(2016,2017)) +
  xlab("Years") + ylab("Number of Students") +
  ggtitle("Number of Students in Istanbul & Turkey")