Question 1 : What is your opinion about two y-axis graphs? Do you use it at work? Is it a good practice, a necessary evil, or plain horrible?
My answer : I have not used dual-scaled axis graphs in my work or projects because they look incomprehensible to me. Looking at the dual-scaled axis graphs on “rjunkies”, the first graph is understandable, but the second is awful and hard to follow. Why I prefer not to use them: it is hard to separate the lines and bars and to interpret them, and most people do not understand them at first glance. It is important to keep graphs simple and clear for presentations and projects, especially when presenting to third parties who do not know the content of your work.
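A common alternative to a dual y-axis is to give each series its own panel. A minimal sketch with hypothetical monthly data (the `revenue` and `temp` columns are made up for illustration):

```r
# Instead of a dual y-axis, reshape to long format and facet with free scales.
library(ggplot2)
library(tidyr)

df <- data.frame(
  month   = 1:12,
  revenue = c(10, 12, 11, 15, 14, 16, 18, 17, 19, 21, 20, 22),  # hypothetical, in $M
  temp    = c(5, 7, 10, 14, 18, 22, 25, 24, 20, 15, 9, 6)       # hypothetical, in Celsius
)

# One row per (month, series) pair
long <- pivot_longer(df, c(revenue, temp),
                     names_to = "series", values_to = "value")

ggplot(long, aes(month, value)) +
  geom_line() +
  facet_wrap(~ series, ncol = 1, scales = "free_y")
```

With `scales = "free_y"` each panel keeps its own natural scale, so neither series has to be distorted to share an axis.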
Question 2 : What is your exploratory data analysis workflow? Suppose you are given a data set and a research question. Where do you start? How do you proceed? For instance, you are given the task to distribute funds from donations to public welfare projects in a wide range of subjects (e.g. education, gender equality, poverty, job creation, healthcare) with the objective of maximum positive impact on the society in general. Assume you have almost all the data you require. How do you measure impact? How do you form performance measures? What makes you think you find an interesting angle?
My answer 2 : My workflow is as below:

1. Import the data into R.
2. Check the working directory, create a folder to keep all the work, and set it as the working directory.
3. Inspect the data with `summary()`, `glimpse()`, `describe()`, `names()`, and `attach()` (to work easily with columns).
4. Check the data for NA values (the functions above will give an idea about them); omit the NAs or transform them according to your data (e.g. fill with the mean value if that will not distort the data).
5. For a sample data set about funds and donations, it is better to check unemployment rates, the education/school status of the area, and health rates or recorded diseases to decide where to concentrate your work.
6. After checking the variables separately, group the necessary variables, such as “female” and “male”, and filter according to “education” or “unemployment rate” to assess gender inequality; for this we need `mutate()`, `group_by()`, and `filter()`.
7. Once you have grouped or filtered data, visualize it with ggplot2 to see what it looks like.
8. If we have area/city information, it is better to check the overall distribution of variables such as “education”, “unemployment”, and “healthcare”.
9. At this stage we should look at the correlations between variables to work towards a model.
10. Correlation will direct us to the variables to continue with; then we can run k-means and PCA to see where the data clusters and focus on those groups.
11. The last part is a decision tree, with which we decide where to spend the funds and which weak points to improve.
12. In conclusion, it is always good to explain each step and comment on it with reasons; a conclusion section at the end makes the work effective. When deciding, I check what the data says and the social status of these problems in the country, and I prioritize the most painful parts of society: while society has healthcare problems, you cannot concentrate on education; gender inequality is a big problem in most countries, but if unemployment is also high for men, you cannot decide to spend the funds only on improvements for women.
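The NA-handling and grouping steps above can be sketched with dplyr on a hypothetical regional welfare data set (all column names and values here are assumptions for illustration):

```r
library(dplyr)

# Hypothetical regional welfare data with one missing unemployment value
welfare <- data.frame(
  region       = c("A", "B", "C", "D"),
  gender       = c("female", "male", "female", "male"),
  unemployment = c(0.12, 0.08, NA, 0.15),
  schooling    = c(6.5, 9.1, 7.8, 5.9)
)

# Step 4: handle the NA, here by filling with the column mean
welfare <- welfare %>%
  mutate(unemployment = ifelse(is.na(unemployment),
                               mean(unemployment, na.rm = TRUE),
                               unemployment))

# Steps 6-9: group and summarise to compare candidate focus areas
priority <- welfare %>%
  group_by(gender) %>%
  summarise(avg_unemp  = mean(unemployment),
            avg_school = mean(schooling)) %>%
  arrange(desc(avg_unemp))
```

The resulting `priority` table ranks the groups by average unemployment, which is the kind of comparison that would feed steps 10-12.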
Question 3 : What are the differences between time series and non time series data in terms of analysis, modeling and validation? In other words what makes Bitcoin price movements analysis different from diamonds (or carat) data set?
My answer : Time series analysis: if our data deals with time, the observations are autocorrelated over time, and the trend is important for prediction and forecasting. Values change fast; if you look at “bitcoin” data, the opening/closing and min/max values can be very different within the same day.
Non-time series : the data does not depend on time. In the “diamonds” data, price does not change with time; instead, categorical variables like “cut”, “color”, and “clarity” are correlated with price.
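A minimal base-R sketch of the difference: in a time series the ordering of the observations carries information (autocorrelation), while in a cross-sectional set like diamonds the row order is arbitrary. Shuffling a simulated random-walk “price” destroys its lag-1 autocorrelation:

```r
set.seed(1)
price <- cumsum(rnorm(200)) + 100   # simulated random-walk "price", bitcoin-like

acf_original <- acf(price, plot = FALSE)$acf[2]          # lag-1 autocorrelation
acf_shuffled <- acf(sample(price), plot = FALSE)$acf[2]  # ordering destroyed

# The original series keeps a lag-1 autocorrelation near 1; the shuffled
# one drops toward 0, showing that the time index itself carries information.
```

This is also why validation differs: time series models are validated on future hold-out windows, while a random train/test split is fine for data like diamonds.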
Question 4 : If you had to plot a single graph using the data below what would it be? Why? Make your argument, actually code the plot and provide the output. (You can find detailed info about the movies data set in its help file. Use `?movies` after you load the `ggplot2movies` package.)
My answer :
library(ggplot2movies)
head(movies)
## title year length budget rating votes r1 r2 r3
## 1 $ 1971 121 NA 6.4 348 4.5 4.5 4.5
## 2 $1000 a Touchdown 1939 71 NA 6.0 20 0.0 14.5 4.5
## 3 $21 a Day Once a Month 1941 7 NA 8.2 5 0.0 0.0 0.0
## 4 $40,000 1996 70 NA 8.2 6 14.5 0.0 0.0
## 5 $50,000 Climax Show, The 1975 71 NA 3.4 17 24.5 4.5 0.0
## 6 $pent 2000 91 NA 4.3 45 4.5 4.5 4.5
## r4 r5 r6 r7 r8 r9 r10 mpaa Action Animation Comedy Drama
## 1 4.5 14.5 24.5 24.5 14.5 4.5 4.5 0 0 1 1
## 2 24.5 14.5 14.5 14.5 4.5 4.5 14.5 0 0 1 0
## 3 0.0 0.0 24.5 0.0 44.5 24.5 24.5 0 1 0 0
## 4 0.0 0.0 0.0 0.0 0.0 34.5 45.5 0 0 1 0
## 5 14.5 14.5 4.5 0.0 0.0 0.0 24.5 0 0 0 0
## 6 14.5 14.5 14.5 4.5 4.5 14.5 14.5 0 0 0 1
## Documentary Romance Short
## 1 0 0 0
## 2 0 0 0
## 3 0 0 1
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
I wanted to compare the ratings of Animation movies against the others for movies released after 2000. Ratings are increasing.
library(ggplot2)
ggplot(movies[movies$year > 2000, ], aes(x = factor(year), y = rating, fill = factor(Animation))) +
  geom_violin()
In our project I wanted to go deeper into salary levels, which we did not analyse in some parts, so I have added one more section to see how salary levels change over the years.
We have shown the salary distribution per department, but I wanted to check other factors against salary levels, in order to establish that employees are not leaving because of “salary” by itself; most of them have low salaries except at the management level.
library(tidyverse)
library(ggplot2)
library(dplyr)
library(scales)
library(plotly)
d <- read.csv("HR_comma_sep.csv")
d <- d %>% rename("departments" = "sales") %>% tbl_df()
head(d)
## # A tibble: 6 x 10
## satisfaction_level last_evaluation number_project average_montly_hours
## <dbl> <dbl> <int> <int>
## 1 0.38 0.53 2 157
## 2 0.80 0.86 5 262
## 3 0.11 0.88 7 272
## 4 0.72 0.87 5 223
## 5 0.37 0.52 2 159
## 6 0.41 0.50 2 153
## # ... with 6 more variables: time_spend_company <int>,
## # Work_accident <int>, left <int>, promotion_last_5years <int>,
## # departments <fctr>, salary <fctr>
g <- d %>%
mutate(salary = ordered(salary, c("low", "medium", "high"))) %>%
count(time_spend_company, salary) %>%
group_by(time_spend_company) %>%
mutate(n = n / sum(n))
# stacked bars show the share of each salary band per tenure year
ggplot(g, aes(time_spend_company, n)) +
  geom_col(aes(fill = salary))
According to the chart above, employees with up to 4 years in the company mostly have low salaries, those with 5-7 years are in the medium band, and salaries move into the high band only after about 8 years. The company prefers not to give a high salary to talented, highly motivated people until they have worked there for at least 8 years. This part was missing in our project; it completes the comments on salary and lets us focus on other factors in our analysis of why employees left. “Salary” is not the first factor for the employees who quit, but all of them share the same problem with salary. In conclusion, the company could pay capable employees more to keep them.
I have downloaded the OSYM data on foreign students in universities for 2016 and 2017 from https://istatistik.yok.gov.tr/. I wanted to see the volume of foreign students and which nationalities mostly prefer to study in Turkey.
#Read both files for 2016 and 2017
data1=read.csv2("2016.csv", header = T)
data2=read.csv2("2017.csv", header = T)
data_all = rbind(data1, data2)
data_new = data_all %>%
rename("Year" = "ï..YIL","University_Name" = "UNIVERSITE.ADI", "University_Type" = "UNIVERSITE.TURU", "City_Name" = "IL.ADI","Nationality" = "UYRUK", "Male" = "ERKEK", "Female" = "KADIN", "Total" = "TOPLAM" ) %>%
tbl_df()
attach(data_new)
glimpse(data_new)
## Observations: 13,160
## Variables: 8
## $ Year <int> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016...
## $ University_Name <fctr> ABANT IZZET BAYSAL UNIVERSITESI, ABANT IZZET ...
## $ University_Type <fctr> DEVLET, DEVLET, DEVLET, DEVLET, DEVLET, DEVLE...
## $ City_Name <fctr> BOLU, BOLU, BOLU, BOLU, BOLU, BOLU, BOLU, BOL...
## $ Nationality <fctr> AFGANISTAN ISLAM CUMHURIYETI, ALMANYA FEDERAL...
## $ Male <int> 21, 1, 0, 1, 5, 2, 1, 0, 1, 1, 4, 1, 0, 2, 0, ...
## $ Female <int> 14, 3, 2, 4, 4, 0, 0, 1, 1, 0, 4, 1, 1, 0, 1, ...
## $ Total <int> 35, 4, 2, 5, 9, 2, 1, 1, 2, 1, 8, 2, 1, 2, 1, ...
summary(data_new)
## Year University_Name University_Type
## Min. :2016 ISTANBUL UNIVERSITESI : 299 : 2
## 1st Qu.:2016 ANKARA UNIVERSITESI : 253 DEVLET :9540
## Median :2017 MARMARA UNIVERSITESI : 250 VAKIF :3604
## Mean :2017 HACETTEPE UNIVERSITESI: 230 VAKIF MYO: 14
## 3rd Qu.:2017 ANADOLU UNIVERSITESI : 229
## Max. :2017 ULUDAG UNIVERSITESI : 218
## (Other) :11681
## City_Name Nationality Male
## ISTANBUL :4040 AZERBAYCAN CUMHURIYETI : 281 Min. : 0.00
## ANKARA :1565 SURIYE ARAP CUMHURIYETI : 276 1st Qu.: 1.00
## IZMIR : 880 TURKMENISTAN : 264 Median : 2.00
## ESKISEHIR: 403 IRAN ISLAM CUMHURIYETI : 237 Mean : 19.13
## KONYA : 381 IRAK CUMHURIYETI : 235 3rd Qu.: 5.00
## KOCAELI : 269 AFGANISTAN ISLAM CUMHURIYETI: 232 Max. :70926.00
## (Other) :5622 (Other) :11635
## Female Total
## Min. : 0.0 Min. : 1.00
## 1st Qu.: 0.0 1st Qu.: 1.00
## Median : 1.0 Median : 3.00
## Mean : 9.3 Mean : 28.43
## 3rd Qu.: 3.0 3rd Qu.: 8.00
## Max. :35682.0 Max. :106608.00
##
We check whether the data contains any NA values. Note that `na.omit()` returns a new object, so its result must be assigned back:
which(is.na.data.frame(data_new))
data_new <- na.omit(data_new)
data_last = data_new %>% filter(University_Name!= "TOPLAM")
library(formattable) # provides formattable(), area() and color_tile()
data_last1 = data_last %>%
  count(University_Type) %>%
  mutate(n = n / sum(n), n = percent(n)) %>%
  formattable(list(area(T, 1:2) ~ color_tile("pink", "grey")), align = 'l')
data_last1
University_Type | n |
---|---|
DEVLET | 72.50% |
VAKIF | 27.39% |
VAKIF MYO | 0.11% |
Below we will see the universities that have more than 500 foreign students:
df <- data_last%>% select(Year, University_Name, City_Name, Total) %>%
filter(Total >=500) %>%
group_by(University_Name) %>%
summarise_each(funs(first(na.omit(.)))) %>% arrange(desc(Total))
head(df)
## # A tibble: 6 x 4
## University_Name Year City_Name Total
## <fctr> <int> <fctr> <int>
## 1 GAZIANTEP UNIVERSITESI 2016 GAZIANTEP 1080
## 2 KARABUK UNIVERSITESI 2017 KARABUK 927
## 3 ANADOLU UNIVERSITESI 2016 ESKISEHIR 774
## 4 MERSIN UNIVERSITESI 2017 MERSIN 741
## 5 USAK UNIVERSITESI 2016 USAK 721
## 6 TRAKYA UNIVERSITESI 2016 EDIRNE 683
Gaziantep comes first in 2016 for foreign students, most probably because of the city’s location near the Syrian border.
We will look at a bar graph to see this more clearly:
df$University_Name=substr(df$University_Name,1,10)
ggplot(data=df, aes(x=reorder(University_Name, desc(Total)),y=Total,fill=City_Name))+
theme(axis.text.x = element_text(angle = 90, hjust=1))+
geom_bar(stat="identity",width = 0.95)+
labs(title = "Top Universities with Foreign Students")+
labs(x = "University Name", y = "Count") +
expand_limits(y=10)
Due to Gaziantep’s location, Gaziantep University comes first; most probably these are Syrian citizens. Very interestingly, Karabuk follows Gaziantep.
We will look at the distribution of nationalities across cities to see the volume in each city:
df2 <- data_last %>% select(Year, City_Name ,Nationality, Total) %>%
filter(Total >=500) %>%
group_by(Nationality) %>% group_by(City_Name) %>%
summarise_each(funs(first(na.omit(.)))) %>% group_by(Nationality) %>%
arrange(desc(Total))
head(df2)
## # A tibble: 6 x 4
## # Groups: Nationality [3]
## City_Name Year Nationality Total
## <fctr> <int> <fctr> <int>
## 1 GAZIANTEP 2016 SURIYE ARAP CUMHURIYETI 1080
## 2 KARABUK 2017 SURIYE ARAP CUMHURIYETI 927
## 3 ESKISEHIR 2016 AZERBAYCAN CUMHURIYETI 774
## 4 MERSIN 2017 SURIYE ARAP CUMHURIYETI 741
## 5 USAK 2016 AZERBAYCAN CUMHURIYETI 721
## 6 EDIRNE 2016 YUNANISTAN CUMHURIYETI 683
df2$Nationality=substr(df2$Nationality,1,10)
ggplot(data=df2, aes(x=reorder(Nationality, desc(Total)),y=Total,fill=City_Name))+
theme(axis.text.x = element_text(angle = 90, hjust=1))+
geom_bar(stat="identity",width = 0.95)+
labs(title = "Nationalities & Cities")+
labs(x = "Universities", y = "Count") +
expand_limits(y=10)
Selecting Nationalities
df3 <- data_last %>% select(Year, University_Name, Nationality, City_Name, Total) %>%
filter(Total >=400) %>%
group_by(University_Name) %>%
summarise_each(funs(first(na.omit(.)))) %>% arrange(desc(Total))
## `summarise_each()` is deprecated.
## Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
## To map `funs` over all variables, use `summarise_all()`
head(df3)
## # A tibble: 6 x 5
## University_Name Year Nationality
## <fctr> <int> <fctr>
## 1 GAZIANTEP UNIVERSITESI 2016 SURIYE ARAP CUMHURIYETI
## 2 ANADOLU UNIVERSITESI 2016 AZERBAYCAN CUMHURIYETI
## 3 USAK UNIVERSITESI 2016 AZERBAYCAN CUMHURIYETI
## 4 TRAKYA UNIVERSITESI 2016 YUNANISTAN CUMHURIYETI
## 5 KAHRAMANMARAS SUTCU IMAM UNIVERSITESI 2017 SURIYE ARAP CUMHURIYETI
## 6 ISTANBUL UNIVERSITESI 2016 AZERBAYCAN CUMHURIYETI
## # ... with 2 more variables: City_Name <fctr>, Total <int>
I wanted to see the dispersion of nationalities across universities:
df3$University_Name=substr(df3$University_Name,1,16)
ggplot(data=df3, aes(x=reorder(University_Name, desc(Nationality)),y=Total,fill=Nationality))+
theme(axis.text.x = element_text(angle = 90, hjust=1))+
geom_bar(stat="identity",width = 0.95)+
labs(title = "Top 6 Nationalities")+
labs(x = "Universities", y = "Count") +
expand_limits(y=10)
Conclusion : The chart above shows that Syrian students are mostly in the southern part of Turkey, and the same holds for most nationalities, with some exceptions such as the Iranian and Azerbaijani students. I did not expect so many students from Azerbaijan. In general the distribution is normal according to location, but some cities and universities hosting foreign students, such as Karabuk and Usak, are interesting. Syrian and Azerbaijani students are at the top.
My .RData file is available at the link below:
https://github.com/MEF-BDA503/pj-MeryemKemerci/blob/master/OSYM_Final_Exam_Data.RData