Question 1 : What is your opinion about two y-axis graphs? Do you use it at work? Is it a good practice, a necessary evil, or plain horrible?
My answer : I have not used dual-scaled axis graphs in my work or projects because they look incomprehensible to me. Looking at the dual-scaled axis graphs on “rjunkies”, the first graph is understandable, but the second is awful and hard to follow. Why I prefer not to use them: it is hard to separate the lines and bars and to interpret them, and most people do not understand them at first glance. It is important to keep graphs simple and clear for presentations and projects, especially when presenting to third parties who do not know the content of your work.
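A common alternative to a dual y-axis is to give each series its own panel. A minimal sketch with hypothetical monthly data (the `revenue` and `temp` columns are made up for illustration):

```r
# Instead of a dual y-axis, reshape to long format and facet with free scales.
library(ggplot2)
library(tidyr)

df <- data.frame(
  month   = 1:12,
  revenue = c(10, 12, 11, 15, 14, 16, 18, 17, 19, 21, 20, 22),  # hypothetical, in $M
  temp    = c(5, 7, 10, 14, 18, 22, 25, 24, 20, 15, 9, 6)       # hypothetical, in Celsius
)

# One row per (month, series) pair
long <- pivot_longer(df, c(revenue, temp),
                     names_to = "series", values_to = "value")

ggplot(long, aes(month, value)) +
  geom_line() +
  facet_wrap(~ series, ncol = 1, scales = "free_y")
```

With `scales = "free_y"` each panel keeps its own natural scale, so neither series has to be distorted to share an axis.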
Question 2 : What is your exploratory data analysis workflow? Suppose you are given a data set and a research question. Where do you start? How do you proceed? For instance, you are given the task to distribute funds from donations to public welfare projects in a wide range of subjects (e.g. education, gender equality, poverty, job creation, healthcare) with the objective of maximum positive impact on the society in general. Assume you have almost all the data you require. How do you measure impact? How do you form performance measures? What makes you think you find an interesting angle?
My answer 2 : My workflow is as below:

1. Import the data into R.
2. Check the working directory, create a folder to keep all the work, and set it as the working directory.
3. Inspect the data with `summary()`, `glimpse()`, `describe()`, `names()`, and `attach()` (to work easily with columns).
4. Check the data for NA values (the functions above will give an idea about them); omit the NAs or transform them according to your data (e.g. fill with the mean value if that will not distort the data).
5. For a sample data set about funds and donations, it is better to check unemployment rates, the education/school status of the area, and health rates or recorded diseases to decide where to concentrate your work.
6. After checking the variables separately, group the necessary variables, such as “female” and “male”, and filter according to “education” or “unemployment rate” to assess gender inequality; for this we need `mutate()`, `group_by()`, and `filter()`.
7. Once you have grouped or filtered data, visualize it with ggplot2 to see what it looks like.
8. If we have area/city information, it is better to check the overall distribution of variables such as “education”, “unemployment”, and “healthcare”.
9. At this stage we should look at the correlations between variables to work towards a model.
10. Correlation will direct us to the variables to continue with; then we can run k-means and PCA to see where the data clusters and focus on those groups.
11. The last part is a decision tree, with which we decide where to spend the funds and which weak points to improve.
12. In conclusion, it is always good to explain each step and comment on it with reasons; a conclusion section at the end makes the work effective. When deciding, I check what the data says and the social status of these problems in the country, and I prioritize the most painful parts of society: while society has healthcare problems, you cannot concentrate on education; gender inequality is a big problem in most countries, but if unemployment is also high for men, you cannot decide to spend the funds only on improvements for women.
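The NA-handling and grouping steps above can be sketched with dplyr on a hypothetical regional welfare data set (all column names and values here are assumptions for illustration):

```r
library(dplyr)

# Hypothetical regional welfare data with one missing unemployment value
welfare <- data.frame(
  region       = c("A", "B", "C", "D"),
  gender       = c("female", "male", "female", "male"),
  unemployment = c(0.12, 0.08, NA, 0.15),
  schooling    = c(6.5, 9.1, 7.8, 5.9)
)

# Step 4: handle the NA, here by filling with the column mean
welfare <- welfare %>%
  mutate(unemployment = ifelse(is.na(unemployment),
                               mean(unemployment, na.rm = TRUE),
                               unemployment))

# Steps 6-9: group and summarise to compare candidate focus areas
priority <- welfare %>%
  group_by(gender) %>%
  summarise(avg_unemp  = mean(unemployment),
            avg_school = mean(schooling)) %>%
  arrange(desc(avg_unemp))
```

The resulting `priority` table ranks the groups by average unemployment, which is the kind of comparison that would feed steps 10-12.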
Question 3 : What are the differences between time series and non time series data in terms of analysis, modeling and validation? In other words what makes Bitcoin price movements analysis different from diamonds (or carat) data set?
My answer : Time series analysis: if our data deals with time, the observations are autocorrelated over time, and the trend is important for prediction and forecasting. Values change fast; if you look at “bitcoin” data, the opening/closing and min/max values can be very different within the same day.
Non-time series : the data does not depend on time. In the “diamonds” data, price does not change with time; instead, categorical variables like “cut”, “color”, and “clarity” are correlated with price.
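A minimal base-R sketch of the difference: in a time series the ordering of the observations carries information (autocorrelation), while in a cross-sectional set like diamonds the row order is arbitrary. Shuffling a simulated random-walk “price” destroys its lag-1 autocorrelation:

```r
set.seed(1)
price <- cumsum(rnorm(200)) + 100   # simulated random-walk "price", bitcoin-like

acf_original <- acf(price, plot = FALSE)$acf[2]          # lag-1 autocorrelation
acf_shuffled <- acf(sample(price), plot = FALSE)$acf[2]  # ordering destroyed

# The original series keeps a lag-1 autocorrelation near 1; the shuffled
# one drops toward 0, showing that the time index itself carries information.
```

This is also why validation differs: time series models are validated on future hold-out windows, while a random train/test split is fine for data like diamonds.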
Question 4 : If you had to plot a single graph using the data below what would it be? Why? Make your argument, actually code the plot and provide the output. (You can find detailed info about the movies data set in its help file. Use `?movies` after you load the `ggplot2movies` package.)
My answer :
library(ggplot2movies)
head(movies)
## title year length budget rating votes r1 r2 r3
## 1 $ 1971 121 NA 6.4 348 4.5 4.5 4.5
## 2 $1000 a Touchdown 1939 71 NA 6.0 20 0.0 14.5 4.5
## 3 $21 a Day Once a Month 1941 7 NA 8.2 5 0.0 0.0 0.0
## 4 $40,000 1996 70 NA 8.2 6 14.5 0.0 0.0
## 5 $50,000 Climax Show, The 1975 71 NA 3.4 17 24.5 4.5 0.0
## 6 $pent 2000 91 NA 4.3 45 4.5 4.5 4.5
## r4 r5 r6 r7 r8 r9 r10 mpaa Action Animation Comedy Drama
## 1 4.5 14.5 24.5 24.5 14.5 4.5 4.5 0 0 1 1
## 2 24.5 14.5 14.5 14.5 4.5 4.5 14.5 0 0 1 0
## 3 0.0 0.0 24.5 0.0 44.5 24.5 24.5 0 1 0 0
## 4 0.0 0.0 0.0 0.0 0.0 34.5 45.5 0 0 1 0
## 5 14.5 14.5 4.5 0.0 0.0 0.0 24.5 0 0 0 0
## 6 14.5 14.5 14.5 4.5 4.5 14.5 14.5 0 0 0 1
## Documentary Romance Short
## 1 0 0 0
## 2 0 0 0
## 3 0 0 1
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
I wanted to compare the ratings of Animation movies against the others for movies released after 2000. Ratings are increasing.
library(ggplot2)
ggplot(movies[movies$year > 2000, ], aes(x = factor(year), y = rating, fill = factor(Animation))) +
  geom_violin()
In our project I wanted to go deeper into salary levels, which we did not analyse in some parts, so I have added one more section to see how salary levels change over the years.
We have shown the salary distribution per department, but I wanted to check other factors against salary levels, in order to establish that employees are not leaving because of “salary” by itself; most of them have low salaries except at the management level.
library(tidyverse)
library(ggplot2)
library(dplyr)
library(scales)
library(plotly)
d <- read.csv("HR_comma_sep.csv")
d <- d %>% rename("departments" = "sales") %>% tbl_df()
head(d)
## # A tibble: 6 x 10
## satisfaction_level last_evaluation number_project average_montly_hours
## <dbl> <dbl> <int> <int>
## 1 0.38 0.53 2 157
## 2 0.80 0.86 5 262
## 3 0.11 0.88 7 272
## 4 0.72 0.87 5 223
## 5 0.37 0.52 2 159
## 6 0.41 0.50 2 153
## # ... with 6 more variables: time_spend_company <int>,
## # Work_accident <int>, left <int>, promotion_last_5years <int>,
## # departments <fctr>, salary <fctr>
g <- d %>%
mutate(salary = ordered(salary, c("low", "medium", "high"))) %>%
count(time_spend_company, salary) %>%
group_by(time_spend_company) %>%
mutate(n = n / sum(n))
# stacked bars show the share of each salary band per tenure year
ggplot(g, aes(time_spend_company, n)) +
  geom_col(aes(fill = salary))
According to the chart above, employees with up to 4 years in the company mostly have low salaries, those with 5-7 years are in the medium band, and salaries move into the high band only after about 8 years. The company prefers not to give a high salary to talented, highly motivated people until they have worked there for at least 8 years. This part was missing in our project; it completes the comments on salary and lets us focus on other factors in our analysis of why employees left. “Salary” is not the first factor for the employees who quit, but all of them share the same problem with salary. In conclusion, the company could pay capable employees more to keep them.
I have downloaded the OSYM data on foreign students in universities for 2016 and 2017 from https://istatistik.yok.gov.tr/. I wanted to see the volume of foreign students and which nationalities mostly prefer to study in Turkey.
#Read both files for 2016 and 2017
data1=read.csv2("2016.csv", header = T)
data2=read.csv2("2017.csv", header = T)
data_all = rbind(data1, data2)
data_new = data_all %>%
rename("Year" = "ï..YIL","University_Name" = "UNIVERSITE.ADI", "University_Type" = "UNIVERSITE.TURU", "City_Name" = "IL.ADI","Nationality" = "UYRUK", "Male" = "ERKEK", "Female" = "KADIN", "Total" = "TOPLAM" ) %>%
tbl_df()
attach(data_new)
glimpse(data_new)
## Observations: 13,160
## Variables: 8
## $ Year <int> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016...
## $ University_Name <fctr> ABANT IZZET BAYSAL UNIVERSITESI, ABANT IZZET ...
## $ University_Type <fctr> DEVLET, DEVLET, DEVLET, DEVLET, DEVLET, DEVLE...
## $ City_Name <fctr> BOLU, BOLU, BOLU, BOLU, BOLU, BOLU, BOLU, BOL...
## $ Nationality <fctr> AFGANISTAN ISLAM CUMHURIYETI, ALMANYA FEDERAL...
## $ Male <int> 21, 1, 0, 1, 5, 2, 1, 0, 1, 1, 4, 1, 0, 2, 0, ...
## $ Female <int> 14, 3, 2, 4, 4, 0, 0, 1, 1, 0, 4, 1, 1, 0, 1, ...
## $ Total <int> 35, 4, 2, 5, 9, 2, 1, 1, 2, 1, 8, 2, 1, 2, 1, ...
summary(data_new)
## Year University_Name University_Type
## Min. :2016 ISTANBUL UNIVERSITESI : 299 : 2
## 1st Qu.:2016 ANKARA UNIVERSITESI : 253 DEVLET :9540
## Median :2017 MARMARA UNIVERSITESI : 250 VAKIF :3604
## Mean :2017 HACETTEPE UNIVERSITESI: 230 VAKIF MYO: 14
## 3rd Qu.:2017 ANADOLU UNIVERSITESI : 229
## Max. :2017 ULUDAG UNIVERSITESI : 218
## (Other) :11681
## City_Name Nationality Male
## ISTANBUL :4040 AZERBAYCAN CUMHURIYETI : 281 Min. : 0.00
## ANKARA :1565 SURIYE ARAP CUMHURIYETI : 276 1st Qu.: 1.00
## IZMIR : 880 TURKMENISTAN : 264 Median : 2.00
## ESKISEHIR: 403 IRAN ISLAM CUMHURIYETI : 237 Mean : 19.13
## KONYA : 381 IRAK CUMHURIYETI : 235 3rd Qu.: 5.00
## KOCAELI : 269 AFGANISTAN ISLAM CUMHURIYETI: 232 Max. :70926.00
## (Other) :5622 (Other) :11635
## Female Total
## Min. : 0.0 Min. : 1.00
## 1st Qu.: 0.0 1st Qu.: 1.00
## Median : 1.0 Median : 3.00
## Mean : 9.3 Mean : 28.43
## 3rd Qu.: 3.0 3rd Qu.: 8.00
## Max. :35682.0 Max. :106608.00
##
We check whether the data contains any NA values. Note that `na.omit()` returns a new object, so its result must be assigned back:
which(is.na.data.frame(data_new))
data_new <- na.omit(data_new)
data_last = data_new %>% filter(University_Name!= "TOPLAM")
library(formattable) # provides formattable(), area() and color_tile()
data_last1 = data_last %>%
  count(University_Type) %>%
  mutate(n = n / sum(n), n = percent(n)) %>%
  formattable(list(area(T, 1:2) ~ color_tile("pink", "grey")), align = 'l')
data_last1
University_Type | n |
---|---|
DEVLET | 72.50% |
VAKIF | 27.39% |
VAKIF MYO | 0.11% |
Below we will see the universities that have more than 500 foreign students:
df <- data_last%>% select(Year, University_Name, City_Name, Total) %>%
filter(Total >=500) %>%
group_by(University_Name) %>%
summarise_each(funs(first(na.omit(.)))) %>% arrange(desc(Total))
head(df)
## # A tibble: 6 x 4
## University_Name Year City_Name Total
## <fctr> <int> <fctr> <int>
## 1 GAZIANTEP UNIVERSITESI 2016 GAZIANTEP 1080
## 2 KARABUK UNIVERSITESI 2017 KARABUK 927
## 3 ANADOLU UNIVERSITESI 2016 ESKISEHIR 774
## 4 MERSIN UNIVERSITESI 2017 MERSIN 741
## 5 USAK UNIVERSITESI 2016 USAK 721
## 6 TRAKYA UNIVERSITESI 2016 EDIRNE 683
Gaziantep comes first in 2016 for foreign students, most probably because of the city’s location near the Syrian border.
We will look at a bar graph to see this more clearly:
df$University_Name=substr(df$University_Name,1,10)
ggplot(data=df, aes(x=reorder(University_Name, desc(Total)),y=Total,fill=City_Name))+
theme(axis.text.x = element_text(angle = 90, hjust=1))+
geom_bar(stat="identity",width = 0.95)+
labs(title = "Top Universities with Foreign Students")+
labs(x = "University Name", y = "Count") +
expand_limits(y=10)
Due to Gaziantep’s location, Gaziantep University comes first; most probably these are Syrian citizens. Very interestingly, Karabuk follows Gaziantep.
We will look at the distribution of nationalities across cities to see the volume in each city:
df2 <- data_last %>% select(Year, City_Name ,Nationality, Total) %>%
filter(Total >=500) %>%
group_by(Nationality) %>% group_by(City_Name) %>%
summarise_each(funs(first(na.omit(.)))) %>% group_by(Nationality) %>%
arrange(desc(Total))
head(df2)
## # A tibble: 6 x 4
## # Groups: Nationality [3]
## City_Name Year Nationality Total
## <fctr> <int> <fctr> <int>
## 1 GAZIANTEP 2016 SURIYE ARAP CUMHURIYETI 1080
## 2 KARABUK 2017 SURIYE ARAP CUMHURIYETI 927
## 3 ESKISEHIR 2016 AZERBAYCAN CUMHURIYETI 774
## 4 MERSIN 2017 SURIYE ARAP CUMHURIYETI 741
## 5 USAK 2016 AZERBAYCAN CUMHURIYETI 721
## 6 EDIRNE 2016 YUNANISTAN CUMHURIYETI 683
df2$Nationality=substr(df2$Nationality,1,10)
ggplot(data=df2, aes(x=reorder(Nationality, desc(Total)),y=Total,fill=City_Name))+
theme(axis.text.x = element_text(angle = 90, hjust=1))+
geom_bar(stat="identity",width = 0.95)+
labs(title = "Nationalities & Cities")+
labs(x = "Universities", y = "Count") +
expand_limits(y=10)
Selecting Nationalities
df3 <- data_last %>% select(Year, University_Name, Nationality, City_Name, Total) %>%
filter(Total >=400) %>%
group_by(University_Name) %>%
summarise_each(funs(first(na.omit(.)))) %>% arrange(desc(Total))
## `summarise_each()` is deprecated.
## Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
## To map `funs` over all variables, use `summarise_all()`
head(df3)
## # A tibble: 6 x 5
## University_Name Year Nationality
## <fctr> <int> <fctr>
## 1 GAZIANTEP UNIVERSITESI 2016 SURIYE ARAP CUMHURIYETI
## 2 ANADOLU UNIVERSITESI 2016 AZERBAYCAN CUMHURIYETI
## 3 USAK UNIVERSITESI 2016 AZERBAYCAN CUMHURIYETI
## 4 TRAKYA UNIVERSITESI 2016 YUNANISTAN CUMHURIYETI
## 5 KAHRAMANMARAS SUTCU IMAM UNIVERSITESI 2017 SURIYE ARAP CUMHURIYETI
## 6 ISTANBUL UNIVERSITESI 2016 AZERBAYCAN CUMHURIYETI
## # ... with 2 more variables: City_Name <fctr>, Total <int>
I wanted to see the dispersion of nationalities across universities:
df3$University_Name=substr(df3$University_Name,1,16)
ggplot(data=df3, aes(x=reorder(University_Name, desc(Nationality)),y=Total,fill=Nationality))+
theme(axis.text.x = element_text(angle = 90, hjust=1))+
geom_bar(stat="identity",width = 0.95)+
labs(title = "Top 6 Nationalities")+
labs(x = "Universities", y = "Count") +
expand_limits(y=10)
Conclusion : The chart above shows that Syrian students are mostly in the southern part of Turkey, and the same holds for most nationalities, with some exceptions such as the Iranian and Azerbaijani students. I did not expect so many students from Azerbaijan. In general the distribution is normal according to location, but some cities and universities hosting foreign students, such as Karabuk and Usak, are interesting. Syrian and Azerbaijani students are at the top.
My .RData file is available at the link below:
https://github.com/MEF-BDA503/pj-MeryemKemerci/blob/master/OSYM_Final_Exam_Data.RData