Part I: Short and Simple

I never used two y-axis graphs either in my work or in my EDA. I think two y-axis graph are good at first for comparing two related features with respect to the same independent variable. In this specific example of Rjunkies, since two features plotted differ in units, it gives a general sense of trend of as normalized accident count decreases as number of departures increases over the years. Besides, it is good that they plotted one feature as line and the other as histogram. This way they donot force people to make magnitude comparision, wrongly. It would be easier to read their graph if only Rjunkies, colored these two axis with different colors or add legend, instead of printing the graph types in the title. Alternatively, one could also use dot plots to make greater or lesser difference comparisions; different from histograms, dots have relative positions [1]. But, in my opinion line graph is the best to compare two possibly related features in the same graph provided that scaling is chosen very carefully and highlighted in the graph. Field experts can easily read them and sometimes two y-axis plots cannot be avoidable especially when there is page size limit like it is in IEEE papers. When one plots two different columns with the same units but with different scales, two y-axis graphs could be misleading. As Stephen Few discusses in [1] in YTD Sales Example, dual-scaled y-axii can force user to make magnitude comparision and lead to faulty deduction. In such situations, one should simply plot two graphs seperately.
In my EDA workflow, first I try to understand each column by reading metadata. Meanwhile, I also bear in my mind the research question of interest and try to eliminate most usefull columns to answer the question. In this step I also start to build my hypothesis and wonder whether the data will surprise me or not. Then I load the data to my environment and have a quick glimpse to its size and head or tail. In this way I know the size of the data that I am dealing with and by printing first or last few lines of each column I know how values are stored in the dataframe and also their types. Later comes the data cleaning step, in which NA/NaN values are omitted or just replaced with appropriate values. Before moving to statistical analysis I may want to transfom some categorical variables to numeric ones if necessary (i.e. Yes/No to 1/0) for easy handling. If all is good I proceed the descriptive statistics step. By looking at the min and max values of each column I get an idea of how each column’s scale differ from the others and checking whether mean, median and mode values are equal or at least closer or not to understand the distribution of that column is skewed or not. I also plot each column to have better understanding of their distribution. I find drawing correlation matrix beneficial to understand how each feature is related to others. Sometimes, some columns seems to be very highly correlated to each other and I think that they are giving the same information just with different words. In a way they are dependent. I also make some pair-wise plots, they are really helpfull especially when there is some kind of grouping. If I find a such clustering this might be interesting and I focus on deeper. But before this step some filtering, data transformations and column mutations based on the research question might be necessary. For example, I could set an objective of increasing life satisfaction (obtained from self-assesment tests) or increasing GNP while preserving diversity. Then, satisfaction rate or GNP with uniform diversity becomes my impact rate. I would try to come up with a solution which maximises cumulative impact rate of all subjects. I guess I just present what the data says but if the data support my personal inclination on the subject I tend to highlt it with strong words.
Time series data is flowing data. Varibles change in time like sensorial data of IoT, whilst non-time series data is frozen in time. Auto-correlation (i.e. due to seasonal trends) issue must be taken into account while dealing with time-series analysis. This is due to dependence of two (or more) points in the series which are observed at different times, and leads to seasonality. In terms of analysis, time-series can only be similar to non-time series if time average is taken. On both types of the data prediction can be made. But with time series data prediction for the future can be made with time series analysis.
I would plot budget-rating relationship. Are very popular movies need to be shot within very large budgets? I believe not.

library(ggplot2movies)
library(tidyverse)
library(ggthemes)

Load data and have a look at it:

data <- movies %>% tbl_df 
#glimpse(data)

There are lots of NA values especially in the budget column. Let’s check for others:

sum(is.na(data))

## [1] 53573

Omit NA values:

data_woNA <- data %>% na.omit()

By omitting NA values, I lost a very large portion of the entire data but I am curious about how budget and rating is related. Is it really necessary to invest lots of money to a movie, in order to gain public’s appealing?

When I look at the descriptive statistics of movies with known budgets; I observed that budget has a skewed distribution (median is so different than the mean) and median value of the budget is 3000000$.

summary(data_woNA)

##     title                year          length           budget         
##  Length:5215        Min.   :1903   Min.   :  1.00   Min.   :        0  
##  Class :character   1st Qu.:1975   1st Qu.: 86.00   1st Qu.:   250000  
##  Mode  :character   Median :1996   Median : 97.00   Median :  3000000  
##                     Mean   :1985   Mean   : 95.99   Mean   : 13412513  
##                     3rd Qu.:2001   3rd Qu.:111.00   3rd Qu.: 15000000  
##                     Max.   :2005   Max.   :390.00   Max.   :200000000  
##      rating           votes              r1               r2        
##  Min.   : 1.000   Min.   :     5   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 5.200   1st Qu.:    67   1st Qu.: 4.500   1st Qu.: 4.500  
##  Median : 6.300   Median :   612   Median : 4.500   Median : 4.500  
##  Mean   : 6.141   Mean   :  4974   Mean   : 7.842   Mean   : 4.805  
##  3rd Qu.: 7.200   3rd Qu.:  4642   3rd Qu.: 4.500   3rd Qu.: 4.500  
##  Max.   :10.000   Max.   :157608   Max.   :94.500   Max.   :44.500  
##        r3               r4               r5               r6       
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.00  
##  1st Qu.: 4.500   1st Qu.: 4.500   1st Qu.: 4.500   1st Qu.: 4.50  
##  Median : 4.500   Median : 4.500   Median : 4.500   Median :14.50  
##  Mean   : 4.959   Mean   : 5.892   Mean   : 8.411   Mean   :11.99  
##  3rd Qu.: 4.500   3rd Qu.: 4.500   3rd Qu.:14.500   3rd Qu.:14.50  
##  Max.   :45.500   Max.   :64.500   Max.   :64.500   Max.   :64.50  
##        r7              r8              r9              r10        
##  Min.   : 0.00   Min.   : 0.00   Min.   : 0.000   Min.   :  0.00  
##  1st Qu.: 4.50   1st Qu.: 4.50   1st Qu.: 4.500   1st Qu.:  4.50  
##  Median :14.50   Median :14.50   Median : 4.500   Median : 14.50  
##  Mean   :15.49   Mean   :14.33   Mean   : 9.661   Mean   : 17.65  
##  3rd Qu.:24.50   3rd Qu.:24.50   3rd Qu.:14.500   3rd Qu.: 24.50  
##  Max.   :64.50   Max.   :84.50   Max.   :84.500   Max.   :100.00  
##      mpaa               Action         Animation           Comedy     
##  Length:5215        Min.   :0.0000   Min.   :0.00000   Min.   :0.000  
##  Class :character   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.000  
##  Mode  :character   Median :0.0000   Median :0.00000   Median :0.000  
##                     Mean   :0.1607   Mean   :0.02416   Mean   :0.336  
##                     3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:1.000  
##                     Max.   :1.0000   Max.   :1.00000   Max.   :1.000  
##      Drama         Documentary         Romance           Short        
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.00000   Median :0.0000   Median :0.00000  
##  Mean   :0.4531   Mean   :0.02435   Mean   :0.1478   Mean   :0.08514  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.00000

Now, let’s arrange data with respect to rating score in descending order. In this way we can list top 10 movies with known budget.

data_woNA %>% arrange(desc(rating)) %>% select(title,budget, rating)

Let’s define “very good movie” as a movie whose rating is greater or equal to 9.

goodMovies <- data_woNA %>% filter(rating>=9)

Since budget or rating can depend on the genre, let’s plot ratings of best movies with respect to budget for different genres.

goodAction <- goodMovies %>% filter(Action==1)%>% mutate(Genre="Action")
goodAnimation <- goodMovies %>% filter(Animation==1)%>% mutate(Genre="Animation")
goodComedy <- goodMovies %>% filter(Comedy==1)%>% mutate(Genre="Comedy")
goodDrama <- goodMovies %>% filter(Drama==1)%>% mutate(Genre="Drama")
goodDocumentary <- goodMovies %>% filter(Documentary==1)%>% mutate(Genre="Documentary")
goodRomance <- goodMovies %>% filter(Romance==1)%>% mutate(Genre="Romance")
goodShort <- goodMovies %>% filter(Short==1)%>% mutate(Genre="Short")
#bind all the dataframes
allGood <- rbind(goodAction,goodAnimation,goodComedy,goodDrama,goodDocumentary,goodRomance,goodShort)
#remove duplicates
allGood <- allGood[!duplicated(allGood),]

After those necessary transformations, now we are ready to plot the results.

ggplot(allGood,aes(x=budget, y=rating, fill=Genre)) +
geom_boxplot(alpha=0.4) +
scale_x_log10() +
ggtitle("How Much Spent on Very Popular Movies?") +
labs(y="Rating Score", x="Budget($)") +
theme_solarized_2(light=FALSE) +scale_color_solarized("red")

Recall that median value of the budget column was 3000000$ and x-axis of the plot above is in log scale. Thus, the graph must be read accordingly. Action and drama movies have widest budget range of all while action movies are shot for more expensive budgets than drama. This is no surprise since there are lots of special effects in action movies. But they pay their price at the end and had an median rating score of 9.75, greater than all and probably with highhest box-office returns. Since action movies budget range is widest, I think that their high rating score is not so dependent to the budget. Even movies with medium-low budgets have a possibility of becoming super popular, if their genre is action. May be people tend to vote action movies with higher scores. Romance and short movies share the same median rating score greater than 9.375 while romance is shot for more expensive budget range and short movies have a widest rating score range. Interestingly, shooting an popular animation movie requires least budget and they have a high median score of 9.5 with a narrow range.

In conclusion, if a producer aims to shoot a movie with a very high popularity and has no budget limit, he/she should invest on action movies; if budget is a constraint then he/she should invest on animation film of a studios with talented and creative designers.

Part II: Extending Your Group Project (30 pts)

In my opinion, choosing a intutive cluster number as 3 in our project was a flaw. According to our pair-wise scatter plots we identified three groups that is why we choose k=3 while implementing k-means model. Now, I want to strenghten that point.

Let’s start with importing requires libaries.

library(tidyverse)
library(ggthemes)
library(formattable)

library(GGally)
library(ggalt)
library(party)
library(rpart)
library(rpart.plot)
library(pROC)

The data is read online and transformed to tibble data frame. I want to focus on employees left the company, so data is filtered accordingly.

d<-read.csv("https://raw.githubusercontent.com/MEF-BDA503/gpj-2yaka/master/HR_comma_sep.csv"
            ,header=TRUE, sep=",")
d<- d %>% rename("departments" = "sales") %>% tbl_df()

#Filter to get the data from left people only
dLeft <-d %>% filter(left==1)
#Only numeric columns are included
dLeft_num <- d %>% filter(left==1) %>% select(-left,-departments,-salary)

Another thing that was missing in our project report was comparision of descriptive statistics of the whole data and people left.

summary(d)

##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident         left       
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 3.000     Median :0.0000   Median :0.0000  
##  Mean   : 3.498     Mean   :0.1446   Mean   :0.2381  
##  3rd Qu.: 4.000     3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000  
##                                                      
##  promotion_last_5years      departments      salary    
##  Min.   :0.00000       sales      :4140   high  :1237  
##  1st Qu.:0.00000       technical  :2720   low   :7316  
##  Median :0.00000       support    :2229   medium:6446  
##  Mean   :0.02127       IT         :1227                
##  3rd Qu.:0.00000       product_mng: 902                
##  Max.   :1.00000       marketing  : 858                
##                        (Other)    :2923

summary(dLeft)

##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.4500   Min.   :2.000   Min.   :126.0       
##  1st Qu.:0.1300     1st Qu.:0.5200   1st Qu.:2.000   1st Qu.:146.0       
##  Median :0.4100     Median :0.7900   Median :4.000   Median :224.0       
##  Mean   :0.4401     Mean   :0.7181   Mean   :3.856   Mean   :207.4       
##  3rd Qu.:0.7300     3rd Qu.:0.9000   3rd Qu.:6.000   3rd Qu.:262.0       
##  Max.   :0.9200     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident          left   promotion_last_5years
##  Min.   :2.000      Min.   :0.00000   Min.   :1   Min.   :0.000000     
##  1st Qu.:3.000      1st Qu.:0.00000   1st Qu.:1   1st Qu.:0.000000     
##  Median :4.000      Median :0.00000   Median :1   Median :0.000000     
##  Mean   :3.877      Mean   :0.04733   Mean   :1   Mean   :0.005321     
##  3rd Qu.:5.000      3rd Qu.:0.00000   3rd Qu.:1   3rd Qu.:0.000000     
##  Max.   :6.000      Max.   :1.00000   Max.   :1   Max.   :1.000000     
##                                                                        
##      departments      salary    
##  sales     :1014   high  :  82  
##  technical : 697   low   :2172  
##  support   : 555   medium:1317  
##  IT        : 273                
##  hr        : 215                
##  accounting: 204                
##  (Other)   : 613

Mean satisfaction level of employees left is 44% which is way lower than company’s overall average of 61%; median of evaluation score of people left is slightly higher than company’s median, number of project donot differ significantly; median of working hours is lower for people left so there must be a group of people not so committed to work; people quitted works for lower years; accident count is lower for the people left so it cannot be factor effects the resignation decision.

Moreover, columns are distributes as follows:

Notice that, average monthly hours, evaluation score and number of projects’ distributions are strongly bimodal. Other than that satisfaction level of the employees are localizen on three distinct values. Nearly, none of them has got promoted and work accidents are rare.

K-Means model:

In k-means at first each point is randomly assigned to a cluster. Then centers are calculated as average positions of each subgroups. Then each point in the set is assigned to the cluster with nearest center. Iteratively center positions are changed and at each time points are re-assigned again till no point changes its positions anymore. Total within cluster sum of squares is used as a measure of performance of the algorithm. It should be minimized for best solution and running algorithm for several times helps to fing a better global minimum. Elbow technique can be used to determine the optimal number of clusters. In this method, we plot total within sum of squares with respect to differing number of clusters and then deduce where this measure becomes to be stabilized (i.e. elbow point). Besides that threshold, there is no meaningfull decrease in total within sum of squares.

I run kmeans algorithm up to 15 centers with 30 initial starting points (best is taken) and with an iteration limit of 50 to guarantiee the convergence.

set.seed(42)
# Initialize total within sum of squares error: wss
wss <- 0

# For 1 to 15 cluster centers
for (i in 1:15) {
  km.out <- kmeans(dLeft_num, centers = i, nstart=30, iter.max = 50)
  # Save total within sum of squares to wss variable
  wss[i] <- km.out$tot.withinss
}

# Plot total within sum of squares vs. number of clusters
ggplot() + geom_line(aes(y = wss, x = seq(1, 15))) +
  geom_point(aes(y = wss, x = seq(1, 15))) + theme_solarized_2(light=FALSE) +
  scale_color_solarized("white") +
  labs(y="Within Groups Sum of Squares",x="Number of Clusters", title="Scree Plot")

From the graph above elbow position seems to be at k=3, so our prior choice was a wise choice. Locaiton of the cluster centers are listed below.

set.seed(42)
km.out <- kmeans(dLeft_num,centers=3, nstart = 30, iter.max = 50)
km.out$centers

##   satisfaction_level last_evaluation number_project average_montly_hours
## 1          0.6100000       0.8891911       4.936889             243.6409
## 2          0.4159149       0.5306140       2.178723             144.8322
## 3          0.2511361       0.8628964       5.780275             285.0799
##   time_spend_company Work_accident promotion_last_5years
## 1           4.778667    0.05333333          0.0008888889
## 2           3.071733    0.04741641          0.0091185410
## 3           4.262172    0.03870162          0.0037453184

These results of cluster centers are consistent with our prior analysis.

Now, I will try to catch distinct clusters in each colum.

p1 <- ggplot(dLeft_num, aes(y = satisfaction_level, 
  x = seq(1, length(satisfaction_level)),col = factor(km.out$cluster))) +
  geom_point() + theme_solarized_2(light=FALSE) +scale_color_solarized("red") +
  labs(y="Score",x="Index", col="Clusters", title ="Satisfaction", size=8)

p2 <- ggplot(dLeft_num, aes(y = last_evaluation,
  x = seq(1, length(last_evaluation)),col = factor(km.out$cluster))) +
  geom_point() + theme_solarized_2(light=FALSE) +scale_color_solarized("red") +
  labs(y="Score",x="Index", col="Clusters", title ="Evaluation", size=8)

p3 <- ggplot(dLeft_num, aes(y = number_project, 
  x = seq(1, length(number_project)),col = factor(km.out$cluster))) +
  geom_point() + theme_solarized_2(light=FALSE) +scale_color_solarized("red") +
  labs(y="Count",x="Index", col="Clusters", title ="Projects", size=8)
 
p4 <- ggplot(dLeft_num, aes(y = average_montly_hours, 
  x = seq(1, length(average_montly_hours)),col = factor(km.out$cluster))) +
  geom_point() + theme_solarized_2(light=FALSE) +scale_color_solarized("red") +
  labs(y="Monthly Hours",x="Index", col="Clusters", title ="Working Hours", size=8)

p5 <- ggplot(dLeft_num, aes(y = time_spend_company, 
  x = seq(1, length(time_spend_company)),col = factor(km.out$cluster))) +
  geom_point() + theme_solarized_2(light=FALSE) +scale_color_solarized("red") +
  labs(y="Time(years)",x="Index", col="Clusters", title ="Working Years", size=8)

multiplot(p1,p2,p3,p4,p5,cols=2)

In the graphs above columns with contious numerical values are plotted and colored according to k-Means clusters, subgroups are well seperated in job satisfaction, monthly working hours and time spent in the company and partly seperared in number of projects as seen above. Evaluation score of Cluster 3 and 1 seems to be merged. Since evaluation score is given by company and nearly none is promoted in the last five years based on this score, it might not affect the employees decision to quit.

clusterPlot1 <- ggplot(dLeft_num, aes( satisfaction_level, 
  time_spend_company,col = factor(km.out$cluster))) +
  geom_point() + theme_solarized_2(light=FALSE) +scale_color_solarized("red") +
  labs(x="Satisfaction Level",y="Time Spend In Company(Years)", col="Clusters") 

clusterPlot2 <- ggplot(dLeft_num, aes(satisfaction_level,
  average_montly_hours,col = factor(km.out$cluster))) +
  geom_point() + theme_solarized_2(light=FALSE) +scale_color_solarized("red") +
  labs(x="Satisfaction Level",y="Average Monthly Hours(h)", col="Clusters") 

multiplot(clusterPlot1,clusterPlot2,cols = 2)

Finally, when I coloured our good old scatter plots according to cluster types, I managed to identify same solid blocks in monthly working hours and verly poorly for total years spent in the company (just for cluster 1 and 2 the previous behaviour is apparent). I donot know why k-Means fully seperare clusters 1 and 3 in monthly hours. At first I thought, the iteration might not fully converged so I raised iter.max parameter from 50 to 500 but nothing has changed. Probably, features coming from other columns causing this. One solution could be not giving all the numerical columns but just the relevant ones to the model. I will study this later.

Part III: Welcome to Real Life (50 pts)

In this part I want to study gender disparity in the academic faculty, and how it is changed over time. Thus, from Higher Education Council’s data service I collected Academician Counts for all universities of years 2016, 2015 and 2014. They were initially in .xls format so I saved them as .csv.

academy2016 <- read.csv("Akademisyen Sayıları2016.csv", sep = ";") %>%
  tbl_df  %>% mutate(Year=2016) 

academy2015 <- read.csv("Akademisyen Sayıları2015.csv", sep = ";") %>% 
  tbl_df  %>% mutate(Year=2015) 

academy2014 <- read.csv("Akademisyen Sayıları2014.csv", sep = ";") %>% 
  tbl_df  %>% mutate(Year=2014)

academy <-rbind(academy2016,academy2015,academy2014) %>% 
  rename("University" = "Üniversite.Adı","Male_Prof"="Profesör.Erkek", 
         "Female_Prof"="Profesör.Kadın", "Total_Prof" ="Profesör.Top.",
         "Male_Assoc_Prof" = "Doçent.Erkek", 
         "Female_Assoc_Prof" = "Doçent.Kadın",
         "Total_Assoc_Prof" ="Doçent.Top.", 
         "Male_Asst_Prof"="Yardımcı.Doçent.Erkek",
         "Female_Asst_Prof"= "Yardımcı.Doçent.Kadın",
         "Total_Asst_Prof" = "Yardımcı.Doçent.Top.", 
         "Male_Instructor"="Öğretim.Görevlisi.Erkek", 
         "Female_Instructor"="Öğretim.Görevlisi.Kadın", 
         "Total_Instructor"="Öğretim.Görevlisi.Top.",
         "Male_Language_Ins"="Okutman.Erkek",
         "Female_Language_Ins"= "Okutman.Kadın",
         "Total_Language_Ins"="Okutman.Top.",
         "Male_Specialist"="Uzman.Erkek", 
         "Female_Specialist"="Uzman.Kadın",
         "Total_Specialist" ="Uzman.Top.", 
         "Male_Research_Asst" ="Araştırma.Görevlisi.Erkek",
         "Female_Research_Asst"= "Araştırma.Görevlisi.Kadın",
         "Total_Research_Asst"="Araştırma.Görevlisi.Top.", 
         "Male_Translator"="Çevirici.Erkek",
         "Female_Translator"="Çevirici.Kadın", 
         "Total_Translator"="Çevirici.Top.",
         "Male_Edu_Planner"="Eğt..Öğr..Plan..Erkek",
         "Female_Edu_Planner"="Eğt..Öğr..Plan..Kadın", 
         "Total_Edu_Planner" = "Eğt..Öğr..Plan..Top.", 
         "Male_Grand_Total"="Genel.Toplam.Erkek",
         "Female_Grand_Total"="Genel.Toplam.Kadın", 
         "Grand_Total"="Genel.Toplam.Top.")

The dataframe is saved as .Rdata file as follows:

save(academy,file="Academia.RData")

I included .Rdata to my GitHub repository. Since it is available online, now I can load .Rdata for further analysis.

download.file("https://github.com/MEF-BDA503/pj-cand/raw/master/files/Academia.RData","data.RData")
data <- get(load("data.RData")) %>% tbl_df

Dimensions of this set are 554x32.

dim(data)

## [1] 554  32

I think column names are self-explanatory but for the sake of completeness they are explained below.

University: Name of the university
Male_Prof: Number of male proffessors in the institution
Female_Prof: Number of female proffessors in the institution
Total_Prof: Total number of proffessors in the institution
Male_Assoc_Prof: Number of male associate professors in the institution
Female_Assoc_Prof: Number of female associate professors in the institution
Total_Assoc_Prof: Total number of associate professors in the institution
Male_Asst_Prof: Number of male assistant professors in the institution
Female_Asst_Prof: Number of female assistant professors in the institution
Total_Asst_Prof: Total number of assistant professors in the institution
Male_Instructor: Number of male instructors in the institution
Female_Instructor: Number of female instructors in the institution
Total_Instructor: Total number of instructors in the institution
Male_Language_Ins: Number of male language instructors in the institution
Female_Language_Ins: Number of female language instructors in the institution
Total_Language_Ins: Number of total language instructors in the institution
Male_Specialist: Number of male specialists in the institution
Female_Specialist: Number of female specialists in the institution
Total_Specialist: Total number of specialists in the institution
Male_Research_Asst: Number of male research assistansts in the institution
Female_Research_Asst: Number of female research assistansts in the institution
Total_Research_Asst: Number of total research assistansts in the institution
Male_Translator: Number of male translators in the institution
Female_Translator: Number of female translators in the institution
Total_Translator: Total number of translators in the institution
Male_Edu_Planner: Number of male education and training planner in the institution
Female_Edu_Planner: Number of female education and training planner in the institution
Total_Edu_Planner: Total number of female education and training planner in the institution
Male_Grand_Total: Number of male academic staff in the institution
Female_Grand_Total: Number of female academic staff in the institution
Grand_Total: Number of total academic staff in the institution

Descriptive statistics of female and male academic staff is below.

summary(data)

Let’s calculate ratios of female representation in all academic ranks.

Female_rates<-data %>%
  transmute(University, Prof_rate=Female_Prof/Total_Prof, 
         Assoc_Prof_rate=Female_Assoc_Prof/Total_Assoc_Prof, 
         Asst_Prof_rate = Female_Asst_Prof/Total_Asst_Prof, 
         Instructor_rate=Female_Instructor/Total_Instructor, 
         Language_Ins_rate = Female_Language_Ins/Total_Language_Ins,
         Specialist_rate = Female_Specialist/Total_Specialist, 
         Research_Asst_rate=Female_Research_Asst/Total_Research_Asst,
         Translator_rate=Female_Translator/Total_Translator, 
         Edu_Planner_rate = Female_Edu_Planner/Total_Edu_Planner,
         Total_rate=Female_Grand_Total/Grand_Total, Year)

The NA values is occured probably due to division by zero, because of the positions with total count 0, thus replacing them with 0 makes sense (since those positions are already empty).

Female_rates[is.na(Female_rates)] <- 0
summary(Female_rates)

##                                  University    Prof_rate     
##  ABANT İZZET BAYSAL ÜNİVERSİTESİ      :  3   Min.   :0.0000  
##  ABDULLAH GÜL ÜNİVERSİTESİ            :  3   1st Qu.:0.1000  
##  ACIBADEM ÜNİVERSİTESİ                :  3   Median :0.1932  
##  ADANA BİLİM VE TEKNOLOJİ ÜNİVERSİTESİ:  3   Mean   :0.1986  
##  ADIYAMAN ÜNİVERSİTESİ                :  3   3rd Qu.:0.2825  
##  ADNAN MENDERES ÜNİVERSİTESİ          :  3   Max.   :1.0000  
##  (Other)                              :536                   
##  Assoc_Prof_rate  Asst_Prof_rate   Instructor_rate  Language_Ins_rate
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   
##  1st Qu.:0.1667   1st Qu.:0.3115   1st Qu.:0.3241   1st Qu.:0.4091   
##  Median :0.2670   Median :0.3860   Median :0.4195   Median :0.5756   
##  Mean   :0.2776   Mean   :0.3882   Mean   :0.4376   Mean   :0.5543   
##  3rd Qu.:0.3932   3rd Qu.:0.4813   3rd Qu.:0.5556   3rd Qu.:0.7362   
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   
##                                                                      
##  Specialist_rate  Research_Asst_rate Translator_rate   Edu_Planner_rate 
##  Min.   :0.0000   Min.   :0.0000     Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.4137     1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.3333   Median :0.4956     Median :0.00000   Median :0.00000  
##  Mean   :0.3022   Mean   :0.4830     Mean   :0.06679   Mean   :0.02876  
##  3rd Qu.:0.5000   3rd Qu.:0.5714     3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.0000     Max.   :1.00000   Max.   :1.00000  
##                                                                         
##    Total_rate          Year     
##  Min.   :0.0000   Min.   :2014  
##  1st Qu.:0.3564   1st Qu.:2014  
##  Median :0.4249   Median :2015  
##  Mean   :0.4312   Mean   :2015  
##  3rd Qu.:0.5103   3rd Qu.:2016  
##  Max.   :0.7347   Max.   :2016  
##

As seen from the statistics above, academic representation of the females in higher academic ranks is lower. It is 20% for proffessorship, 28% for associate proffessorship, 39% for associate proffessorship. Representation rate decreases as females climbs high. These all positions require Ph.D. Interestingly female representation rate for reseach assistantship is close to males with 48%. Graduate students who are most probably willing to pursue their career in academia occupies research assistanship positions. So why those motivated females cannot survive or stuck in the same rank or may be just not willing to be in academia in their later careers? To answer this question I need demographic data or self-assesment reports. For the time being I can only analyze representation rates with respect to academic positions.

Female proffessor rates are plotted against universities. Universities are ordered from most diverse to least in gender of academic staff. The ones on the left of x-axis are the ones with highest female academic staff in numbers and the ones on the right are vice versa.

prof2016 = ggplot(data = Female_rates %>% filter(Year == 2016),
  aes(x = reorder(University, -Total_rate), y = Prof_rate)) +
  theme_solarized_2(light=FALSE) +scale_color_solarized("red") +
  theme(axis.text.x =element_blank() ) +
  geom_bar(stat="identity",position = "stack")+
  labs(x= "Universities(From Most \n Diverse to Least)",
  y="Female Proffessor Rate", title="2016" )
prof2015 = ggplot(data = Female_rates %>% filter(Year == 2015), 
                  aes(x = reorder(University, -Total_rate), y = Prof_rate)) +
  theme_solarized_2(light=FALSE) +scale_color_solarized("red") +
  theme(axis.text.x =element_blank() ) +labs(x= "", y="", title=2015)+
  geom_bar(stat="identity",position = "stack")+labs(x= "\n", y="") +
  expand_limits(y=1)
prof2014 = ggplot(data = Female_rates %>% filter(Year == 2014), 
                  aes(x = reorder(University, -Total_rate), y = Prof_rate)) +
  theme_solarized_2(light=FALSE) +scale_color_solarized("red") +
  theme(axis.text.x =element_blank() ) +
  geom_bar(stat="identity",position = "stack")+labs(x= "\n", y="", title=2014) +
  expand_limits(y=1)
multiplot(prof2016, prof2015, prof2014, cols = 3 )

Female proffessor population rates are high in more gender diverse universities. There is appearent one outlier for the year 2016, three for 2015 and two or more for 2014. Those peaks behave different than the university’s general tendency. Other than those outlies general trend is not changed over three years. This is expected since there is only one year between the dates of comparision, and gaining a title in acedemia requires more (for example at least four years for proffessorship provided that the academician of interest is an associate proffessor already). That is why we cannot expect to see abrupt changes over last three years.

Female associate proffessor rates are plotted against universities ordered by descending diversity.

assocprof2016 = ggplot(data = Female_rates %>% filter(Year == 2016), 
  aes(x = reorder(University, -Total_rate), y = Assoc_Prof_rate)) +
  theme_solarized_2(light=FALSE) +scale_color_solarized("red") +
  theme(axis.text.x =element_blank() ) +
  geom_bar(stat="identity",position = "stack")+
  labs(x= "Universities(From Most \n Diverse to Least)",
  y="Female Associate Proffessor Rate", title="2016" )
assocprof2015 = ggplot(data = Female_rates %>% filter(Year == 2015), 
  aes(x = reorder(University, -Total_rate), y = Assoc_Prof_rate)) +
  theme_solarized_2(light=FALSE) +scale_color_solarized("red") +
  theme(axis.text.x =element_blank() ) +labs(x= "", y="", title=2015)+
  geom_bar(stat="identity",position = "stack")+labs(x= "\n", y="") +
  expand_limits(y=1)
assocprof2014 = ggplot(data = Female_rates %>% filter(Year == 2014),
  aes(x = reorder(University, -Total_rate), y = Assoc_Prof_rate)) +
  theme_solarized_2(light=FALSE) +scale_color_solarized("red") +
  theme(axis.text.x =element_blank() ) +
  geom_bar(stat="identity",position = "stack")+labs(x= "\n", y="", title=2014) +
  expand_limits(y=1)
multiplot(assocprof2016, assocprof2015, assocprof2014, cols = 3 )

Again more gender diverse universities are the ones with higher female associate proffessor rate, but the slope declines slower when compared to female proffessor rates and thus the tails are fatter. Interestingly we have more outliers. So, we can deduce that for associate proffessorship positions less diverse universities tend to hire female staff opposing to their general tendencies.

Similarly female assistant proffessor rates are plotted against universities.

asstprof2016 = ggplot(data = Female_rates %>% filter(Year == 2016), 
  aes(x = reorder(University, -Total_rate), y = Asst_Prof_rate)) +
  theme_solarized_2(light=FALSE) +scale_color_solarized("red") +
  theme(axis.text.x =element_blank() ) +
  geom_bar(stat="identity",position = "stack")+
  labs(x= "Universities(From Most \n Diverse to Least)",
  y="Female Associate Proffessor Rate", title="2016" )+
  expand_limits(y=1)
asstprof2015 = ggplot(data = Female_rates %>% filter(Year == 2015), 
  aes(x = reorder(University, -Total_rate), y = Asst_Prof_rate)) +
  theme_solarized_2(light=FALSE) +scale_color_solarized("red") +
  theme(axis.text.x =element_blank() ) +labs(x= "", y="", title=2015)+
  geom_bar(stat="identity",position = "stack")+labs(x= "\n", y="") +
  expand_limits(y=1)
asstprof2014 = ggplot(data = Female_rates %>% filter(Year == 2014), 
  aes(x = reorder(University, -Total_rate), y = Asst_Prof_rate)) +
  theme_solarized_2(light=FALSE) +scale_color_solarized("red") +
  theme(axis.text.x =element_blank() ) +
  geom_bar(stat="identity",position = "stack")+labs(x= "\n", y="", title=2014) +
  expand_limits(y=1)
multiplot(asstprof2016, asstprof2015, asstprof2014, cols = 3 )

For assistant proffessorship diversity trend is not changed but over all representation rates become higher. The slope become slightly milded and thus the tails become mildly fatter. So, as a female assistant proffessor it is more probable to exist in even less diverse universite than the higher academic ranks. Accross the years 2016 is the most promissing for the women in science.

Female assistant proffessor rates are also plotted against universities from most gender diverse to least.

RA2016 = ggplot(data = Female_rates %>% filter(Year == 2016), 
  aes(x = reorder(University, -Total_rate), y = Research_Asst_rate)) +
  theme_solarized_2(light=FALSE) +scale_color_solarized("red") +
  theme(axis.text.x =element_blank() ) +
  geom_bar(stat="identity",position = "stack")+
  labs(x= "Universities(From Most \n Diverse to Least)",
  y="Female Research Assistant Rate", title="2016" )
RA2015 = ggplot(data = Female_rates %>% filter(Year == 2015), 
  aes(x = reorder(University, -Total_rate), y = Research_Asst_rate)) +
  theme_solarized_2(light=FALSE) +scale_color_solarized("red") +
  theme(axis.text.x =element_blank() ) +labs(x= "", y="", title=2015)+
  geom_bar(stat="identity",position = "stack")+labs(x= "\n", y="") +
  expand_limits(y=1)
RA2014 = ggplot(data = Female_rates %>% filter(Year == 2014), 
  aes(x = reorder(University, -Total_rate), y = Research_Asst_rate)) +
  theme_solarized_2(light=FALSE) +scale_color_solarized("red") +
  theme(axis.text.x =element_blank() ) +
  geom_bar(stat="identity",position = "stack")+labs(x= "\n", y="", title=2014) +
  expand_limits(y=1)
multiplot(RA2016, RA2015, RA2014, cols = 3 )

In the plots above there is an significant improvement for the case of female research assistants, the rates are increased. The fluctuations in the data become more obvious. We can calmly state that even less diverse universities are happy to wellcome female RAs or female RAs might choose to study at less diverse universities.

Instructor rates are plotted against from most diverse universities to least.

ins2016 = ggplot(data = Female_rates %>% filter(Year == 2016), 
  aes(x = reorder(University, -Total_rate), y = Instructor_rate)) +
  theme_solarized_2(light=FALSE) +scale_color_solarized("red") +
  theme(axis.text.x =element_blank() ) +
  geom_bar(stat="identity",position = "stack")+
  labs(x= "Universities(From Most \n Diverse to Least)",
  y="Female Instructor Rate", title="2016" )
ins2015 = ggplot(data = Female_rates %>% filter(Year == 2015), 
  aes(x = reorder(University, -Total_rate), y = Instructor_rate)) +
  theme_solarized_2(light=FALSE) +scale_color_solarized("red") +
  theme(axis.text.x =element_blank() ) +labs(x= "", y="", title=2015)+
  geom_bar(stat="identity",position = "stack")+labs(x= "\n", y="") +
  expand_limits(y=1)
ins2014 = ggplot(data = Female_rates %>% filter(Year == 2014), 
  aes(x = reorder(University, -Total_rate), y = Instructor_rate)) +
  theme_solarized_2(light=FALSE) +scale_color_solarized("red") +
  theme(axis.text.x =element_blank() ) +
  geom_bar(stat="identity",position = "stack")+labs(x= "\n", y="", title=2014) +
  expand_limits(y=1)
multiplot(ins2016, ins2015, ins2014, cols = 3 )

Instructor rates behaves like RA rates in trend but with a slimmer tails. This is interesting for me becuse I expect to see women in lower ranks more probable. So as an female RA one has more chance to exist in a least gender diverse university. This might because instructor positions are permanent but RAs depends on yearly contracts (in state universities, it is a scholarship position for private universities). So least diverse universites could be calmly hire female RAs but not instructors. When they do this they donot behave other than their natural policy.

Female language instructor rates are plotted against universities ordered by gender diversity.

langIns2016 = ggplot(data = Female_rates %>% filter(Year == 2016), 
  aes(x = reorder(University, -Total_rate), y = Language_Ins_rate)) +
  theme_solarized_2(light=FALSE) +scale_color_solarized("red") +
  theme(axis.text.x =element_blank() ) +
  geom_bar(stat="identity",position = "stack")+
  labs(x= "Universities(From Most \n Diverse to Least)",
  y="Female Language Instructor Rate", title="2016" )
langIns2015 = ggplot(data = Female_rates %>% filter(Year == 2015), 
  aes(x = reorder(University, -Total_rate), y = Language_Ins_rate)) +
  theme_solarized_2(light=FALSE) +scale_color_solarized("red") +
  theme(axis.text.x =element_blank() ) +labs(x= "", y="", title=2015)+
  geom_bar(stat="identity",position = "stack")+labs(x= "\n", y="") +
  expand_limits(y=1)
langIns2014 = ggplot(data = Female_rates %>% filter(Year == 2014), 
  aes(x = reorder(University, -Total_rate), y = Language_Ins_rate)) +
  theme_solarized_2(light=FALSE) +scale_color_solarized("red") +
  theme(axis.text.x =element_blank() ) +
  geom_bar(stat="identity",position = "stack")+labs(x= "\n", y="", title=2014) +
  expand_limits(y=1) 
multiplot(langIns2016, langIns2015, langIns2014, cols = 3 )

As a language instructor there is more chance for a woman to find a position at the universities. This might be because humaties and language departments are mostly populated by women. I guess this is not a result of university policy women choose to be more existent in such fields. Besides general trend do not change over the years.

In conclusion women are represented lower in higher academic ranks. I am stopping here but for further analysis regression line slopes could be compared. Furthermore, a thereshold representation rate can be set and one can be focus on the universities that lies above that measure. I think that the outlier universities might be interesting.

BDA 505 Final Exam, MEF University

Duygu Can

Part I: Short and Simple

Part II: Extending Your Group Project (30 pts)

K-Means model:

Part III: Welcome to Real Life (50 pts)

References