We start by loading required libraries for this study.
library(tidyverse)
library(ggplot2)
library(ggcorrplot)
library(ggthemes)
library(formattable)
library(htmlwidgets)
library(ggalt)
library(party)
library(rpart)
library(rpart.plot)
library(pROC)
The dataset we have choosen is about human resources. Our aim is to answer an interesting question of a company such as “Why are our best and most experienced employees leaving prematurely?” We will try to find an answer to this question by analyzing answers of the employees to the job satisfaction survey and their work related records. The dataset is formed by the Human Resources (HR) department after conducting a survey on their employees. In this study we first run an Explanatory Data Analysis (EDA) on tha data to make it more meaningfull, then we applied Principal Component Analysis (PCA) to understand the factors that cause employess to leave. With the help of k-Means clustering we obtained three different groups of employees that left with similar reasons and then employed a decision three analysis to predict the employees that will possibly leave the company in the future.
We used Human Resources Analytics Data from kaggle. This HR data set is obtained from the results of a satisfaction survey the company has carried out on their employees in combination with other HR related records. It consists of 14999 rows and 10 columns. Each row is dedicated for a different employee. Out of 10, 8 columns are in numeric type, while the remaining 2 are in numeric values. Below you can find columns and their explanations, respectively.
1st Column: Satisfaction level
2nd Column: Last evaluation score
3rd Column: Number of projects worked on (yearly basis)
4th Column: Average monthly working hours
5th Column: Time spent in the company (in years)
6th Column: Whether they have had a work accident in the last 2 years
7th Column: Whether they have had a promotion in the last 5 years
8th Column: Departments
9th Column: Salary
10th Column: Whether the employee has left
All the data collected is from last 5 years whereas accident data belongs to past 2 years. This HR database does not take into account the employees that have been fired, transferred or hired in the past year. Our objective is to make predictions about the probabilities that employees may leave their company and what to change to increase their satisfaction levels. We will try to give insights to make best employees more loyal.
Then we read dataset from csv file and had a quick look at its form.
d=read.csv("HR_comma_sep.csv")
d<- d %>% rename("departments" = "sales") %>% tbl_df()
glimpse(d)
## Observations: 14,999
## Variables: 10
## $ satisfaction_level <dbl> 0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.10...
## $ last_evaluation <dbl> 0.53, 0.86, 0.88, 0.87, 0.52, 0.50, 0.77...
## $ number_project <int> 2, 5, 7, 5, 2, 2, 6, 5, 5, 2, 2, 6, 4, 2...
## $ average_montly_hours <int> 157, 262, 272, 223, 159, 153, 247, 259, ...
## $ time_spend_company <int> 3, 6, 4, 5, 3, 3, 4, 5, 5, 3, 3, 4, 5, 3...
## $ Work_accident <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ left <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ promotion_last_5years <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ departments <fctr> sales, sales, sales, sales, sales, sale...
## $ salary <fctr> low, medium, medium, low, low, low, low...
As you can see we have 14999 observations for job satisfaction level, latest evaluation (yearly), number of projects worked on, average monthly hours, time spend in the company (in years), work accident (within the past 2 years), promotion within the past 5 years, department and salary. All of them are numeric values except last two columns.
We also checked whether there is any NA or NaN values.
which(is.na.data.frame(d))
## integer(0)
Since there aren’t any NA/NaN values, no need to omit or complete them. Thus we proceed summarizing the data.
summary(d)
## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.0900 Min. :0.3600 Min. :2.000 Min. : 96.0
## 1st Qu.:0.4400 1st Qu.:0.5600 1st Qu.:3.000 1st Qu.:156.0
## Median :0.6400 Median :0.7200 Median :4.000 Median :200.0
## Mean :0.6128 Mean :0.7161 Mean :3.803 Mean :201.1
## 3rd Qu.:0.8200 3rd Qu.:0.8700 3rd Qu.:5.000 3rd Qu.:245.0
## Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :310.0
##
## time_spend_company Work_accident left
## Min. : 2.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 3.000 Median :0.0000 Median :0.0000
## Mean : 3.498 Mean :0.1446 Mean :0.2381
## 3rd Qu.: 4.000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :10.000 Max. :1.0000 Max. :1.0000
##
## promotion_last_5years departments salary
## Min. :0.00000 sales :4140 high :1237
## 1st Qu.:0.00000 technical :2720 low :7316
## Median :0.00000 support :2229 medium:6446
## Mean :0.02127 IT :1227
## 3rd Qu.:0.00000 product_mng: 902
## Max. :1.00000 marketing : 858
## (Other) :2923
On average, the employees are 64% satisfied with their jobs; they have mean performance score of 0.71 (out of 1); an ordinary employee worked roughly on 3 - 4 different projects per year; it takes them 201.1 hours to complete their jobs per month; an avarage employee worked almost 3.5 years in the company (min = 2, max = 10); attrition rate is 23.81%; unfortunately only 2.1% of employees were promoted in the last 5 years. 8.24% of the employees are highly paid, 48.78% is lowly paid, 42.98% gets medium wage.
Here the employee’s salary ranges can be seen in more detail:
d %>%
count(salary) %>%
formattable(align = 'l')
salary | n |
---|---|
high | 1237 |
low | 7316 |
medium | 6446 |
We can see clearly that the range of salaries are not well balanced. Only 8% of the employees are paid highly, while 48% of them are in the low salary range.
Departmentwise employee count is as follows:
d %>%
count(departments) %>%
formattable(align = 'l')
departments | n |
---|---|
accounting | 767 |
hr | 739 |
IT | 1227 |
management | 630 |
marketing | 858 |
product_mng | 902 |
RandD | 787 |
sales | 4140 |
support | 2229 |
technical | 2720 |
As you can see Sales Department is the largest and Management is the smallest in size.
Departmentwise salary frequencies are shown below:
d %>%
count(departments, salary) %>% group_by(departments) %>%
mutate(
salary = ordered(salary, c("low", "medium", "high")),
n = percent(n, 0),
n = n/sum(n)
) %>%
spread(salary, n) %>%
formattable(list(area(T, 2:4) ~ color_tile("grey", "pink")), align = 'l')
departments | low | medium | high |
---|---|---|---|
accounting | 0.46675359 | 0.43676662 | 0.09647979 |
hr | 0.45331529 | 0.48579161 | 0.06089310 |
IT | 0.49633252 | 0.43602282 | 0.06764466 |
management | 0.28571429 | 0.35714286 | 0.35714286 |
marketing | 0.46853147 | 0.43822844 | 0.09324009 |
product_mng | 0.50000000 | 0.42461197 | 0.07538803 |
RandD | 0.46251588 | 0.47268107 | 0.06480305 |
sales | 0.50700483 | 0.42801932 | 0.06497585 |
support | 0.51413190 | 0.42261104 | 0.06325707 |
technical | 0.50441176 | 0.42169118 | 0.07389706 |
As it is better visualised in the table above, only Management Department is well balanced in terms of salary. Other than that the rest of the departments are paid in the lower and medium wage nearly equally. Since nearly all of the employess work for low and medium wage, salary alone cannot be a dominant factor for the decision of quiting.
Now, having known all the column names and their related statistics, we wanted to visualize the distribution of the numerical ones, just to get better understanding for further interpretations. For this reason we employed histogram plots. Notice that bin sizes are adapted with respect to varying ranges of the data.
p1 <- ggplot(subset(d ), aes(x=satisfaction_level, colour=satisfaction_level)) +
theme_economist() +
scale_color_economist()+
theme(plot.title = element_text(size=12)) +
geom_histogram(binwidth = 0.05,color = 'black', fill = '#999999') +
ggtitle("Satisfaction Level") +
labs(x="Job Satis. Score", y="Employee Count", size = 8)
p2 <- ggplot(subset(d ), aes(x=last_evaluation, colour=last_evaluation)) +
theme_economist() +
scale_color_economist()+
theme(plot.title = element_text(size=12)) +
geom_histogram(binwidth = 0.05,color = 'black', fill = '#009E73') +
ggtitle("Last Evaluation") +
labs(x="Performance Score", y="Employee Count", size = 8)
p3 <- ggplot(subset(d ), aes(x=number_project, colour=number_project)) +
theme_economist() +
scale_color_economist()+
theme(plot.title = element_text(size=12)) +
geom_histogram(binwidth = 1,color = 'black', fill = '#56B4E9') +
ggtitle("Yearly Projects") +
labs(x="Number of Projects", y="Employee Count", size = 8)
p4 <- ggplot(subset(d ), aes(x=average_montly_hours, colour=average_montly_hours)) +
theme_economist() +
scale_color_economist()+
theme(plot.title = element_text(size=12)) +
geom_histogram(binwidth = 10,color = 'black', fill = '#E69F00') +
ggtitle("Average Monthly Hours")+
labs(x="Monthly Hours", y="Employee Count", size = 8)
p5 <- ggplot(subset(d ), aes(x=time_spend_company, colour=time_spend_company)) +
theme_economist() +
scale_color_economist()+
theme(plot.title = element_text(size=12)) +
geom_histogram(binwidth = 0.5,color = 'black', fill = '#F0E442') +
ggtitle("Time Spent in Company") +
labs(x="Time (Years)", y="Employee Count", size = 8)
p6 <- ggplot(subset(d ), aes(x=Work_accident, colour=Work_accident)) +
theme_economist() +
scale_color_economist()+
theme(plot.title = element_text(size=12)) +
geom_bar(color = 'black', fill = '#0072B2')+ ## +
ggtitle("Work Accidents") +
labs(x="Accident Count", y="Employee Count", size = 8) +
scale_x_continuous(breaks = c(0,1,1))
p7 <- ggplot(subset(d ), aes(x=left, colour=left)) +
theme_economist() +
scale_color_economist()+
theme(plot.title = element_text(size=12)) +
geom_bar(color = 'black', fill = '#D55E00') +
ggtitle("Employees Left") +
labs(x="Left Count", y="Employee Count", size = 8) +
scale_x_continuous(breaks = c(0,1,1))
p8 <- ggplot(subset(d), aes(x=promotion_last_5years, colour=promotion_last_5years)) +
theme_economist() +
scale_color_economist()+
theme(plot.title = element_text(size=12)) +
scale_x_continuous(breaks = c(0,1,1)) +
geom_bar(color = 'black', fill = '#CC79A7') +
ggtitle("Promotion in 5 Years") +
labs(x="Promotion Count", y="Employee Count", size = 8)
multiplot(p1,p2,p3,p4,p5,p6,p7,p8,cols=3)
According to above histograms, we found that; there was nearly no promoted employee in the last five years. Bear in mind that the work accident data belongs to last two years and frequency of accidents is low but not zero. Monthly time spent in the company and number of projects which each employee works on both have positively skewed distributions. So, instead of mean, median value must be considered. Thus, a common employee works 4 projects per year, spending 200 hours per month. Average monthly hours and last evauation score histograms are bimodal. There are two main group of workers in this company: one works for short (~150 h) on works for very long (~260 h) with medium (~0.55) and high evaluation scores (~0.85).
Let’s have a look at departmentwise job satisfaction level of the employees.
dep_satis <- d %>%
group_by(departments) %>%
summarise (Satisfaction =mean(satisfaction_level))
ggplot(data=dep_satis, aes(x=departments, y=Satisfaction)) +
theme_economist() +
scale_color_economist()+
theme(axis.text.x = element_text(angle = 90, hjust=0.5),axis.text = element_text(size=10) )+
geom_bar(stat="identity", position=position_dodge(), fill="pink", colour="black")+
labs(x= "Departments", y="Job Satisfaction Level")
It appears that the level of satisfaction is pretty much the same for all the departments except for some unhappy folks in the accounting department. While rest of the departments average satisfaction value is around 60% (which complies with overall satisfaction rate of 61.28%), accounting department’s mean is below the company’s mean.
depLeftSatis <- d %>% group_by(departments) %>% summarise(Total =n(),LeftCount = sum(left), Satisfaction = mean(satisfaction_level) ) %>% mutate(LeftRate = LeftCount/Total)
ggplot(data = depLeftSatis, aes( x = reorder(departments, -LeftRate), y = LeftRate, fill = Satisfaction)) +
theme_economist() +
scale_color_economist()+
theme(axis.text.x = element_text(angle = 90, hjust=0.5))+
geom_col()+labs(x= "Departments", y="Left Rate")
Among all the departments, HR is the one with highest attrition rate, then follows Accounting. The top three departments are also the most unsatisfied ones in which Accounting peaks in unhappiness. From the graph above we deduced that Satisfaction Level is a pretty important factor in quitting the company.
We also replotted the graph above by filling each bar according to the salary distribution of the department of interest. The salary imbalance within the departments as well as the whole company is apperent again.
depLeftSatis <- d %>% group_by(departments,salary) %>% summarise(Total =n(),LeftCount = sum(left), Satisfaction = mean(satisfaction_level) ) %>% mutate(LeftRate = LeftCount/Total)
ggplot(data = depLeftSatis, aes( x = reorder(departments, -LeftRate), y = LeftRate, fill=salary)) +
theme_economist() +
scale_color_economist()+
theme(axis.text.x = element_text(angle = 90, hjust=0.5)) +
geom_bar(stat="identity",position = "stack")+labs(x= "Departments", y="Left Rate")
Notice that altough job satisfaction level is uniformly distributed among the departments, the left rate is not, it is maximized at the HR department. So there must be other factors that effect employee’s decision to sign-out, such as long working hours.
Now, let’s look at the categorical variables; salary and departments. We wanted to compare the resigned and the remained employees by departments and salary categories. Bear in mind that in the graphs below the left ratio is not plotted but total count is. The departments with the highest left ratio are HR, Accounting and Technical departments. We already noticed that salary category did not give much insight about the characteristics of people left. However, if one is paid high in this company he/she is unlikely to resing.
p9 <- ggplot(d, aes(x=d$departments,fill=as.character(d$left)))+
theme_economist() +
scale_color_economist()+
theme(axis.text.x = element_text(angle = 90, hjust=0))+
geom_bar() +
labs(x="Departments", y="Employee Count", fill="Left or Not")
p10 <- ggplot(d, aes(x=d$salary,fill=as.character(d$left) ))+
theme_economist() +
scale_color_economist()+
theme(axis.text.x = element_text(angle = 90, hjust=0))+
geom_bar() +
labs(x="Salary", y ="Employee Count",fill="Left or Not")
multiplot(p9,p10,cols=2)
When we checked the correlation matrix of the whole company, we see that left rate is negatively correlated with job satisfaction level with a coefficient of -0.39. Interestingly people who left seem the ones with the high satisfaction level. However, we need to dive deep into tha data to claim such a thing. Other results that we deduced from the correlation matrix are average monthly working hours are positively correlated with the number of projects that an employee works on per year with a coefficient of 0.42 and the last evaluation score is positively correlated with number of prokect with a coefficient of 0.35. It is obvious that employees who work on more project needs to stay in the office for longer hours. This company’s evaluation system depends on hard work. Unfortunately all the effor that employees put on their work does not return them as salary increase or promotion. May be that is why they leave!
CorrMat <- round(cor(d[,0:8]),3)
ggcorrplot(CorrMat, hc.order = TRUE,
#type = "lower",
lab = TRUE,
lab_size = 3,
method="circle",
colors = c("tomato2", "white", "springgreen3"),
title="Correlogram of Numeric Values",
ggtheme=theme_economist)
Maybe it is better to focus on the people who quit, already. For this reason we filtered our data.
dLeft <- d %>% filter(left==1) %>% select(-left)
summary(dLeft)
## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.0900 Min. :0.4500 Min. :2.000 Min. :126.0
## 1st Qu.:0.1300 1st Qu.:0.5200 1st Qu.:2.000 1st Qu.:146.0
## Median :0.4100 Median :0.7900 Median :4.000 Median :224.0
## Mean :0.4401 Mean :0.7181 Mean :3.856 Mean :207.4
## 3rd Qu.:0.7300 3rd Qu.:0.9000 3rd Qu.:6.000 3rd Qu.:262.0
## Max. :0.9200 Max. :1.0000 Max. :7.000 Max. :310.0
##
## time_spend_company Work_accident promotion_last_5years
## Min. :2.000 Min. :0.00000 Min. :0.000000
## 1st Qu.:3.000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :4.000 Median :0.00000 Median :0.000000
## Mean :3.877 Mean :0.04733 Mean :0.005321
## 3rd Qu.:5.000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :6.000 Max. :1.00000 Max. :1.000000
##
## departments salary
## sales :1014 high : 82
## technical : 697 low :2172
## support : 555 medium:1317
## IT : 273
## hr : 215
## accounting: 204
## (Other) : 613
We pairwise plotted job satisfaction level versus average montly hours and also satisfaction level versus time spent in the company. Meanwhile, we also highlighted employees left the company.
#Define three main groups which left the company
d$left <- factor(d$left, labels=c("Working", "Quitted"))
gg <- ggplot(d,aes(satisfaction_level,average_montly_hours)) + theme_economist() + scale_color_economist()
gg <- gg +geom_point(aes(col=factor(d$left)))
gg + geom_encircle(data=subset(d,
satisfaction_level>=0.05 &
satisfaction_level<0.12 &
average_montly_hours>240 &
average_montly_hours<=310), colour="blue",expand=0.05,spread=0.02,size=2) +
geom_encircle(data=subset(d,
satisfaction_level>=0.35 &
satisfaction_level<0.47 &
average_montly_hours>122 &
average_montly_hours<164), colour="blue", expand=0,spread=0.02,size=2) +
geom_encircle(data=subset(d,
satisfaction_level>=0.47 &
satisfaction_level<=1 &
average_montly_hours>129 &
average_montly_hours<279), colour="green", expand=0, spread=0.02,size=2) +
geom_encircle(data=subset(d,
satisfaction_level>=0.715 &
satisfaction_level<=0.9 &
average_montly_hours>215 &
average_montly_hours<278), colour="blue", expand=0, spread=0.02,size=2) +
scale_x_continuous(breaks=seq(0.08, 1, 0.08), limits = c(0.08, 1)) +
scale_y_continuous(breaks=seq(100, 315, 30), limits = c(100, 315)) +
labs(x="Satisfaction Level",y="Avg. Monthly Hours", title="Satisfaction Level vs. Average Monthly Hours",col="Left Work")
## Warning: Removed 54 rows containing missing values (geom_point).
From the satisfaction vs. monthly hours plot, three seperate clusters are clearly seen with following properties:
It is no surprise that employees belonging to the first group quits the job. Furthermore, it would not be wrong to call the second group as unmotivated and the last group as motivated. It is very surprising to see motivated employees with such a high satisfaction level (>0.8) and long working hours leave the company. Those ones also must be the ones who works on more projects than the average, since average monthly hours highly correlates with number of projects.
We are also curious whether the time spent in the company effect employee’s resignation. When we plotted time spent with respect to satisfaction level.
gg <- ggplot(d,aes(satisfaction_level,time_spend_company))
gg <- gg + geom_point(aes(col=factor(d$left))) + theme_economist() + scale_color_economist()
gg + geom_encircle(data=subset(d,
satisfaction_level>=0.72 &
satisfaction_level<=0.92 &
time_spend_company>4 &
time_spend_company<6.02), colour="blue", expand=0.02,spread=0.02,size=2)+
geom_encircle(data=subset(d,
satisfaction_level>=0.36 &
satisfaction_level<=0.47 &
time_spend_company==3 ), colour="blue", expand=0.02,spread=0.02,size=2)+
geom_encircle(data=subset(d,
satisfaction_level>=0.08 &
satisfaction_level<=0.12 &
time_spend_company>=3 &
time_spend_company<=5), colour="blue", expand=0.02,spread=0.02,size=2)+
scale_x_continuous(breaks=seq(0.08, 1, 0.08), limits = c(0.08, 1)) +
scale_y_continuous(breaks = seq(0,10,by=2), limits = c(0,10))+
labs(x="Satisfaction Level",y="Time Spend In Company (Years)", title="Satisfaction Level vs. Time Spent In Company (Years)",col="Left Work")
In this new plot wee found the traces of previous groups above. The ones who left localized in the same satisfaction level ranges. So, we could attribute time spent to define the three groups above. For example, it is seen that the motivated last group leaves the company between four and six years. Their high motivation let them to keep the job for longer periods. May be they did not get the promotion or salary increase they expected through out the years and they leave at the end. Similarly, When we evaluated the other two groups in the last two graphs together, new characteristics of the groups become:
Interestinly, people who passes years thereshold, does not tend to leave the company later. This must be because of the people’s reluctancy to go out of their comfort zone. Altough their salary and satisfaction level is low or at most mediocre, they still want to swim in the sea they know best.
To understand factors which cause resignation decision better, we also plotted the correlation matrix for the numerical variables.
CorrMat <- round(cor(dLeft[,0:7]),3)
ggcorrplot(CorrMat, hc.order = TRUE,
#type = "lower",
lab = TRUE,
lab_size = 3,
method="circle",
colors = c("tomato2", "white", "springgreen3"),
title="Correlogram of Numeric Values",
ggtheme=theme_economist)
Correlogram created from numerical values in our dataset shows that Average Monthly Hours is highly correlated with Number Of Projects as expected. The ones who works on more than avarage (3) projects, should spend more time in the company. And also last evaluation score correlates highly with number of projects and average monthly hours. High evaluation score is given to people who works on more projects by spending more time in the company.
With the insight gained from the correlation matrix of the employees who are already left. Before starting to principal component analysis we need to transform salaries to numeric values to see their effect on left rate. We need also to scale our data to avoid the dominance of the columns with high variance.
## Warning: package 'devtools' was built under R version 3.4.3
## Skipping install of 'ggbiplot' from a github remote, the SHA1 (7325e880) has not changed since last install.
## Use `force = TRUE` to force installation
## Loading required package: plyr
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following object is masked from 'package:modeltools':
##
## empty
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:purrr':
##
## compact
## Loading required package: scales
## Warning: package 'scales' was built under R version 3.4.3
##
## Attaching package: 'scales'
## The following objects are masked from 'package:formattable':
##
## comma, percent, scientific
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
names(dLeft)
## [1] "satisfaction_level" "last_evaluation" "number_project"
## [4] "average_montly_hours" "time_spend_company" "Work_accident"
## [7] "promotion_last_5years" "departments" "salary"
dLeft_pca <- dLeft %>% mutate(salary_Num = ifelse(salary == "low", 0, ifelse(salary == "medium", 1, 2))) %>% select(-salary,-departments)
pca_w_scaling<-princomp(as.matrix(dLeft_pca[,1:8]),cor=TRUE)
summary(pca_w_scaling,loadings=TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.8162209 1.1428322 1.0272619 1.0030200 0.9672535
## Proportion of Variance 0.4123323 0.1632582 0.1319084 0.1257561 0.1169474
## Cumulative Proportion 0.4123323 0.5755905 0.7074989 0.8332550 0.9502024
## Comp.6 Comp.7 Comp.8
## Standard deviation 0.41439073 0.36124911 0.31009692
## Proportion of Variance 0.02146496 0.01631261 0.01202001
## Cumulative Proportion 0.97166737 0.98797999 1.00000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## satisfaction_level 0.852 0.352 -0.147
## last_evaluation -0.522 0.433 0.584
## number_project -0.493 -0.317 0.199
## average_montly_hours -0.510 -0.194 0.253 -0.773
## time_spend_company -0.468 0.359 -0.790
## Work_accident 0.686 0.326 -0.649
## promotion_last_5years 0.712 -0.129 0.689
## salary_Num -0.140 0.936 0.320
## Comp.8
## satisfaction_level 0.350
## last_evaluation -0.442
## number_project 0.784
## average_montly_hours -0.199
## time_spend_company -0.166
## Work_accident
## promotion_last_5years
## salary_Num
ggbiplot(pca_w_scaling, labels= dLeft$satisfaction_level) +theme_economist()+ scale_color_economist()
The first four components explains %83 of the variance. With five components even 95% of the variance can be explained. Components that explain lower than 10% of the variance are not considered because they do not contribute much. When we look at the loadings, we understan that PC1 is composed of last evaluation, number of projects, monthly hours and time spent columns; PC2 is composed of satisfaction level, number of projects, monthly hours and time spent columns; PC3 and PC4 are composed of accident, salary and promotion columns. PCA is usefull for dimensionality reduction and we are glad to know that HR data can be explained in number of components lesser than total number of columns.
Principal components (eigenvectors) are linear combinations of actual factors (loadings) with constant coefficients (eigenvalues). By definition they are orthogonal (perpendicular) to eachother. First two eigenvectors maximizes the variance most (explains better). So, when we plot each factor scores in PC1-PC2 plane, we highlighted groups of homogenous individiuals. In the biplot above each point is labelled with satisfaction score, so we can conclude that people left are groupped according to their satisfaction value: very low, medium, high. We believe that groups deduced from PCA matches groups identified from the scatter plots previously.
Also cumulative variance plot is show below:
ggplot(data.frame(pc=1:8,cum_var=c(0.4123323, 0.5755905, 0.7074989, 0.8332550,0.9502024,0.97166737,0.98797999,1.00000000)),aes(x=pc,y=cum_var)) +
theme_economist() +
scale_color_economist() +
geom_point() +
geom_line() +
labs(x="Principal Component Number",
y="Cumulative Proportion",
title="Cumulative Proportion of the Var. Explained by PCs")
We tried K-Means clustering to be able to cluster mass of people who left the company. We applied k-Means with 3 clusters and 30 iterations since we expect to find three groups.
#Filter to get the data from left people only
dleft<- subset(d, left=='Quitted')
dleft<- select(dleft,-left)
#apply k-means with 3 clusters and 30 iterations
set.seed(42)
dleft_Cluster <- kmeans(dleft[, 1:7], 3, nstart = 30)
dleft_Cluster$centers
## satisfaction_level last_evaluation number_project average_montly_hours
## 1 0.6100000 0.8891911 4.936889 243.6409
## 2 0.4159149 0.5306140 2.178723 144.8322
## 3 0.2511361 0.8628964 5.780275 285.0799
## time_spend_company Work_accident promotion_last_5years
## 1 4.778667 0.05333333 0.0008888889
## 2 3.071733 0.04741641 0.0091185410
## 3 4.262172 0.03870162 0.0037453184
#print(sum(dleft_Cluster$withinss))
When we compare the center values of three clusters with our previously defined groups our scatter plots (Satisfaction Level vs. Avg. Monthly Hours, Satisfaction Level vs. Time Spend In Company), we are able to match them.
Thus, one can conclude that leaving reasons depends on the employee’s characteristics. Ones with medium satisfaction level, do not dedicate to work very much may be because they do not see any future in the company and quit early. Self-motivated ones work patiently long hours for long years but at the end they quit. Even low satisfied ones could work for very long hours they quit in medium terms.
We first set seed to a constant value.
set.seed(42)
Then left column is chosen as response and the rest are chosen as predictors. We put them in a data frame.
d <- d %>% # read in the data
select(left, departments, satisfaction_level,
last_evaluation, number_project, average_montly_hours,
time_spend_company, promotion_last_5years, salary) %>%
mutate(departments = factor(departments),
salary = factor(salary))
head(d)
## # A tibble: 6 x 9
## left departments satisfaction_level last_evaluation number_project
## <fctr> <fctr> <dbl> <dbl> <int>
## 1 Quitted sales 0.38 0.53 2
## 2 Quitted sales 0.80 0.86 5
## 3 Quitted sales 0.11 0.88 7
## 4 Quitted sales 0.72 0.87 5
## 5 Quitted sales 0.37 0.52 2
## 6 Quitted sales 0.41 0.50 2
## # ... with 4 more variables: average_montly_hours <int>,
## # time_spend_company <int>, promotion_last_5years <int>, salary <fctr>
Split the data in train and test
n <- nrow(d)
indices <- sample(n, n*0.7)
train_d <- d[indices, ]
test_d <- d[-indices, ]
head(train_d)
## # A tibble: 6 x 9
## left departments satisfaction_level last_evaluation number_project
## <fctr> <fctr> <dbl> <dbl> <int>
## 1 Working support 0.47 0.43 4
## 2 Working management 0.62 0.61 3
## 3 Working sales 0.90 0.88 4
## 4 Quitted RandD 0.37 0.54 2
## 5 Working technical 0.80 0.70 3
## 6 Working RandD 0.88 0.92 3
## # ... with 4 more variables: average_montly_hours <int>,
## # time_spend_company <int>, promotion_last_5years <int>, salary <fctr>
We fit the model
rtree_fit <- rpart(left ~ ., data = train_d)
rpart.plot(rtree_fit)
To test the fit quality, we applied our model with fitted parameters to test data.
res <- predict(rtree_fit, test_d)
auc(as.numeric(test_d$left) - 1, res[, 2])
## Area under the curve: 0.9768
Prediction success of 97% is very good. We can comfortably say that primary factor that determines an employee’s resignation is satisfaction level. If it is lower than 46%, people tend to quit. Those unhappy guys are divided into two categories:
If he/she works on many projects (more than 2.5 on yearly basis), they still leave if they are desperately unsatisfied with the job (threshold level = 0.11). Those must belong to Cluster 2 (i.e Group 1) because they have such a low satisfaction value but still work on more projects. Since number of projects correlates with average working hours, they must work for long hours to complete the projects.
If he/she works on few projects (less than 2.5 on yearly basis), they still leave even if their bosses are happy about their work (evaluation score > 0.56). Much probably they fall into Cluster 3 (i.e Group 2) because they are not very much satisfied with the job and the number of projects they work on is not much. So they do not commit long hours to work as those folks in the last cluster.
Secondary factor that determines the satisfied employees quit decision is time spent in the company. If someone is working in the same company between 4.5 - 6.5 years, they break their comfort zone and flee because they work for long hours (> 216 h monthly basis) even if their bosses are very happy with their work (evaluation score > 0.80). We belive that this category corresponds to patient, hardworking, motivated employees of Cluster 1 (i.e. Group 3).
Our EDA is in good compliance with the results deduced from K-Means clustering. We are able to match our previously found groups with three clusters. With help of PCA we ensured that employees quitted falls into three subgroups with differing satisfaction scores. Furthermore Decision Tree Analysis supports our findings. As a conlusion, we find that the satisfaction level is the key factor to quit the job.
The company of interest is not paying much to its employees or promotes. Low and medium salary ranges are the highest portion of their pay check. Thus, since everbody is equal in the minimum, salary and promotion are not that important as satisfaction level. If employee’s satisfaction level is higher than 0.46, he/she will not quit in a short period. However, when the satisfaction level decreases, other factors will become important for the resignation decison such as number of projects or working hours.
This company has three different categories of employees left. First group is working very hard with low satisfaction level, it is expected that this group members quit the job. Second group has low(close to medium) satisfaction level and average monthly hours but they leave, for this group time spend company is the decision factor. People are leaving before 4,5 years from company. Most probably, they leave the company earlier if there is no uptunn in their working conditions.
The most interesting group is the last one, those of experienced ones who tend to work for long hours with high satisfaction score. For instance if you work in the company less than 4.5 years with higher evaluation score above 0.8 and working more than 216 hours in monthly, you are very close to leave the company. It is obvious that this group likes to work for long hours due to their intrinsic high motivation but they leave the company around 4-6 years. They must be getting better offers. If you a have passion for your work and feeling that you are not paid what you deserve, surely you leave the company.