Explanatory Data Analysis on HR Data

Preparation

We start by loading required libraries for this study.

library(tidyverse)
library(ggplot2)
library(ggcorrplot)
library(ggthemes)
library(formattable)
library(htmlwidgets)
library(ggalt)
library(party)
library(rpart)
library(rpart.plot)
library(pROC)

Abstract

The dataset we have choosen is about human resources. Our aim is to answer an interesting question of a company such as “Why are our best and most experienced employees leaving prematurely?” We will try to find an answer to this question by analyzing answers of the employees to the job satisfaction survey and their work related records. The dataset is formed by the Human Resources (HR) department after conducting a survey on their employees. In this study we first run an Explanatory Data Analysis (EDA) on tha data to make it more meaningfull, then we applied Principal Component Analysis (PCA) to understand the factors that cause employess to leave. With the help of k-Means clustering we obtained three different groups of employees that left with similar reasons and then employed a decision three analysis to predict the employees that will possibly leave the company in the future.

Data Description

We used Human Resources Analytics Data from kaggle. This HR data set is obtained from the results of a satisfaction survey the company has carried out on their employees in combination with other HR related records. It consists of 14999 rows and 10 columns. Each row is dedicated for a different employee. Out of 10, 8 columns are in numeric type, while the remaining 2 are in numeric values. Below you can find columns and their explanations, respectively.

1st Column: Satisfaction level
2nd Column: Last evaluation score
3rd Column: Number of projects worked on (yearly basis)
4th Column: Average monthly working hours
5th Column: Time spent in the company (in years)
6th Column: Whether they have had a work accident in the last 2 years
7th Column: Whether they have had a promotion in the last 5 years
8th Column: Departments
9th Column: Salary
10th Column: Whether the employee has left

All the data collected is from last 5 years whereas accident data belongs to past 2 years. This HR database does not take into account the employees that have been fired, transferred or hired in the past year. Our objective is to make predictions about the probabilities that employees may leave their company and what to change to increase their satisfaction levels. We will try to give insights to make best employees more loyal.

Our tentative plan is as follows:

Explore the dimensions of variables
Clean the data if required
Summarise the data and search for correlations wherever possible
Visualize the filtered results
Make conclusions and give insights

Explanatory Data Analysis

Loading the Data

Then we read dataset from csv file and had a quick look at its form.

d=read.csv("HR_comma_sep.csv")
d<- d %>% rename("departments" = "sales") %>% tbl_df()
glimpse(d)

## Observations: 14,999
## Variables: 10
## $ satisfaction_level    <dbl> 0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.10...
## $ last_evaluation       <dbl> 0.53, 0.86, 0.88, 0.87, 0.52, 0.50, 0.77...
## $ number_project        <int> 2, 5, 7, 5, 2, 2, 6, 5, 5, 2, 2, 6, 4, 2...
## $ average_montly_hours  <int> 157, 262, 272, 223, 159, 153, 247, 259, ...
## $ time_spend_company    <int> 3, 6, 4, 5, 3, 3, 4, 5, 5, 3, 3, 4, 5, 3...
## $ Work_accident         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ left                  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ promotion_last_5years <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ departments           <fctr> sales, sales, sales, sales, sales, sale...
## $ salary                <fctr> low, medium, medium, low, low, low, low...

As you can see we have 14999 observations for job satisfaction level, latest evaluation (yearly), number of projects worked on, average monthly hours, time spend in the company (in years), work accident (within the past 2 years), promotion within the past 5 years, department and salary. All of them are numeric values except last two columns.

We also checked whether there is any NA or NaN values.

which(is.na.data.frame(d))

## integer(0)

Since there aren’t any NA/NaN values, no need to omit or complete them. Thus we proceed summarizing the data.

summary(d)

##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident         left       
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 3.000     Median :0.0000   Median :0.0000  
##  Mean   : 3.498     Mean   :0.1446   Mean   :0.2381  
##  3rd Qu.: 4.000     3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000  
##                                                      
##  promotion_last_5years      departments      salary    
##  Min.   :0.00000       sales      :4140   high  :1237  
##  1st Qu.:0.00000       technical  :2720   low   :7316  
##  Median :0.00000       support    :2229   medium:6446  
##  Mean   :0.02127       IT         :1227                
##  3rd Qu.:0.00000       product_mng: 902                
##  Max.   :1.00000       marketing  : 858                
##                        (Other)    :2923

On average, the employees are 64% satisfied with their jobs; they have mean performance score of 0.71 (out of 1); an ordinary employee worked roughly on 3 - 4 different projects per year; it takes them 201.1 hours to complete their jobs per month; an avarage employee worked almost 3.5 years in the company (min = 2, max = 10); attrition rate is 23.81%; unfortunately only 2.1% of employees were promoted in the last 5 years. 8.24% of the employees are highly paid, 48.78% is lowly paid, 42.98% gets medium wage.

Here the employee’s salary ranges can be seen in more detail:

d %>% 
  count(salary) %>% 
  formattable(align = 'l')

salary	n
high	1237
low	7316
medium	6446

We can see clearly that the range of salaries are not well balanced. Only 8% of the employees are paid highly, while 48% of them are in the low salary range.

Departmentwise employee count is as follows:

d %>% 
  count(departments) %>% 
  formattable(align = 'l')

departments	n
accounting	767
hr	739
IT	1227
management	630
marketing	858
product_mng	902
RandD	787
sales	4140
support	2229
technical	2720

As you can see Sales Department is the largest and Management is the smallest in size.

Departmentwise salary frequencies are shown below:

d %>% 
  count(departments, salary) %>% group_by(departments) %>% 
  mutate(
    salary = ordered(salary, c("low", "medium", "high")),
    n = percent(n, 0),
    n = n/sum(n)
  ) %>% 
  spread(salary, n) %>% 
  formattable(list(area(T, 2:4) ~ color_tile("grey", "pink")), align = 'l')

departments	low	medium	high
accounting	0.46675359	0.43676662	0.09647979
hr	0.45331529	0.48579161	0.06089310
IT	0.49633252	0.43602282	0.06764466
management	0.28571429	0.35714286	0.35714286
marketing	0.46853147	0.43822844	0.09324009
product_mng	0.50000000	0.42461197	0.07538803
RandD	0.46251588	0.47268107	0.06480305
sales	0.50700483	0.42801932	0.06497585
support	0.51413190	0.42261104	0.06325707
technical	0.50441176	0.42169118	0.07389706

As it is better visualised in the table above, only Management Department is well balanced in terms of salary. Other than that the rest of the departments are paid in the lower and medium wage nearly equally. Since nearly all of the employess work for low and medium wage, salary alone cannot be a dominant factor for the decision of quiting.

Now, having known all the column names and their related statistics, we wanted to visualize the distribution of the numerical ones, just to get better understanding for further interpretations. For this reason we employed histogram plots. Notice that bin sizes are adapted with respect to varying ranges of the data.

p1 <- ggplot(subset(d ), aes(x=satisfaction_level, colour=satisfaction_level)) +
  theme_economist() + 
  scale_color_economist()+
  theme(plot.title = element_text(size=12)) +
  geom_histogram(binwidth = 0.05,color = 'black', fill = '#999999')  +
  ggtitle("Satisfaction Level") +
  labs(x="Job Satis. Score", y="Employee Count", size = 8)



p2 <- ggplot(subset(d ), aes(x=last_evaluation, colour=last_evaluation)) +
  theme_economist() + 
  scale_color_economist()+
  theme(plot.title = element_text(size=12)) +
  geom_histogram(binwidth = 0.05,color = 'black', fill = '#009E73') +
  ggtitle("Last Evaluation") +
  labs(x="Performance Score", y="Employee Count", size = 8)

p3 <- ggplot(subset(d ), aes(x=number_project, colour=number_project)) +
  theme_economist() + 
  scale_color_economist()+
  theme(plot.title = element_text(size=12)) +
  geom_histogram(binwidth = 1,color = 'black', fill = '#56B4E9')  +
  ggtitle("Yearly Projects") +
  labs(x="Number of Projects", y="Employee Count", size = 8)

p4 <- ggplot(subset(d ), aes(x=average_montly_hours, colour=average_montly_hours)) +
  theme_economist() + 
  scale_color_economist()+
  theme(plot.title = element_text(size=12)) +
  geom_histogram(binwidth = 10,color = 'black', fill = '#E69F00')  +
  ggtitle("Average Monthly Hours")+
  labs(x="Monthly Hours", y="Employee Count", size = 8)
  

p5 <- ggplot(subset(d ), aes(x=time_spend_company, colour=time_spend_company)) +
  theme_economist() + 
  scale_color_economist()+
  theme(plot.title = element_text(size=12)) +
  geom_histogram(binwidth = 0.5,color = 'black', fill = '#F0E442')  +
  ggtitle("Time Spent in Company") +
  labs(x="Time (Years)", y="Employee Count", size = 8)

p6 <- ggplot(subset(d ), aes(x=Work_accident, colour=Work_accident)) +
  theme_economist() + 
  scale_color_economist()+
  theme(plot.title = element_text(size=12)) +
  geom_bar(color = 'black', fill = '#0072B2')+ ## +
  ggtitle("Work Accidents") +
  labs(x="Accident Count", y="Employee Count", size = 8) +
  scale_x_continuous(breaks = c(0,1,1))

p7 <- ggplot(subset(d ), aes(x=left, colour=left)) +
  theme_economist() + 
  scale_color_economist()+
  theme(plot.title = element_text(size=12)) +
  geom_bar(color = 'black', fill = '#D55E00') +
  ggtitle("Employees Left") +
  labs(x="Left Count", y="Employee Count", size = 8) +
  scale_x_continuous(breaks = c(0,1,1))


p8 <- ggplot(subset(d), aes(x=promotion_last_5years, colour=promotion_last_5years)) +
  theme_economist() + 
  scale_color_economist()+
  theme(plot.title = element_text(size=12)) +
  scale_x_continuous(breaks = c(0,1,1)) +
  geom_bar(color = 'black', fill = '#CC79A7') +
  ggtitle("Promotion in 5 Years") +
  labs(x="Promotion Count", y="Employee Count", size = 8) 


multiplot(p1,p2,p3,p4,p5,p6,p7,p8,cols=3)

According to above histograms, we found that; there was nearly no promoted employee in the last five years. Bear in mind that the work accident data belongs to last two years and frequency of accidents is low but not zero. Monthly time spent in the company and number of projects which each employee works on both have positively skewed distributions. So, instead of mean, median value must be considered. Thus, a common employee works 4 projects per year, spending 200 hours per month. Average monthly hours and last evauation score histograms are bimodal. There are two main group of workers in this company: one works for short (~150 h) on works for very long (~260 h) with medium (~0.55) and high evaluation scores (~0.85).

Let’s have a look at departmentwise job satisfaction level of the employees.

  dep_satis <- d %>%
  group_by(departments) %>%
  summarise (Satisfaction =mean(satisfaction_level))
  ggplot(data=dep_satis, aes(x=departments, y=Satisfaction)) +
  theme_economist() + 
  scale_color_economist()+
  theme(axis.text.x = element_text(angle = 90, hjust=0.5),axis.text = element_text(size=10) )+
  geom_bar(stat="identity", position=position_dodge(), fill="pink", colour="black")+
  labs(x= "Departments", y="Job Satisfaction Level")

It appears that the level of satisfaction is pretty much the same for all the departments except for some unhappy folks in the accounting department. While rest of the departments average satisfaction value is around 60% (which complies with overall satisfaction rate of 61.28%), accounting department’s mean is below the company’s mean.

depLeftSatis <- d %>% group_by(departments) %>% summarise(Total =n(),LeftCount = sum(left), Satisfaction = mean(satisfaction_level) ) %>% mutate(LeftRate = LeftCount/Total) 
                                                         
                

ggplot(data = depLeftSatis, aes( x = reorder(departments, -LeftRate), y = LeftRate, fill = Satisfaction)) +
    theme_economist() + 
    scale_color_economist()+
    theme(axis.text.x = element_text(angle = 90, hjust=0.5))+
    geom_col()+labs(x= "Departments", y="Left Rate")

Among all the departments, HR is the one with highest attrition rate, then follows Accounting. The top three departments are also the most unsatisfied ones in which Accounting peaks in unhappiness. From the graph above we deduced that Satisfaction Level is a pretty important factor in quitting the company.

We also replotted the graph above by filling each bar according to the salary distribution of the department of interest. The salary imbalance within the departments as well as the whole company is apperent again.

depLeftSatis <- d %>% group_by(departments,salary) %>% summarise(Total =n(),LeftCount = sum(left), Satisfaction = mean(satisfaction_level) ) %>% mutate(LeftRate = LeftCount/Total)

ggplot(data = depLeftSatis, aes( x = reorder(departments, -LeftRate), y = LeftRate, fill=salary)) +
    theme_economist() + 
    scale_color_economist()+
    theme(axis.text.x = element_text(angle = 90, hjust=0.5)) +
    geom_bar(stat="identity",position = "stack")+labs(x= "Departments", y="Left Rate")

Notice that altough job satisfaction level is uniformly distributed among the departments, the left rate is not, it is maximized at the HR department. So there must be other factors that effect employee’s decision to sign-out, such as long working hours.

Now, let’s look at the categorical variables; salary and departments. We wanted to compare the resigned and the remained employees by departments and salary categories. Bear in mind that in the graphs below the left ratio is not plotted but total count is. The departments with the highest left ratio are HR, Accounting and Technical departments. We already noticed that salary category did not give much insight about the characteristics of people left. However, if one is paid high in this company he/she is unlikely to resing.

p9 <- ggplot(d, aes(x=d$departments,fill=as.character(d$left)))+
      theme_economist() + 
      scale_color_economist()+
      theme(axis.text.x = element_text(angle = 90, hjust=0))+
      geom_bar() +
      labs(x="Departments", y="Employee Count", fill="Left or Not")

p10 <- ggplot(d, aes(x=d$salary,fill=as.character(d$left) ))+
       theme_economist() + 
       scale_color_economist()+
       theme(axis.text.x = element_text(angle = 90, hjust=0))+
       geom_bar() +
       labs(x="Salary", y ="Employee Count",fill="Left or Not")

multiplot(p9,p10,cols=2)

When we checked the correlation matrix of the whole company, we see that left rate is negatively correlated with job satisfaction level with a coefficient of -0.39. Interestingly people who left seem the ones with the high satisfaction level. However, we need to dive deep into tha data to claim such a thing. Other results that we deduced from the correlation matrix are average monthly working hours are positively correlated with the number of projects that an employee works on per year with a coefficient of 0.42 and the last evaluation score is positively correlated with number of prokect with a coefficient of 0.35. It is obvious that employees who work on more project needs to stay in the office for longer hours. This company’s evaluation system depends on hard work. Unfortunately all the effor that employees put on their work does not return them as salary increase or promotion. May be that is why they leave!

CorrMat <- round(cor(d[,0:8]),3)
ggcorrplot(CorrMat, hc.order = TRUE, 
           #type = "lower", 
           lab = TRUE, 
           lab_size = 3, 
           method="circle", 
           colors = c("tomato2", "white", "springgreen3"), 
           title="Correlogram of Numeric Values", 
           ggtheme=theme_economist)

Maybe it is better to focus on the people who quit, already. For this reason we filtered our data.

dLeft <- d %>% filter(left==1) %>% select(-left)
summary(dLeft)

##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.4500   Min.   :2.000   Min.   :126.0       
##  1st Qu.:0.1300     1st Qu.:0.5200   1st Qu.:2.000   1st Qu.:146.0       
##  Median :0.4100     Median :0.7900   Median :4.000   Median :224.0       
##  Mean   :0.4401     Mean   :0.7181   Mean   :3.856   Mean   :207.4       
##  3rd Qu.:0.7300     3rd Qu.:0.9000   3rd Qu.:6.000   3rd Qu.:262.0       
##  Max.   :0.9200     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident     promotion_last_5years
##  Min.   :2.000      Min.   :0.00000   Min.   :0.000000     
##  1st Qu.:3.000      1st Qu.:0.00000   1st Qu.:0.000000     
##  Median :4.000      Median :0.00000   Median :0.000000     
##  Mean   :3.877      Mean   :0.04733   Mean   :0.005321     
##  3rd Qu.:5.000      3rd Qu.:0.00000   3rd Qu.:0.000000     
##  Max.   :6.000      Max.   :1.00000   Max.   :1.000000     
##                                                            
##      departments      salary    
##  sales     :1014   high  :  82  
##  technical : 697   low   :2172  
##  support   : 555   medium:1317  
##  IT        : 273                
##  hr        : 215                
##  accounting: 204                
##  (Other)   : 613

We pairwise plotted job satisfaction level versus average montly hours and also satisfaction level versus time spent in the company. Meanwhile, we also highlighted employees left the company.

#Define three main groups which left the company 
d$left <- factor(d$left, labels=c("Working", "Quitted"))

gg <- ggplot(d,aes(satisfaction_level,average_montly_hours)) +   theme_economist() + scale_color_economist()

gg <- gg +geom_point(aes(col=factor(d$left)))

gg + geom_encircle(data=subset(d, 
                               satisfaction_level>=0.05 & 
                                 satisfaction_level<0.12 & 
                                 average_montly_hours>240 & 
                                 average_montly_hours<=310), colour="blue",expand=0.05,spread=0.02,size=2) +
  geom_encircle(data=subset(d, 
                               satisfaction_level>=0.35 & 
                                 satisfaction_level<0.47 & 
                                 average_montly_hours>122 & 
                                 average_montly_hours<164), colour="blue", expand=0,spread=0.02,size=2) +
  geom_encircle(data=subset(d, 
                               satisfaction_level>=0.47 & 
                                 satisfaction_level<=1 & 
                                 average_montly_hours>129 & 
                                 average_montly_hours<279), colour="green", expand=0, spread=0.02,size=2) +
  geom_encircle(data=subset(d, 
                               satisfaction_level>=0.715 & 
                                 satisfaction_level<=0.9 & 
                                 average_montly_hours>215 & 
                                 average_montly_hours<278), colour="blue", expand=0, spread=0.02,size=2) +

  scale_x_continuous(breaks=seq(0.08, 1, 0.08), limits = c(0.08, 1)) + 
  scale_y_continuous(breaks=seq(100, 315, 30), limits = c(100, 315)) +
  labs(x="Satisfaction Level",y="Avg. Monthly Hours", title="Satisfaction Level vs. Average Monthly Hours",col="Left Work")

## Warning: Removed 54 rows containing missing values (geom_point).

From the satisfaction vs. monthly hours plot, three seperate clusters are clearly seen with following properties:

Group 1: Low satisfaction level and long working hours
Group 2: Mediocre satisfaction level and low working hours
Group 3: High satisfaction level and high working hours

It is no surprise that employees belonging to the first group quits the job. Furthermore, it would not be wrong to call the second group as unmotivated and the last group as motivated. It is very surprising to see motivated employees with such a high satisfaction level (>0.8) and long working hours leave the company. Those ones also must be the ones who works on more projects than the average, since average monthly hours highly correlates with number of projects.

We are also curious whether the time spent in the company effect employee’s resignation. When we plotted time spent with respect to satisfaction level.

gg <- ggplot(d,aes(satisfaction_level,time_spend_company))

gg <- gg + geom_point(aes(col=factor(d$left))) +  theme_economist() + scale_color_economist()

gg + geom_encircle(data=subset(d, 
                               satisfaction_level>=0.72 & 
                                 satisfaction_level<=0.92 & 
                                 time_spend_company>4 & 
                                 time_spend_company<6.02), colour="blue", expand=0.02,spread=0.02,size=2)+
  geom_encircle(data=subset(d, 
                               satisfaction_level>=0.36 & 
                                 satisfaction_level<=0.47 & 
                                 time_spend_company==3 ), colour="blue", expand=0.02,spread=0.02,size=2)+
  geom_encircle(data=subset(d, 
                               satisfaction_level>=0.08 & 
                                 satisfaction_level<=0.12 & 
                                 time_spend_company>=3 &
                                time_spend_company<=5), colour="blue", expand=0.02,spread=0.02,size=2)+

  scale_x_continuous(breaks=seq(0.08, 1, 0.08), limits = c(0.08, 1)) + 
  scale_y_continuous(breaks =  seq(0,10,by=2), limits = c(0,10))+
  labs(x="Satisfaction Level",y="Time Spend In Company (Years)", title="Satisfaction Level vs. Time Spent In Company (Years)",col="Left Work")

In this new plot wee found the traces of previous groups above. The ones who left localized in the same satisfaction level ranges. So, we could attribute time spent to define the three groups above. For example, it is seen that the motivated last group leaves the company between four and six years. Their high motivation let them to keep the job for longer periods. May be they did not get the promotion or salary increase they expected through out the years and they leave at the end. Similarly, When we evaluated the other two groups in the last two graphs together, new characteristics of the groups become:

Group 1: Low satisfaction level, long working hours and leaves in medium term (3-5 years)
Group 2: Mediocre satisfaction level, low working hours and leaves early (around 3 years)
Group 3: High satisfaction level and high working hours and leaves late (5-6 years)

Interestinly, people who passes years thereshold, does not tend to leave the company later. This must be because of the people’s reluctancy to go out of their comfort zone. Altough their salary and satisfaction level is low or at most mediocre, they still want to swim in the sea they know best.

To understand factors which cause resignation decision better, we also plotted the correlation matrix for the numerical variables.

CorrMat <- round(cor(dLeft[,0:7]),3)
ggcorrplot(CorrMat, hc.order = TRUE, 
           #type = "lower", 
           lab = TRUE, 
           lab_size = 3, 
           method="circle", 
           colors = c("tomato2", "white", "springgreen3"), 
           title="Correlogram of Numeric Values", 
           ggtheme=theme_economist)

Correlogram created from numerical values in our dataset shows that Average Monthly Hours is highly correlated with Number Of Projects as expected. The ones who works on more than avarage (3) projects, should spend more time in the company. And also last evaluation score correlates highly with number of projects and average monthly hours. High evaluation score is given to people who works on more projects by spending more time in the company.

Principal Component Analysis

With the insight gained from the correlation matrix of the employees who are already left. Before starting to principal component analysis we need to transform salaries to numeric values to see their effect on left rate. We need also to scale our data to avoid the dominance of the columns with high variance.

## Warning: package 'devtools' was built under R version 3.4.3

## Skipping install of 'ggbiplot' from a github remote, the SHA1 (7325e880) has not changed since last install.
##   Use `force = TRUE` to force installation

## Loading required package: plyr

## -------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## -------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following object is masked from 'package:modeltools':
## 
##     empty

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following object is masked from 'package:purrr':
## 
##     compact

## Loading required package: scales

## Warning: package 'scales' was built under R version 3.4.3

## 
## Attaching package: 'scales'

## The following objects are masked from 'package:formattable':
## 
##     comma, percent, scientific

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

names(dLeft)

## [1] "satisfaction_level"    "last_evaluation"       "number_project"       
## [4] "average_montly_hours"  "time_spend_company"    "Work_accident"        
## [7] "promotion_last_5years" "departments"           "salary"

dLeft_pca <- dLeft  %>% mutate(salary_Num = ifelse(salary == "low", 0, ifelse(salary == "medium", 1, 2))) %>% select(-salary,-departments)
pca_w_scaling<-princomp(as.matrix(dLeft_pca[,1:8]),cor=TRUE)
summary(pca_w_scaling,loadings=TRUE)

## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4    Comp.5
## Standard deviation     1.8162209 1.1428322 1.0272619 1.0030200 0.9672535
## Proportion of Variance 0.4123323 0.1632582 0.1319084 0.1257561 0.1169474
## Cumulative Proportion  0.4123323 0.5755905 0.7074989 0.8332550 0.9502024
##                            Comp.6     Comp.7     Comp.8
## Standard deviation     0.41439073 0.36124911 0.31009692
## Proportion of Variance 0.02146496 0.01631261 0.01202001
## Cumulative Proportion  0.97166737 0.98797999 1.00000000
## 
## Loadings:
##                       Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## satisfaction_level            0.852                       0.352 -0.147
## last_evaluation       -0.522                              0.433  0.584
## number_project        -0.493 -0.317                              0.199
## average_montly_hours  -0.510 -0.194                       0.253 -0.773
## time_spend_company    -0.468  0.359                      -0.790       
## Work_accident                        0.686  0.326 -0.649              
## promotion_last_5years                0.712 -0.129  0.689              
## salary_Num                          -0.140  0.936  0.320              
##                       Comp.8
## satisfaction_level     0.350
## last_evaluation       -0.442
## number_project         0.784
## average_montly_hours  -0.199
## time_spend_company    -0.166
## Work_accident               
## promotion_last_5years       
## salary_Num

ggbiplot(pca_w_scaling, labels= dLeft$satisfaction_level) +theme_economist()+ scale_color_economist()

The first four components explains %83 of the variance. With five components even 95% of the variance can be explained. Components that explain lower than 10% of the variance are not considered because they do not contribute much. When we look at the loadings, we understan that PC1 is composed of last evaluation, number of projects, monthly hours and time spent columns; PC2 is composed of satisfaction level, number of projects, monthly hours and time spent columns; PC3 and PC4 are composed of accident, salary and promotion columns. PCA is usefull for dimensionality reduction and we are glad to know that HR data can be explained in number of components lesser than total number of columns.

Principal components (eigenvectors) are linear combinations of actual factors (loadings) with constant coefficients (eigenvalues). By definition they are orthogonal (perpendicular) to eachother. First two eigenvectors maximizes the variance most (explains better). So, when we plot each factor scores in PC1-PC2 plane, we highlighted groups of homogenous individiuals. In the biplot above each point is labelled with satisfaction score, so we can conclude that people left are groupped according to their satisfaction value: very low, medium, high. We believe that groups deduced from PCA matches groups identified from the scatter plots previously.

Also cumulative variance plot is show below:

ggplot(data.frame(pc=1:8,cum_var=c(0.4123323, 0.5755905, 0.7074989, 0.8332550,0.9502024,0.97166737,0.98797999,1.00000000)),aes(x=pc,y=cum_var)) + 
  theme_economist() +
  scale_color_economist() +
  geom_point() +
  geom_line() +
  labs(x="Principal Component Number",
       y="Cumulative Proportion", 
       title="Cumulative Proportion of the Var. Explained by PCs")

K-Means Clustering

We tried K-Means clustering to be able to cluster mass of people who left the company. We applied k-Means with 3 clusters and 30 iterations since we expect to find three groups.

#Filter to get the data from left people only
dleft<- subset(d, left=='Quitted') 
dleft<- select(dleft,-left)


#apply k-means with 3 clusters and 30 iterations
set.seed(42)
dleft_Cluster <- kmeans(dleft[, 1:7], 3, nstart = 30)

dleft_Cluster$centers

##   satisfaction_level last_evaluation number_project average_montly_hours
## 1          0.6100000       0.8891911       4.936889             243.6409
## 2          0.4159149       0.5306140       2.178723             144.8322
## 3          0.2511361       0.8628964       5.780275             285.0799
##   time_spend_company Work_accident promotion_last_5years
## 1           4.778667    0.05333333          0.0008888889
## 2           3.071733    0.04741641          0.0091185410
## 3           4.262172    0.03870162          0.0037453184

#print(sum(dleft_Cluster$withinss))

When we compare the center values of three clusters with our previously defined groups our scatter plots (Satisfaction Level vs. Avg. Monthly Hours, Satisfaction Level vs. Time Spend In Company), we are able to match them.

Cluster 1: High satisfaction level, long working hours, leaves late (corresponds to Group 3)
Cluster 2: Low satisfaction level, very long working hours, leaves at medium term (corresponds to Group 1)
Cluster 3: Medium satisfaction level, low working hours, leaves early (corresponds to Group 2)

Thus, one can conclude that leaving reasons depends on the employee’s characteristics. Ones with medium satisfaction level, do not dedicate to work very much may be because they do not see any future in the company and quit early. Self-motivated ones work patiently long hours for long years but at the end they quit. Even low satisfied ones could work for very long hours they quit in medium terms.

Decision Tree Analysis and Prediction

We first set seed to a constant value.

set.seed(42)

Then left column is chosen as response and the rest are chosen as predictors. We put them in a data frame.

  d <- d %>% # read in the data
  select(left, departments, satisfaction_level, 
         last_evaluation, number_project, average_montly_hours,
         time_spend_company, promotion_last_5years, salary) %>%
  mutate(departments = factor(departments), 
         salary = factor(salary))
head(d)

## # A tibble: 6 x 9
##      left departments satisfaction_level last_evaluation number_project
##    <fctr>      <fctr>              <dbl>           <dbl>          <int>
## 1 Quitted       sales               0.38            0.53              2
## 2 Quitted       sales               0.80            0.86              5
## 3 Quitted       sales               0.11            0.88              7
## 4 Quitted       sales               0.72            0.87              5
## 5 Quitted       sales               0.37            0.52              2
## 6 Quitted       sales               0.41            0.50              2
## # ... with 4 more variables: average_montly_hours <int>,
## #   time_spend_company <int>, promotion_last_5years <int>, salary <fctr>

Split the data in train and test

n <- nrow(d)
indices <- sample(n, n*0.7)
train_d <- d[indices, ]
test_d <- d[-indices, ]
head(train_d)

## # A tibble: 6 x 9
##      left departments satisfaction_level last_evaluation number_project
##    <fctr>      <fctr>              <dbl>           <dbl>          <int>
## 1 Working     support               0.47            0.43              4
## 2 Working  management               0.62            0.61              3
## 3 Working       sales               0.90            0.88              4
## 4 Quitted       RandD               0.37            0.54              2
## 5 Working   technical               0.80            0.70              3
## 6 Working       RandD               0.88            0.92              3
## # ... with 4 more variables: average_montly_hours <int>,
## #   time_spend_company <int>, promotion_last_5years <int>, salary <fctr>

We fit the model

rtree_fit <- rpart(left ~ ., data = train_d) 
rpart.plot(rtree_fit)

To test the fit quality, we applied our model with fitted parameters to test data.

res <- predict(rtree_fit, test_d)
auc(as.numeric(test_d$left) - 1, res[, 2])

## Area under the curve: 0.9768

Prediction success of 97% is very good. We can comfortably say that primary factor that determines an employee’s resignation is satisfaction level. If it is lower than 46%, people tend to quit. Those unhappy guys are divided into two categories:

If he/she works on many projects (more than 2.5 on yearly basis), they still leave if they are desperately unsatisfied with the job (threshold level = 0.11). Those must belong to Cluster 2 (i.e Group 1) because they have such a low satisfaction value but still work on more projects. Since number of projects correlates with average working hours, they must work for long hours to complete the projects.
If he/she works on few projects (less than 2.5 on yearly basis), they still leave even if their bosses are happy about their work (evaluation score > 0.56). Much probably they fall into Cluster 3 (i.e Group 2) because they are not very much satisfied with the job and the number of projects they work on is not much. So they do not commit long hours to work as those folks in the last cluster.

Secondary factor that determines the satisfied employees quit decision is time spent in the company. If someone is working in the same company between 4.5 - 6.5 years, they break their comfort zone and flee because they work for long hours (> 216 h monthly basis) even if their bosses are very happy with their work (evaluation score > 0.80). We belive that this category corresponds to patient, hardworking, motivated employees of Cluster 1 (i.e. Group 3).

Conclusion

Our EDA is in good compliance with the results deduced from K-Means clustering. We are able to match our previously found groups with three clusters. With help of PCA we ensured that employees quitted falls into three subgroups with differing satisfaction scores. Furthermore Decision Tree Analysis supports our findings. As a conlusion, we find that the satisfaction level is the key factor to quit the job.

The company of interest is not paying much to its employees or promotes. Low and medium salary ranges are the highest portion of their pay check. Thus, since everbody is equal in the minimum, salary and promotion are not that important as satisfaction level. If employee’s satisfaction level is higher than 0.46, he/she will not quit in a short period. However, when the satisfaction level decreases, other factors will become important for the resignation decison such as number of projects or working hours.

This company has three different categories of employees left. First group is working very hard with low satisfaction level, it is expected that this group members quit the job. Second group has low(close to medium) satisfaction level and average monthly hours but they leave, for this group time spend company is the decision factor. People are leaving before 4,5 years from company. Most probably, they leave the company earlier if there is no uptunn in their working conditions.

The most interesting group is the last one, those of experienced ones who tend to work for long hours with high satisfaction score. For instance if you work in the company less than 4.5 years with higher evaluation score above 0.8 and working more than 216 hours in monthly, you are very close to leave the company. It is obvious that this group likes to work for long hours due to their intrinsic high motivation but they leave the company around 4-6 years. They must be getting better offers. If you a have passion for your work and feeling that you are not paid what you deserve, surely you leave the company.

References

Human Resources Analytics Data Set

Explore & Explain: Density & Decision Trees

Titanic: Getting Starting with R - Part 3: Decision Trees

Intro to Decision Trees with R Example

StackExchange