There are multiple worksheets within the Excel workbook.
These include;
The workbook revolves around a fictitious company, called Dental Magic, and the core data set contains names, DOBs, age, gender, marital status, date of hire, reasons for termination, department, whether they are active or terminated, position title, pay rate, manager name, and performance score.
Here’s several helpful packages to load in R.
# Load packages
library(ggplot2) # Data visualization
library(readr) # CSV file I/O, e.g. the read_csv function
library('dplyr')
library('tidyr')
library(ggplot2)
library(lubridate)
library(highcharter)
library(data.table)
library(plotly)
Now that our packages are loaded, let’s read in and take a peek at the data.
hr <- read.csv("C:/Users/HP/Desktop/BDA/BDA 503/HR/core_dataset.csv")
hr<-data.table(hr)
# check data
str(hr)
## Classes 'data.table' and 'data.frame': 302 obs. of 21 variables:
## $ Employee.Name : Factor w/ 302 levels "","Adinolfi, Wilson K",..: 30 154 264 128 251 255 158 223 26 90 ...
## $ Employee.Number : int 1103024456 1106026572 1302053333 1211050782 1307059817 711007713 1102024115 1206043417 1307060188 1201031308 ...
## $ State : Factor w/ 29 levels "","AL","AZ","CA",..: 12 12 12 12 12 12 12 12 12 12 ...
## $ Zip : num 1450 1460 2703 2170 2330 ...
## $ DOB : Factor w/ 299 levels "","1/12/1973",..: 53 145 268 276 166 200 196 60 150 135 ...
## $ Age : int 32 33 31 32 29 30 33 33 30 38 ...
## $ Sex : Factor w/ 4 levels "","Female","male",..: 2 4 4 2 2 2 4 4 2 2 ...
## $ MaritalDesc : Factor w/ 6 levels "","Divorced",..: 3 2 5 3 5 3 3 3 3 3 ...
## $ CitizenDesc : Factor w/ 4 levels "","Eligible NonCitizen",..: 4 4 4 4 4 4 4 2 4 4 ...
## $ Hispanic.Latino : Factor w/ 5 levels "","no","No","yes",..: 3 3 3 3 3 3 3 3 3 2 ...
## $ RaceDesc : Factor w/ 7 levels "","American Indian or Alaska Native",..: 4 4 7 7 7 3 7 4 3 7 ...
## $ Date.of.Hire : Factor w/ 94 levels "","1/10/2011",..: 19 10 90 29 48 88 9 33 29 6 ...
## $ Date.of.Termination: Factor w/ 94 levels "","1/11/2014",..: 1 1 1 39 1 85 1 71 1 1 ...
## $ Reason.For.Term : Factor w/ 18 levels "","Another position",..: 12 12 12 12 12 4 12 4 12 12 ...
## $ Employment.Status : Factor w/ 6 levels "","Active","Future Start",..: 2 2 2 2 2 6 2 6 2 2 ...
## $ Department : Factor w/ 8 levels "","Admin Offices",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Position : Factor w/ 25 levels "","Accountant I",..: 2 2 2 3 3 3 20 20 23 23 ...
## $ Pay.Rate : num 28.5 23 29 21.5 16.6 ...
## $ Manager.Name : Factor w/ 21 levels "","Alex Sweetwater",..: 5 5 5 5 5 5 11 11 5 4 ...
## $ Employee.Source : Factor w/ 23 levels "","Billboard",..: 5 22 9 17 22 5 11 5 5 15 ...
## $ Performance.Score : Factor w/ 8 levels "","90-day meets",..: 5 5 5 6 6 5 5 5 2 5 ...
## - attr(*, ".internal.selfref")=<externalptr>
We’ve got a sense of our variables, their class type, and the first few observations of each. We know we’re working with 302 observations of 21 variables. What about the summary of the fields. Let’s have a look.
summary(hr)
## Employee.Name Employee.Number State
## : 1 Min. :6.020e+08 MA :266
## Adinolfi, Wilson K : 1 1st Qu.:1.102e+09 CT : 6
## Ait Sidi, Karthikeyan : 1 Median :1.204e+09 TX : 3
## Akinkuolie, Sarah : 1 Mean :1.205e+09 VT : 2
## Alagbe,Trina : 1 3rd Qu.:1.401e+09 : 1
## Albert, Michael : 1 Max. :1.988e+09 AL : 1
## (Other) :296 NA's :1 (Other): 23
## Zip DOB Age Sex
## Min. : 1013 7/7/1984 : 2 Min. :25.00 : 1
## 1st Qu.: 1901 9/22/1976: 2 1st Qu.:31.00 Female:174
## Median : 2132 9/9/1965 : 2 Median :37.00 male : 1
## Mean : 6705 : 1 Mean :38.55 Male :126
## 3rd Qu.: 2421 1/12/1973: 1 3rd Qu.:44.00
## Max. :98052 1/12/1974: 1 Max. :67.00
## NA's :1 (Other) :293 NA's :1
## MaritalDesc CitizenDesc Hispanic.Latino
## : 1 : 1 : 1
## Divorced : 30 Eligible NonCitizen: 12 no : 2
## Married :119 Non-Citizen : 4 No :271
## Separated: 14 US Citizen :285 yes: 1
## Single :127 Yes: 27
## widowed : 11
##
## RaceDesc Date.of.Hire
## : 1 1/10/2011: 14
## American Indian or Alaska Native: 4 3/30/2015: 12
## Asian : 31 1/5/2015 : 11
## Black or African American : 54 9/29/2014: 11
## Hispanic : 4 5/16/2011: 10
## Two or more races : 18 7/5/2011 : 10
## White :190 (Other) :234
## Date.of.Termination Reason.For.Term
## :199 N/A - still employed :188
## 1/9/2012 : 2 Another position : 20
## 11/4/2015: 2 unhappy : 14
## 4/1/2013 : 2 more money : 11
## 4/4/2014 : 2 N/A - Has not started yet: 11
## 4/7/2012 : 2 career change : 9
## (Other) : 93 (Other) : 49
## Employment.Status Department
## : 1 Production :208
## Active :174 IT/IS : 41
## Future Start : 11 Sales : 31
## Leave of Absence : 14 Admin Offices : 10
## Terminated for Cause : 14 Software Engineering: 9
## Voluntarily Terminated: 88 : 1
## (Other) : 2
## Position Pay.Rate Manager.Name
## Production Technician I :136 Min. :14.00 Elijiah Gray : 22
## Production Technician II: 57 1st Qu.:20.00 Kelley Spirea : 22
## Area Sales Manager : 27 Median :24.00 Kissy Sullivan: 22
## Production Manager : 14 Mean :30.72 Michael Albert: 22
## Database Administrator : 13 3rd Qu.:43.00 Amy Dunn : 21
## Network Engineer : 9 Max. :80.00 Brannon Miller: 21
## (Other) : 46 NA's :1 (Other) :172
## Employee.Source Performance.Score
## Employee Referral : 31 Fully Meets :172
## Diversity Job Fair : 29 N/A- too early to review: 37
## Search Engine - Google Bing Yahoo: 25 90-day meets : 31
## Monster.com : 24 Exceeds : 28
## Pay Per Click - Google : 21 Needs Improvement : 15
## Professional Society : 19 Exceptional : 9
## (Other) :153 (Other) : 10
We have one NA in our dataset. Lets remove it out.
#remove 'NAs'
hr <-na.omit(hr)
Also there is some lowercase mismatch in “Sex” and Hispanic.Latino fields.
#fix missmatch
hr$Sex[hr$Sex =="male"] <-'Male'
hr$Hispanic.Latino[hr$Hispanic.Latino =="no"] <-'No'
hr$Hispanic.Latino[hr$Hispanic.Latino =="yes"] <-'Yes'
summary(hr$Sex)
## Female male Male
## 0 174 0 127
summary(hr$Hispanic.Latino)
## no No yes Yes
## 0 0 273 0 28
When we look at the data, the graph of “Sex”" status of the employees whose status is “still employed” seems to be at the following.
There are 108 employees in our database, 107 of them (%56.9) are women, and 81 of them (%43.1) are males.
sexstatus<-hr[Reason.For.Term=="N/A - still employed", .(Total = length(unique(Employee.Number))), .(Sex)][order(-Total)]
sexstatus[,GP:=Total/sum(Total) ,]
sexstatus
## Sex Total GP
## 1: Female 107 0.5691489
## 2: Male 81 0.4308511
plot_ly(sexstatus, labels = ~Sex, values = ~GP, type = 'pie') %>%
layout(title = 'Performace of Employees by Gender',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
42.9% of female employees are single on the other hand 50.6% of male employees are single.
Martialstatus<-hr[Reason.For.Term=="N/A - still employed", .(Total = length(unique(Employee.Number))), .(MaritalDesc,Sex)][order(-Total)]
Martialstatus[,MP:=Total/sum(Total) ,.(Sex)]
hchart(Martialstatus,type="column",hcaes(x=MaritalDesc,y=MP,group=Sex))%>%
hc_add_theme(hc_theme_google())
Positionstatus<-hr[Reason.For.Term=="N/A - still employed", .(Total = length(unique(Employee.Number))), .(Position,Sex)][order(-Total)]
hchart(Positionstatus,type="column",hcaes(x=Position,y=Total,group=Sex)) %>%
hc_add_theme(hc_theme_google())
Is there any bias in Payrate against any particular Race.Lets see!
plot1 <-ggplot(hr,aes(x=RaceDesc,y=Pay.Rate)) +
geom_boxplot() +
ggtitle("Pay Rate Vs Races-All ")
plot1
PayRatevsRaces<-hr[Reason.For.Term=="N/A - still employed", .(MedPayRate = median(Pay.Rate)), .(RaceDesc,Sex)][order(-MedPayRate)]
plot2 <-ggplot(PayRatevsRaces,aes(x=reorder(RaceDesc,-MedPayRate),y=MedPayRate,fill=Sex)) +
geom_bar(stat="identity",position="dodge") +
facet_wrap(~Sex) +
theme(axis.text.x = element_text(angle=90)) +
ggtitle("Pay Rate Vs Races by Gender")
plot2
It can be seen that Hispanics pay rate is fairly good from others.Especially Hispanic females pay rate is nearly 2 times more from Asians.
Lets look at Source of recruitment.
SourceofEmployees <-hr[,.(Total = length(unique(Employee.Number))),.(Employee.Source,Sex)][order(-Total),]
ggplot(SourceofEmployees,aes(reorder(Employee.Source,-Total),y=Total,fill=Total)) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle=90)) +
facet_wrap(~Sex) +
ggtitle("Source of Employees by Sex")
Most of the source of the employment is through from “Employee Referral” is followed by “Diversity Job Fair”.
When we look at gender wise source of employees we can see that interstingly female ftaff has been recruited via “Job Sites”" and male staff has been recruited via “Employee Referral”.
PerfbyGender<-hr[Performance.Score !='N/A- too early to review',.(Total = length(unique(Employee.Number))),.(Department,Performance.Score,Sex)][order(-Total),]
ggplot(PerfbyGender,aes(x=reorder(Department,-Total),y=Total,fill=Performance.Score)) +
geom_bar(stat="identity",position="dodge") +
facet_wrap(~Sex) +
theme(axis.text.x = element_text(angle=90)) +
ggtitle("Performace of Employees by Gender")
Production Department has highly skilled labours which meets company expectations , also it has highest number of employees whose performance not upto the mark(Needs improvement/PIP)
PerfbyManager<-hr[,.(Total = length(unique(Employee.Number))),.(Manager.Name,Performance.Score,Sex)][order(-Total),]
ggplot(PerfbyManager,aes(x=Performance.Score,y=Total,fill=Performance.Score)) +
geom_bar(stat="identity") +
facet_wrap(~Manager.Name) +
theme(axis.text.x = element_text(angle=90)) +
ggtitle("Manager Wise Performance Rating")
When we look at the managers, the answer of which Manager gives better rating is
David Stanely and Janet King gives better rating ,and if you are working under Branon Miller there could be more chances you may get PIP rating(Performance Improvement Programme)
Thank you for taking the time to read through my first exploration of a dataset. I look forward to doing more.