1 Introduction

There are multiple worksheets within the Excel workbook.

These include;

The workbook revolves around a fictitious company, called Dental Magic, and the core data set contains names, DOBs, age, gender, marital status, date of hire, reasons for termination, department, whether they are active or terminated, position title, pay rate, manager name, and performance score.

1.1 Load and check data

Here’s several helpful packages to load in R.

# Load packages
library(ggplot2) # Data visualization
library(readr) # CSV file I/O, e.g. the read_csv function
library('dplyr')
library('tidyr')
library(ggplot2)
library(lubridate)
library(highcharter)
library(data.table)
library(plotly)

Now that our packages are loaded, let’s read in and take a peek at the data.

hr <- read.csv("C:/Users/HP/Desktop/BDA/BDA 503/HR/core_dataset.csv")
hr<-data.table(hr)
# check data
str(hr)
## Classes 'data.table' and 'data.frame':   302 obs. of  21 variables:
##  $ Employee.Name      : Factor w/ 302 levels "","Adinolfi, Wilson  K",..: 30 154 264 128 251 255 158 223 26 90 ...
##  $ Employee.Number    : int  1103024456 1106026572 1302053333 1211050782 1307059817 711007713 1102024115 1206043417 1307060188 1201031308 ...
##  $ State              : Factor w/ 29 levels "","AL","AZ","CA",..: 12 12 12 12 12 12 12 12 12 12 ...
##  $ Zip                : num  1450 1460 2703 2170 2330 ...
##  $ DOB                : Factor w/ 299 levels "","1/12/1973",..: 53 145 268 276 166 200 196 60 150 135 ...
##  $ Age                : int  32 33 31 32 29 30 33 33 30 38 ...
##  $ Sex                : Factor w/ 4 levels "","Female","male",..: 2 4 4 2 2 2 4 4 2 2 ...
##  $ MaritalDesc        : Factor w/ 6 levels "","Divorced",..: 3 2 5 3 5 3 3 3 3 3 ...
##  $ CitizenDesc        : Factor w/ 4 levels "","Eligible NonCitizen",..: 4 4 4 4 4 4 4 2 4 4 ...
##  $ Hispanic.Latino    : Factor w/ 5 levels "","no","No","yes",..: 3 3 3 3 3 3 3 3 3 2 ...
##  $ RaceDesc           : Factor w/ 7 levels "","American Indian or Alaska Native",..: 4 4 7 7 7 3 7 4 3 7 ...
##  $ Date.of.Hire       : Factor w/ 94 levels "","1/10/2011",..: 19 10 90 29 48 88 9 33 29 6 ...
##  $ Date.of.Termination: Factor w/ 94 levels "","1/11/2014",..: 1 1 1 39 1 85 1 71 1 1 ...
##  $ Reason.For.Term    : Factor w/ 18 levels "","Another position",..: 12 12 12 12 12 4 12 4 12 12 ...
##  $ Employment.Status  : Factor w/ 6 levels "","Active","Future Start",..: 2 2 2 2 2 6 2 6 2 2 ...
##  $ Department         : Factor w/ 8 levels "","Admin Offices",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Position           : Factor w/ 25 levels "","Accountant I",..: 2 2 2 3 3 3 20 20 23 23 ...
##  $ Pay.Rate           : num  28.5 23 29 21.5 16.6 ...
##  $ Manager.Name       : Factor w/ 21 levels "","Alex Sweetwater",..: 5 5 5 5 5 5 11 11 5 4 ...
##  $ Employee.Source    : Factor w/ 23 levels "","Billboard",..: 5 22 9 17 22 5 11 5 5 15 ...
##  $ Performance.Score  : Factor w/ 8 levels "","90-day meets",..: 5 5 5 6 6 5 5 5 2 5 ...
##  - attr(*, ".internal.selfref")=<externalptr>

We’ve got a sense of our variables, their class type, and the first few observations of each. We know we’re working with 302 observations of 21 variables. What about the summary of the fields. Let’s have a look.

summary(hr)
##                   Employee.Name Employee.Number         State    
##                          :  1   Min.   :6.020e+08   MA     :266  
##  Adinolfi, Wilson  K     :  1   1st Qu.:1.102e+09   CT     :  6  
##  Ait Sidi, Karthikeyan   :  1   Median :1.204e+09   TX     :  3  
##  Akinkuolie, Sarah       :  1   Mean   :1.205e+09   VT     :  2  
##  Alagbe,Trina            :  1   3rd Qu.:1.401e+09          :  1  
##  Albert, Michael         :  1   Max.   :1.988e+09   AL     :  1  
##  (Other)                 :296   NA's   :1           (Other): 23  
##       Zip               DOB           Age            Sex     
##  Min.   : 1013   7/7/1984 :  2   Min.   :25.00         :  1  
##  1st Qu.: 1901   9/22/1976:  2   1st Qu.:31.00   Female:174  
##  Median : 2132   9/9/1965 :  2   Median :37.00   male  :  1  
##  Mean   : 6705            :  1   Mean   :38.55   Male  :126  
##  3rd Qu.: 2421   1/12/1973:  1   3rd Qu.:44.00               
##  Max.   :98052   1/12/1974:  1   Max.   :67.00               
##  NA's   :1       (Other)  :293   NA's   :1                   
##     MaritalDesc               CitizenDesc  Hispanic.Latino
##           :  1                      :  1      :  1        
##  Divorced : 30   Eligible NonCitizen: 12   no :  2        
##  Married  :119   Non-Citizen        :  4   No :271        
##  Separated: 14   US Citizen         :285   yes:  1        
##  Single   :127                             Yes: 27        
##  widowed  : 11                                            
##                                                           
##                              RaceDesc      Date.of.Hire
##                                  :  1   1/10/2011: 14  
##  American Indian or Alaska Native:  4   3/30/2015: 12  
##  Asian                           : 31   1/5/2015 : 11  
##  Black or African American       : 54   9/29/2014: 11  
##  Hispanic                        :  4   5/16/2011: 10  
##  Two or more races               : 18   7/5/2011 : 10  
##  White                           :190   (Other)  :234  
##  Date.of.Termination                  Reason.For.Term
##           :199       N/A - still employed     :188   
##  1/9/2012 :  2       Another position         : 20   
##  11/4/2015:  2       unhappy                  : 14   
##  4/1/2013 :  2       more money               : 11   
##  4/4/2014 :  2       N/A - Has not started yet: 11   
##  4/7/2012 :  2       career change            :  9   
##  (Other)  : 93       (Other)                  : 49   
##               Employment.Status                Department 
##                        :  1     Production          :208  
##  Active                :174     IT/IS               : 41  
##  Future Start          : 11     Sales               : 31  
##  Leave of Absence      : 14     Admin Offices       : 10  
##  Terminated for Cause  : 14     Software Engineering:  9  
##  Voluntarily Terminated: 88                         :  1  
##                                 (Other)             :  2  
##                      Position      Pay.Rate             Manager.Name
##  Production Technician I :136   Min.   :14.00   Elijiah Gray  : 22  
##  Production Technician II: 57   1st Qu.:20.00   Kelley Spirea : 22  
##  Area Sales Manager      : 27   Median :24.00   Kissy Sullivan: 22  
##  Production Manager      : 14   Mean   :30.72   Michael Albert: 22  
##  Database Administrator  : 13   3rd Qu.:43.00   Amy Dunn      : 21  
##  Network Engineer        :  9   Max.   :80.00   Brannon Miller: 21  
##  (Other)                 : 46   NA's   :1       (Other)       :172  
##                           Employee.Source                Performance.Score
##  Employee Referral                : 31    Fully Meets             :172    
##  Diversity Job Fair               : 29    N/A- too early to review: 37    
##  Search Engine - Google Bing Yahoo: 25    90-day meets            : 31    
##  Monster.com                      : 24    Exceeds                 : 28    
##  Pay Per Click - Google           : 21    Needs Improvement       : 15    
##  Professional Society             : 19    Exceptional             :  9    
##  (Other)                          :153    (Other)                 : 10

We have one NA in our dataset. Lets remove it out.

#remove 'NAs'
hr <-na.omit(hr)

Also there is some lowercase mismatch in “Sex” and Hispanic.Latino fields.

#fix missmatch
hr$Sex[hr$Sex =="male"] <-'Male'
hr$Hispanic.Latino[hr$Hispanic.Latino =="no"] <-'No'
hr$Hispanic.Latino[hr$Hispanic.Latino =="yes"] <-'Yes'
summary(hr$Sex)
##        Female   male   Male 
##      0    174      0    127
summary(hr$Hispanic.Latino)
##      no  No yes Yes 
##   0   0 273   0  28

2 Basic Statistics and Graphs

When we look at the data, the graph of “Sex”" status of the employees whose status is “still employed” seems to be at the following.

There are 108 employees in our database, 107 of them (%56.9) are women, and 81 of them (%43.1) are males.

sexstatus<-hr[Reason.For.Term=="N/A - still employed", .(Total = length(unique(Employee.Number))), .(Sex)][order(-Total)]
sexstatus[,GP:=Total/sum(Total) ,]
sexstatus
##       Sex Total        GP
## 1: Female   107 0.5691489
## 2:   Male    81 0.4308511
plot_ly(sexstatus, labels = ~Sex, values = ~GP, type = 'pie') %>%
  layout(title = 'Performace of Employees by Gender',
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))

42.9% of female employees are single on the other hand 50.6% of male employees are single.

Martialstatus<-hr[Reason.For.Term=="N/A - still employed", .(Total = length(unique(Employee.Number))), .(MaritalDesc,Sex)][order(-Total)]
Martialstatus[,MP:=Total/sum(Total) ,.(Sex)]


hchart(Martialstatus,type="column",hcaes(x=MaritalDesc,y=MP,group=Sex))%>%
  hc_add_theme(hc_theme_google())
Positionstatus<-hr[Reason.For.Term=="N/A - still employed", .(Total = length(unique(Employee.Number))), .(Position,Sex)][order(-Total)]


hchart(Positionstatus,type="column",hcaes(x=Position,y=Total,group=Sex)) %>%
  hc_add_theme(hc_theme_google())

Is there any bias in Payrate against any particular Race.Lets see!

plot1 <-ggplot(hr,aes(x=RaceDesc,y=Pay.Rate)) +
       geom_boxplot() +
       ggtitle("Pay Rate Vs Races-All ")
       
plot1

PayRatevsRaces<-hr[Reason.For.Term=="N/A - still employed", .(MedPayRate = median(Pay.Rate)), .(RaceDesc,Sex)][order(-MedPayRate)]

plot2 <-ggplot(PayRatevsRaces,aes(x=reorder(RaceDesc,-MedPayRate),y=MedPayRate,fill=Sex)) +
  geom_bar(stat="identity",position="dodge") +
  facet_wrap(~Sex) +
  theme(axis.text.x = element_text(angle=90)) +
  ggtitle("Pay Rate Vs Races by Gender")

plot2

It can be seen that Hispanics pay rate is fairly good from others.Especially Hispanic females pay rate is nearly 2 times more from Asians.

Lets look at Source of recruitment.

SourceofEmployees <-hr[,.(Total = length(unique(Employee.Number))),.(Employee.Source,Sex)][order(-Total),]

ggplot(SourceofEmployees,aes(reorder(Employee.Source,-Total),y=Total,fill=Total)) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle=90)) +
  facet_wrap(~Sex) +
  ggtitle("Source of Employees by Sex")

Most of the source of the employment is through from “Employee Referral” is followed by “Diversity Job Fair”.

When we look at gender wise source of employees we can see that interstingly female ftaff has been recruited via “Job Sites”" and male staff has been recruited via “Employee Referral”.

PerfbyGender<-hr[Performance.Score !='N/A- too early to review',.(Total = length(unique(Employee.Number))),.(Department,Performance.Score,Sex)][order(-Total),]

ggplot(PerfbyGender,aes(x=reorder(Department,-Total),y=Total,fill=Performance.Score)) +
  geom_bar(stat="identity",position="dodge") +
  facet_wrap(~Sex) +
  theme(axis.text.x = element_text(angle=90)) +
  ggtitle("Performace of Employees by Gender")

Production Department has highly skilled labours which meets company expectations , also it has highest number of employees whose performance not upto the mark(Needs improvement/PIP)

PerfbyManager<-hr[,.(Total = length(unique(Employee.Number))),.(Manager.Name,Performance.Score,Sex)][order(-Total),]

ggplot(PerfbyManager,aes(x=Performance.Score,y=Total,fill=Performance.Score)) +
  geom_bar(stat="identity") +
  facet_wrap(~Manager.Name) +
  theme(axis.text.x = element_text(angle=90)) +
  ggtitle("Manager Wise Performance Rating")

When we look at the managers, the answer of which Manager gives better rating is

David Stanely and Janet King gives better rating ,and if you are working under Branon Miller there could be more chances you may get PIP rating(Performance Improvement Programme)

3 Conclusion

Thank you for taking the time to read through my first exploration of a dataset. I look forward to doing more.