No-shows matter a great deal in the service sector, especially when the service is health care. This dataset contains medical appointment records from Brazil, including no-show information.
What I am trying to achieve is to predict whether a person will show up or not. I also try to present some statistical information about medical no-shows.
Let's install the necessary packages and load the data.
mydata<-read.csv("KaggleV2-May-2016.csv",header=TRUE, stringsAsFactors = FALSE)
names(mydata)
## [1] "PatientId" "AppointmentID" "Gender" "ScheduledDay"
## [5] "AppointmentDay" "Age" "Neighbourhood" "Scholarship"
## [9] "Hipertension" "Diabetes" "Alcoholism" "Handcap"
## [13] "SMS_received" "No.show"
A sample of the dataset is shown below:
library(knitr)
## Warning: package 'knitr' was built under R version 3.3.3
kable(mydata[1:5, ], caption = "Medical No Shows Sample")
PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No.show |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2.987250e+13 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
5.589978e+14 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
4.262962e+12 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
8.679512e+11 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
8.841186e+12 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |
str(mydata)
## 'data.frame': 110527 obs. of 14 variables:
## $ PatientId : num 2.99e+13 5.59e+14 4.26e+12 8.68e+11 8.84e+12 ...
## $ AppointmentID : int 5642903 5642503 5642549 5642828 5642494 5626772 5630279 5630575 5638447 5629123 ...
## $ Gender : chr "F" "M" "F" "F" ...
## $ ScheduledDay : chr "2016-04-29T18:38:08Z" "2016-04-29T16:08:27Z" "2016-04-29T16:19:04Z" "2016-04-29T17:29:31Z" ...
## $ AppointmentDay: chr "2016-04-29T00:00:00Z" "2016-04-29T00:00:00Z" "2016-04-29T00:00:00Z" "2016-04-29T00:00:00Z" ...
## $ Age : int 62 56 62 8 56 76 23 39 21 19 ...
## $ Neighbourhood : chr "JARDIM DA PENHA" "JARDIM DA PENHA" "MATA DA PRAIA" "PONTAL DE CAMBURI" ...
## $ Scholarship : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Hipertension : int 1 0 0 0 1 1 0 0 0 0 ...
## $ Diabetes : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Alcoholism : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Handcap : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SMS_received : int 0 0 0 0 0 0 0 0 0 0 ...
## $ No.show : chr "No" "No" "No" "No" ...
We need to convert the variables to appropriate types before doing some light exploration of the data. For example, we can change integer variables such as Scholarship, Diabetes, etc. to logical variables, because they are coded as 1 or 0 in the dataset.
mydata$PatientId <- as.character(mydata$PatientId)
mydata$ScheduledDay<-as.Date(mydata$ScheduledDay)
mydata$AppointmentDay<-as.Date(mydata$AppointmentDay)
mydata$Handcap<- as.logical(mydata$Handcap)
mydata$Scholarship<- as.logical(mydata$Scholarship)
mydata$Hipertension<- as.logical(mydata$Hipertension)
mydata$Diabetes<- as.logical(mydata$Diabetes)
mydata$Alcoholism<- as.logical(mydata$Alcoholism)
mydata$SMS_received<- as.logical(mydata$SMS_received)
str(mydata)
## 'data.frame': 110527 obs. of 14 variables:
## $ PatientId : chr "29872499824296" "558997776694438" "4262962299951" "867951213174" ...
## $ AppointmentID : int 5642903 5642503 5642549 5642828 5642494 5626772 5630279 5630575 5638447 5629123 ...
## $ Gender : chr "F" "M" "F" "F" ...
## $ ScheduledDay : Date, format: "2016-04-29" "2016-04-29" ...
## $ AppointmentDay: Date, format: "2016-04-29" "2016-04-29" ...
## $ Age : int 62 56 62 8 56 76 23 39 21 19 ...
## $ Neighbourhood : chr "JARDIM DA PENHA" "JARDIM DA PENHA" "MATA DA PRAIA" "PONTAL DE CAMBURI" ...
## $ Scholarship : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Hipertension : logi TRUE FALSE FALSE FALSE TRUE TRUE ...
## $ Diabetes : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
## $ Alcoholism : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Handcap : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ SMS_received : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ No.show : chr "No" "No" "No" "No" ...
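The repeated `as.logical()` calls above could also be written as a single step. The following is a minimal sketch of that idea, not the author's code: `toy` and `flag_cols` are hypothetical names, and the toy values are made up purely for illustration.

```r
# Toy data frame standing in for mydata (values are made up for illustration)
toy <- data.frame(
  Scholarship  = c(0L, 1L, 0L),
  Diabetes     = c(1L, 0L, 0L),
  SMS_received = c(0L, 0L, 1L),
  Age          = c(62L, 56L, 8L)
)

# Names of the 0/1 indicator columns to convert (hypothetical helper vector)
flag_cols <- c("Scholarship", "Diabetes", "SMS_received")

# Apply as.logical to each flag column; other columns are left untouched
toy[flag_cols] <- lapply(toy[flag_cols], as.logical)

str(toy)
```

On the real data, the same pattern would apply with `mydata` in place of `toy` and the full list of indicator columns in `flag_cols`.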
To see the number of no-shows (the No.show column takes the value “Yes” if a patient did not come to the appointment):
check<- table(mydata$No.show)
check
##
## No Yes
## 88208 22319
Check the percentage of patients who did not come to their appointment:
options(repos="https://cran.rstudio.com" )
install.packages("formattable")
## Installing package into 'C:/Users/Hp-Nb/Documents/R/win-library/3.3'
## (as 'lib' is unspecified)
## package 'formattable' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Hp-Nb\AppData\Local\Temp\Rtmp8WQGSg\downloaded_packages
library(formattable)
## Warning: package 'formattable' was built under R version 3.3.3
ratio<-(check["Yes"]/(check["Yes"]+check["No"]))
percent(ratio,1)
## Yes
## 20.2%
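The same percentage can be obtained without an extra package. This is a small base R sketch, offered as an alternative rather than the author's method; the counts are copied from the `table()` output above, and `shares` and `no_show_pct` are hypothetical names.

```r
# Counts copied from the table() output above
check <- c(No = 88208, Yes = 22319)

# Share of each outcome among all appointments
shares <- prop.table(check)

# Format the no-show share as a percentage with one decimal place
no_show_pct <- sprintf("%.1f%%", 100 * unname(shares["Yes"]))
print(no_show_pct)
```

With the full data frame loaded, `mean(mydata$No.show == "Yes")` should give the same share directly.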
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
ggplot(mydata, aes(x=No.show )) + geom_bar()
If we look at gender and appointment status, we can see that there is no noticeable difference between males and females: the counts below give a no-show rate of about 20.3% for females and about 20.0% for males.
check2<- table(mydata$No.show, mydata$Gender)
check2
##
## F M
## No 57246 30962
## Yes 14594 7725
ggplot(mydata, aes(x=Gender, fill=No.show )) + geom_bar()
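Because there are many more female than male patients, raw counts are hard to compare by eye. The sketch below, which is an illustration and not part of the original analysis, rebuilds the table from the counts printed above, computes the no-show rate within each gender, and runs a chi-squared test of independence; `rates` and `gender_test` are hypothetical names.

```r
# Counts copied from the check2 table above (rows: No.show, columns: Gender)
check2 <- matrix(
  c(57246, 14594, 30962, 7725),
  nrow = 2,
  dimnames = list(No.show = c("No", "Yes"), Gender = c("F", "M"))
)

# Column-wise proportions: the no-show rate within each gender
rates <- prop.table(check2, margin = 2)
print(round(rates, 3))

# Chi-squared test of independence between gender and no-show status
gender_test <- chisq.test(check2)
print(gender_test$p.value)
```

In the plot itself, `geom_bar(position = "fill")` would show the same within-group proportions instead of stacked counts.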
We can evaluate whether sending a reminder SMS to patients changes the no-show ratio.
ggplot(mydata, aes(x=No.show )) + geom_bar() + facet_grid(.~SMS_received)
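According to the graph above, sending a reminder SMS could decrease the no-show ratio, although the panels show raw counts for groups of different sizes, so the rate within each group should be compared before drawing a firm conclusion.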
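One way to compare the two SMS groups on equal footing is to compute the no-show rate within each group. The following is a minimal sketch under stated assumptions: the `toy` data frame is hypothetical and its values are made up, so the printed rates say nothing about the real dataset; only the column names match `mydata`.

```r
# Hypothetical toy data; the values are made up and are NOT taken from the dataset
toy <- data.frame(
  SMS_received = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
  No.show      = c("Yes", "No", "No", "No", "Yes", "Yes", "No", "No", "No", "No"),
  stringsAsFactors = FALSE
)

# Cross-tabulate appointment outcome by SMS status
sms_table <- table(No.show = toy$No.show, SMS_received = toy$SMS_received)

# Column-wise proportions give the no-show rate within each SMS group
sms_rates <- prop.table(sms_table, margin = 2)
print(round(sms_rates, 3))
```

Replacing `toy` with `mydata` would give the actual rates, which can then be read next to the faceted bar chart.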