Recep Durdu

24.10.2017

No-shows matter a great deal in the service sector, especially when the service is health care. This dataset contains medical appointment records from Brazil, including whether each patient showed up.

My goal is to predict whether a person will show up for an appointment, and to present some descriptive statistics about medical no-shows.

Let’s install the necessary packages and load the data.

Reading Data

mydata<-read.csv("KaggleV2-May-2016.csv",header=TRUE, stringsAsFactors = FALSE)

names(mydata)
##  [1] "PatientId"      "AppointmentID"  "Gender"         "ScheduledDay"  
##  [5] "AppointmentDay" "Age"            "Neighbourhood"  "Scholarship"   
##  [9] "Hipertension"   "Diabetes"       "Alcoholism"     "Handcap"       
## [13] "SMS_received"   "No.show"
  1. PatientId = ID of the patient
  2. AppointmentID = ID of the appointment
  3. Gender = Gender of the patient
  4. ScheduledDay = The day on which the appointment was scheduled
  5. AppointmentDay = The day on which the appointment is planned to take place
  6. Age = Age of the patient
  7. Neighbourhood = The place where the hospital is located
  8. Scholarship = Whether the patient has a scholarship
  9. Hipertension = Whether the patient has hypertension
  10. Diabetes = Whether the patient has diabetes
  11. Alcoholism = Whether the patient has alcoholism
  12. Handcap = Whether the patient has a handicap
  13. SMS_received = Whether the patient received an SMS about the appointment
  14. No.show = No-show information. “Yes” means the patient did not come to the appointment; “No” means the patient came.

A sample of the dataset is shown below:

library(knitr)
## Warning: package 'knitr' was built under R version 3.3.3
kable(mydata[1:5, ], caption = "Medical No Shows Sample")
Medical No Shows Sample

| PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No.show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2.987250e+13 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 5.589978e+14 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4.262962e+12 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 8.679512e+11 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 8.841186e+12 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |

Structure of the Data

str(mydata)
## 'data.frame':    110527 obs. of  14 variables:
##  $ PatientId     : num  2.99e+13 5.59e+14 4.26e+12 8.68e+11 8.84e+12 ...
##  $ AppointmentID : int  5642903 5642503 5642549 5642828 5642494 5626772 5630279 5630575 5638447 5629123 ...
##  $ Gender        : chr  "F" "M" "F" "F" ...
##  $ ScheduledDay  : chr  "2016-04-29T18:38:08Z" "2016-04-29T16:08:27Z" "2016-04-29T16:19:04Z" "2016-04-29T17:29:31Z" ...
##  $ AppointmentDay: chr  "2016-04-29T00:00:00Z" "2016-04-29T00:00:00Z" "2016-04-29T00:00:00Z" "2016-04-29T00:00:00Z" ...
##  $ Age           : int  62 56 62 8 56 76 23 39 21 19 ...
##  $ Neighbourhood : chr  "JARDIM DA PENHA" "JARDIM DA PENHA" "MATA DA PRAIA" "PONTAL DE CAMBURI" ...
##  $ Scholarship   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hipertension  : int  1 0 0 0 1 1 0 0 0 0 ...
##  $ Diabetes      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Alcoholism    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Handcap       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SMS_received  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ No.show       : chr  "No" "No" "No" "No" ...

We need to fix the data types of some variables before we can do some light exploration of the data. For example, variables such as Scholarship, Diabetes etc. are stored as 1 or 0 in the dataset, so we can convert them from integers to logical variables. We also store the patient ID as text and parse the two day columns as dates.

mydata$PatientId <- as.character((mydata$PatientId))
mydata$ScheduledDay<-as.Date(mydata$ScheduledDay)
mydata$AppointmentDay<-as.Date(mydata$AppointmentDay)
mydata$Handcap<- as.logical(mydata$Handcap)
mydata$Scholarship<- as.logical(mydata$Scholarship)
mydata$Hipertension<- as.logical(mydata$Hipertension)
mydata$Diabetes<- as.logical(mydata$Diabetes)
mydata$Alcoholism<- as.logical(mydata$Alcoholism)
mydata$SMS_received<- as.logical(mydata$SMS_received)
str(mydata)
## 'data.frame':    110527 obs. of  14 variables:
##  $ PatientId     : chr  "29872499824296" "558997776694438" "4262962299951" "867951213174" ...
##  $ AppointmentID : int  5642903 5642503 5642549 5642828 5642494 5626772 5630279 5630575 5638447 5629123 ...
##  $ Gender        : chr  "F" "M" "F" "F" ...
##  $ ScheduledDay  : Date, format: "2016-04-29" "2016-04-29" ...
##  $ AppointmentDay: Date, format: "2016-04-29" "2016-04-29" ...
##  $ Age           : int  62 56 62 8 56 76 23 39 21 19 ...
##  $ Neighbourhood : chr  "JARDIM DA PENHA" "JARDIM DA PENHA" "MATA DA PRAIA" "PONTAL DE CAMBURI" ...
##  $ Scholarship   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Hipertension  : logi  TRUE FALSE FALSE FALSE TRUE TRUE ...
##  $ Diabetes      : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
##  $ Alcoholism    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Handcap       : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ SMS_received  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ No.show       : chr  "No" "No" "No" "No" ...
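
The repeated as.logical() calls above can also be written as a single step over a vector of column names. The following is a minimal sketch, assuming a small hypothetical data frame `toy` in place of `mydata`; the name `flag_cols` is likewise an assumption made for illustration. Because as.logical() maps every nonzero integer to TRUE, the sketch first checks that each column really contains only 0 and 1.

```r
# Hypothetical stand-in for mydata with a few of the 0/1 indicator columns
toy <- data.frame(
  Scholarship  = c(0L, 1L, 0L),
  Diabetes     = c(0L, 0L, 1L),
  SMS_received = c(1L, 0L, 0L),
  No.show      = c("No", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Columns to convert (assumed names, using the dataset's spelling)
flag_cols <- c("Scholarship", "Diabetes", "SMS_received")

# as.logical() turns any nonzero value into TRUE, so verify 0/1 first
only_binary <- vapply(toy[flag_cols],
                      function(x) all(x %in% c(0L, 1L)),
                      logical(1))
stopifnot(all(only_binary))

# Convert all indicator columns in one step
toy[flag_cols] <- lapply(toy[flag_cols], as.logical)

# Store the outcome as a factor with a fixed level order
toy$No.show <- factor(toy$No.show, levels = c("No", "Yes"))

str(toy)
```

If the check fails for a column, that column holds more than two distinct values and converting it to logical would lose information.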

To see the number of no-shows (the No.show column is “Yes” when a patient did not come to the appointment):

check<- table(mydata$No.show)
check
## 
##    No   Yes 
## 88208 22319

Check the percentage of appointments where the patient did not show up:

options(repos="https://cran.rstudio.com" )
install.packages("formattable")
## Installing package into 'C:/Users/Hp-Nb/Documents/R/win-library/3.3'
## (as 'lib' is unspecified)
## package 'formattable' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Hp-Nb\AppData\Local\Temp\Rtmp8WQGSg\downloaded_packages
library(formattable)
## Warning: package 'formattable' was built under R version 3.3.3
ratio<-(check["Yes"]/(check["Yes"]+check["No"]))
percent(ratio,1)
##   Yes 
## 20.2%
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
ggplot(mydata, aes(x=No.show )) + geom_bar()
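
The same percentage can be obtained with base R alone, without installing an extra package. This is a sketch that re-enters the two counts printed by table(mydata$No.show) above so that it runs on its own; the names `shares` and `pct` are assumptions introduced here.

```r
# Counts copied from the table(mydata$No.show) output above
check <- c(No = 88208, Yes = 22319)

# Share of each outcome among all appointments
shares <- prop.table(check)

# Format as percentages with one decimal place
pct <- sprintf("%.1f%%", 100 * shares)
names(pct) <- names(shares)
print(pct)
```

With the real data, `prop.table(table(mydata$No.show))` gives the shares directly.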

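If we look at gender and appointment status, we can see that there is no large difference between females and males: from the counts below, about 20.3% of appointments for female patients and about 20.0% for male patients were no-shows.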

check2<- table(mydata$No.show, mydata$Gender)
check2
##      
##           F     M
##   No  57246 30962
##   Yes 14594  7725
ggplot(mydata, aes(x=Gender, fill=No.show )) + geom_bar()
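
To back the comparison with numbers rather than bar heights, the counts can be turned into within-gender proportions and passed to a test of independence. This is a sketch only: the counts are re-entered from the table above so the block runs on its own, and the chi-squared test is one possible choice here, not something the original analysis used.

```r
# Counts copied from the table(mydata$No.show, mydata$Gender) output above
check2 <- matrix(
  c(57246, 14594, 30962, 7725),
  nrow = 2,
  dimnames = list(No.show = c("No", "Yes"), Gender = c("F", "M"))
)

# Column-wise proportions: share of shows and no-shows within each gender
rates <- prop.table(check2, margin = 2)
print(round(rates, 3))

# Pearson chi-squared test of independence between gender and no-show
test <- chisq.test(check2)
print(test$p.value)
```

The "Yes" row of `rates` holds the no-show rate for each gender, and the p-value indicates whether the small gap between them is distinguishable from chance.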

We can evaluate whether sending a reminder SMS to patients changes the no-show ratio.

ggplot(mydata, aes(x=No.show )) + geom_bar() + facet_grid(.~SMS_received)

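According to the graph above, sending a reminder SMS could decrease the no-show ratio! Note, however, that the bars show raw counts and the two SMS groups differ in size, so the no-show share within each group has to be compared to confirm this.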
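
One way to compare the two SMS groups on equal footing is to compute the no-show rate within each group. The sketch below uses a tiny hypothetical data frame `toy` whose values are made up for illustration and are not taken from the dataset; with the real data, `mydata` would take its place.

```r
# Hypothetical toy data, NOT taken from the dataset; replace with mydata
toy <- data.frame(
  SMS_received = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  No.show      = c("Yes", "No", "No", "No", "Yes", "Yes", "No", "No"),
  stringsAsFactors = FALSE
)

# Cross-tabulate outcome by SMS group
counts <- table(No.show = toy$No.show, SMS_received = toy$SMS_received)

# Proportions within each SMS group (each column sums to 1)
rates <- prop.table(counts, margin = 2)
print(rates)

# No-show rate per group
no_show_rate <- rates["Yes", ]
print(no_show_rate)
```

In ggplot2, mapping `fill = No.show` and using `geom_bar(position = "fill")` draws the same within-group proportions as stacked bars.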