I will explore the Titanic disaster data set available from Kaggle.com. The combined training and test data consist of 1309 passengers who were aboard the Titanic.
The first step in EDA is reading in the data and then exploring the variables.
Overview
The data has been split into two groups:
training set (train.csv): used to build machine learning models. It provides the outcome (also known as the “ground truth”) for each passenger, and the model will be based on “features” such as passengers’ gender and class.
test set (test.csv): used to see how well the model performs on unseen data. It does not provide the ground truth for each passenger.
In addition, gender_submission.csv is a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
DATA DICTIONARY
| Variable | Definition | Key |
|----------|------------|-----|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

Variable Notes
pclass: A proxy for socio-economic status (SES). 1st = Upper, 2nd = Middle, 3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
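The age encoding described above can be turned into two flags. This is a minimal sketch on made-up ages; the vector `age` and the names `is_infant` and `is_estimated` are illustrative assumptions, not part of the original analysis.

```r
# Toy ages, made up for illustration (not taken from the Titanic files)
age <- c(0.42, 22, 28.5, 40.5, NA, 35)

# Infants: age recorded as a fraction below 1
is_infant <- !is.na(age) & age < 1

# Estimated ages: 1 or older with a fractional part of exactly .5
is_estimated <- !is.na(age) & age >= 1 & (age %% 1 == 0.5)

data.frame(age, is_infant, is_estimated)
```

Missing ages are treated as neither infant nor estimated here; whether that is the right choice depends on how Age is imputed later.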
sibsp: The dataset defines family relations in this way… Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way… Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.
## Load all the required libraries
library('ggplot2')
## Warning: package 'ggplot2' was built under R version 3.4.3
library('caret')
## Warning: package 'caret' was built under R version 3.4.3
## Loading required package: lattice
library('dplyr')
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library('randomForest')
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library('rpart')
## Warning: package 'rpart' was built under R version 3.4.3
library('rpart.plot')
library('car')
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library('e1071')
## Warning: package 'e1071' was built under R version 3.4.3
library('ggthemes')
library('corrplot')
## corrplot 0.84 loaded
library('plyr')
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## Load the raw data in its original form by setting stringsAsFactors = FALSE
train.tit <- read.csv('C:/Titanic_Data/train.csv', stringsAsFactors = FALSE)
test.tit <- read.csv('C:/Titanic_Data/test.csv', stringsAsFactors = FALSE)
test.tit$Survived <- NA
##Combine both test and train
full_titanic <- rbind(train.tit, test.tit)
##Check the structure
str(full_titanic)
## 'data.frame': 1309 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
Check the dimensions of the data
dim(full_titanic)
## [1] 1309 12
The output shows that the data set has 1309 records and 12 columns.
Several of the columns are numeric (int and num), while Name, Sex, Ticket, Cabin and Embarked are character (chr) columns.
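As a small illustration of checking column types programmatically, the sketch below uses a toy data frame (an assumption for illustration, with only three columns) and base R's `sapply()`:

```r
# Toy data frame standing in for the combined Titanic data (illustrative only)
toy <- data.frame(
  PassengerId = 1:3,
  Fare = c(7.25, 71.2833, 7.925),
  Name = c("A", "B", "C"),
  stringsAsFactors = FALSE
)

# Class of every column
col_types <- sapply(toy, class)
col_types

# Split column names by type
numeric_cols <- names(toy)[sapply(toy, is.numeric)]
character_cols <- names(toy)[sapply(toy, is.character)]
```

The same calls should work on `full_titanic` unchanged, since it is also a plain data frame.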
Let’s check the head of the data to get a better sense of what the variables look like:
Check the first 5 rows
print(head(full_titanic,5))
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## Name Sex Age SibSp
## 1 Braund, Mr. Owen Harris male 22 1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1
## 3 Heikkinen, Miss. Laina female 26 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
## 5 Allen, Mr. William Henry male 35 0
## Parch Ticket Fare Cabin Embarked
## 1 0 A/5 21171 7.2500 S
## 2 0 PC 17599 71.2833 C85 C
## 3 0 STON/O2. 3101282 7.9250 S
## 4 0 113803 53.1000 C123 S
## 5 0 373450 8.0500 S
We can see that this dataset consists of numeric columns and columns with text data.
After getting a sense of the data’s structure, it is a good idea to look at a statistical summary of the variables with summary():
summary(full_titanic)
## PassengerId Survived Pclass Name
## Min. : 1 Min. :0.0000 Min. :1.000 Length:1309
## 1st Qu.: 328 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median : 655 Median :0.0000 Median :3.000 Mode :character
## Mean : 655 Mean :0.3838 Mean :2.295
## 3rd Qu.: 982 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :1309 Max. :1.0000 Max. :3.000
## NA's :418
## Sex Age SibSp Parch
## Length:1309 Min. : 0.17 Min. :0.0000 Min. :0.000
## Class :character 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.000
## Mode :character Median :28.00 Median :0.0000 Median :0.000
## Mean :29.88 Mean :0.4989 Mean :0.385
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :80.00 Max. :8.0000 Max. :9.000
## NA's :263
## Ticket Fare Cabin
## Length:1309 Min. : 0.000 Length:1309
## Class :character 1st Qu.: 7.896 Class :character
## Mode :character Median : 14.454 Mode :character
## Mean : 33.295
## 3rd Qu.: 31.275
## Max. :512.329
## NA's :1
## Embarked
## Length:1309
## Class :character
## Mode :character
##
##
##
##
glimpse(full_titanic)
## Observations: 1,309
## Variables: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0,...
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3,...
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bra...
## $ Sex <chr> "male", "female", "female", "female", "male", "mal...
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, ...
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4,...
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1,...
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "1138...
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, ...
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", ...
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", ...
Missing value imputation
We can see from the summary that Age and Fare have missing values (NA), and that there is a large range in Fare. Cabin and Embarked contain empty strings rather than NA, so they need a separate check. Naturally, Survived is missing for all 418 test rows.
### Are there any missing observations?
colSums(is.na(full_titanic))
## PassengerId Survived Pclass Name Sex Age
## 0 418 0 0 0 263
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 1 0 0
#### Count empty strings per column (columns that contain NA report NA)
colSums(full_titanic=='')
## PassengerId Survived Pclass Name Sex Age
## 0 NA 0 0 0 NA
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 NA 1014 2
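The NA entries in the empty-string counts appear because comparing an NA with '' yields NA. The following sketch, on a toy data frame that is an assumption for illustration, shows two hedged ways around this: passing `na.rm = TRUE`, or recoding empty strings as NA so that a single `is.na()` check covers both kinds of missingness. The helper `blank_to_na` is a hypothetical name.

```r
# Toy data with both NA and empty strings (illustrative only)
toy <- data.frame(
  Age = c(22, NA, 30),
  Cabin = c("", "C85", ""),
  Embarked = c("S", "", "C"),
  stringsAsFactors = FALSE
)

# Skip NAs while counting empty strings
empty_counts <- colSums(toy == "", na.rm = TRUE)

# Hypothetical helper: turn "" into NA in character columns only
blank_to_na <- function(col) {
  if (is.character(col)) col[col == ""] <- NA
  col
}

toy_na <- toy
toy_na[] <- lapply(toy, blank_to_na)

# One check now reports every missing value
na_counts <- colSums(is.na(toy_na))
```

Recoding to NA tends to be convenient before imputation, because most R functions recognise NA but not empty strings.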
names <- full_titanic$Name
title <- gsub("^.*, (.*?)\\..*$", "\\1", names)
full_titanic$title <- title
table(title)
## title
## Capt Col Don Dona Dr
## 1 4 1 1 8
## Jonkheer Lady Major Master Miss
## 1 1 2 61 260
## Mlle Mme Mr Mrs Ms
## 2 1 757 197 2
## Rev Sir the Countess
## 8 1 1
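To see what the regular expression does, here is a small sketch that applies the same pattern to a few names. The first three names appear in the data preview above; the last one is made up to show a multi-word title.

```r
# Sample names: three from the data preview, one hypothetical
sample_names <- c(
  "Braund, Mr. Owen Harris",
  "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
  "Heikkinen, Miss. Laina",
  "Doe, the Countess. of (Jane)"  # hypothetical name
)

# Drop everything up to ", ", keep the text before the next ".",
# and drop the rest of the string
titles <- gsub("^.*, (.*?)\\..*$", "\\1", sample_names)
titles
```

The pattern relies on every name having the form "Surname, Title. Given names"; a name without a comma or a period would be returned unchanged.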
### Miss, Mrs, Master and Mr account for most of the passengers
### It is better to group the other titles into larger categories, based on gender and survival rate, to avoid overfitting
full_titanic$title[full_titanic$title == 'Mlle'] <- 'Miss'
full_titanic$title[full_titanic$title == 'Ms'] <- 'Miss'
full_titanic$title[full_titanic$title == 'Mme'] <- 'Mrs'
full_titanic$title[full_titanic$title == 'Lady'] <- 'Miss'
full_titanic$title[full_titanic$title == 'Dona'] <- 'Miss'
## I am concerned that creating a new category from so few observations can cause overfitting
## However, merging the titles below into the existing categories may lose some predictive power, as they are all army officers, doctors and nobles
full_titanic$title[full_titanic$title == 'Capt'] <- 'Officer'
full_titanic$title[full_titanic$title == 'Col'] <- 'Officer'
full_titanic$title[full_titanic$title == 'Major'] <- 'Officer'
full_titanic$title[full_titanic$title == 'Dr'] <- 'Officer'
full_titanic$title[full_titanic$title == 'Rev'] <- 'Officer'
full_titanic$title[full_titanic$title == 'Don'] <- 'Officer'
full_titanic$title[full_titanic$title == 'Sir'] <- 'Officer'
full_titanic$title[full_titanic$title == 'the Countess'] <- 'Officer'
full_titanic$title[full_titanic$title == 'Jonkheer'] <- 'Officer'
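The same regrouping can be written more compactly with `%in%`. This is only a sketch of an equivalent formulation on a toy vector; the vectors `miss_like` and `officer_like` are hypothetical names that mirror the mapping used above.

```r
# Toy title vector (illustrative only)
title <- c("Mr", "Mlle", "Ms", "Mme", "Capt", "Dr", "Master", "Jonkheer")

# Groups mirroring the one-by-one assignments above
miss_like <- c("Mlle", "Ms", "Lady", "Dona")
officer_like <- c("Capt", "Col", "Major", "Dr", "Rev", "Don", "Sir",
                  "the Countess", "Jonkheer")

title[title %in% miss_like] <- "Miss"
title[title == "Mme"] <- "Mrs"
title[title %in% officer_like] <- "Officer"

table(title)
```

Keeping the groups in named vectors makes it easier to move a title from one group to another later.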
# Let's check which titles (Mr, Master, Miss, ...) have a better survival rate
ggplot(full_titanic[1:891,],aes(x = title,fill=factor(Survived))) +
geom_bar() +
ggtitle("Title V/S Survival rate")+
xlab("Title") +
ylab("Total Count") +
labs(fill = "Survived")
# Let's create a family size variable using SibSp and Parch
full_titanic$FamilySize <-full_titanic$SibSp + full_titanic$Parch + 1
full_titanic$FamilySized[full_titanic$FamilySize == 1] <- 'Single'
full_titanic$FamilySized[full_titanic$FamilySize < 5 & full_titanic$FamilySize >= 2] <- 'Small'
full_titanic$FamilySized[full_titanic$FamilySize >= 5] <- 'Big'
full_titanic$FamilySized=as.factor(full_titanic$FamilySized)
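As an alternative sketch, the same three bins can be produced in one step with `cut()`. The toy vectors `sib_sp` and `parch` are assumptions for illustration; the cut points match the rules above (1 = Single, 2 to 4 = Small, 5 or more = Big).

```r
# Toy sibling/spouse and parent/child counts (illustrative only)
sib_sp <- c(0, 1, 1, 3, 4, 0)
parch  <- c(0, 0, 2, 1, 2, 1)

# Family size counts the passenger too
family_size <- sib_sp + parch + 1

# Intervals are (0,1], (1,4] and (4,Inf]
family_sized <- cut(
  family_size,
  breaks = c(0, 1, 4, Inf),
  labels = c("Single", "Small", "Big")
)

table(family_sized)
```

One difference from `as.factor()`: here the factor levels follow the order of `labels` (Single, Small, Big) rather than alphabetical order, which also changes the order of the bars in a plot.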
### Let's visualize the survival rate by family size
ggplot(full_titanic[1:891,],aes(x = FamilySized,fill=factor(Survived))) +
geom_bar() +
ggtitle("Family Size V/S Survival Rate") +
xlab("FamilySize") +
ylab("Total Count") +
labs(fill = "Survived")
Exploratory Analysis on Embarked
### Is there any association between survival rate and the port where a passenger boarded the ship?
ggplot(full_titanic[1:891,],aes(x = Embarked,fill=factor(Survived))) +
geom_bar() +
ggtitle("Embarked vs Survival") +
xlab("Embarked") +
ylab("Total Count") +
labs(fill = "Survived")
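The bar charts show counts, so groups of different sizes are hard to compare. A survival rate per group can be computed with `prop.table()`; the sketch below uses made-up toy data, so the numbers say nothing about the real Titanic passengers.

```r
# Toy data (made up, not from the Titanic files)
toy <- data.frame(
  Embarked = c("S", "S", "S", "C", "C", "Q", "Q", "S"),
  Survived = c(0, 1, 0, 1, 1, 0, 1, 0)
)

# Cross-tabulate port against outcome
counts <- table(toy$Embarked, toy$Survived)

# margin = 1 makes each row (port) sum to 1
rates <- prop.table(counts, margin = 1)
round(rates, 2)
```

The column labelled 1 is then the share of survivors for each port. The same approach applies to any of the grouping variables plotted above.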
Age vs Survived
ggplot(full_titanic, aes(Age, fill = factor(Survived))) +
geom_histogram(bins=30) +
theme_few() +
xlab("Age") +
scale_fill_discrete(name = "Survived") +
ggtitle("Age vs Survived")
## Warning: Removed 263 rows containing non-finite values (stat_bin).
Sex vs Survived
# Sex vs Survived
ggplot(full_titanic, aes(Sex, fill = factor(Survived))) +
geom_bar(stat = "count", position = 'dodge')+
theme_few() +
xlab("Sex") +
ylab("Count") +
scale_fill_discrete(name = "Survived") +
ggtitle("Sex vs Survived")
tapply(full_titanic$Survived,full_titanic$Sex,mean)
## female male
## NA NA
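Both group means come back as NA because Survived is NA for the test rows, and `mean()` returns NA when any input is NA. A minimal sketch on toy data (an assumption for illustration) shows two ways to get the rates: drop the NAs inside `mean()`, or restrict the calculation to labelled rows first.

```r
# Toy data: the last two rows play the role of unlabelled test rows
toy <- data.frame(
  Sex = c("female", "female", "male", "male", "male", "female", "male"),
  Survived = c(1, 1, 0, 0, 1, NA, NA)
)

# Reproduces the problem: NA for both groups
tapply(toy$Survived, toy$Sex, mean)

# Option 1: extra arguments to tapply() are passed on to mean()
rates <- tapply(toy$Survived, toy$Sex, mean, na.rm = TRUE)

# Option 2: keep only the rows with a known outcome
labelled <- toy[!is.na(toy$Survived), ]
rates_labelled <- tapply(labelled$Survived, labelled$Sex, mean)
```

On the combined Titanic data, the second option corresponds to using only the training rows, as the earlier plots do with `full_titanic[1:891,]`.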
Age vs Sex vs Survived
#Sex vs Survived vs Age
ggplot(full_titanic, aes(Age, fill = factor(Survived))) +
geom_histogram(bins=30) +
theme_few() +
xlab("Age") +
ylab("Count") +
facet_grid(.~Sex)+
scale_fill_discrete(name = "Survived") +
theme_few()+
ggtitle("Age vs Sex vs Survived")
## Warning: Removed 263 rows containing non-finite values (stat_bin).
References
https://www.kaggle.com/vincentlugat/titanic-data-analysis-rf-prediction-0-81818/notebook