I will try to explore the Titanic disaster training set available from Kaggle.com. The data set consists of 1309 paasengers who rode aboard the Titanic.

  1. Explore the variables

The first step in EDA is reading in data and then exploring the variables.

Overview

The data has been split into two groups:

Gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.

DATA DICTIONARY

VariableDefinitionKey survival Survival 0 = No, 1 = Yes pclass Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd sex Sex Age Age in years sibsp # of siblings / spouses aboard the Titanic parch # of parents / children aboard the Titanic ticket Ticket number fare Passenger fare cabin Cabin number embarked Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton Variable Notes pclass: A proxy for socio-economic status (SES) 1st = Upper 2nd = Middle 3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5

sibsp: The dataset defines family relations in this way… Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way… Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.

##Load all the library required 

library('ggplot2')
## Warning: package 'ggplot2' was built under R version 3.4.3
library('caret') 
## Warning: package 'caret' was built under R version 3.4.3
## Loading required package: lattice
library('dplyr') 
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library('randomForest') 
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library('rpart')
## Warning: package 'rpart' was built under R version 3.4.3
library('rpart.plot')
library('car')
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
library('e1071')
## Warning: package 'e1071' was built under R version 3.4.3
library('ggthemes')
library('corrplot')
## corrplot 0.84 loaded
library('plyr')
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
##Lets Load raw data in the orginal form by setting stringsAsFactors = F

train.tit <- read.csv('C:/Titanic_Data/train.csv', stringsAsFactors = F)
test.tit  <- read.csv('C:/Titanic_Data/test.csv', stringsAsFactors = F)
test.tit$Survived <- NA

##Combine both test and train
full_titanic <- rbind(train.tit, test.tit)
##Check the structure
str(full_titanic)
## 'data.frame':    1309 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

Check the dimensions of the data

dim(full_titanic)
## [1] 1309   12

The output shows us that data set of 1309 records and 12 columns.

Several of the column variables are encoded as numeric data types (ints and floats) but a few of them are encoded as “object”.

Let’s check the head of the data to get a better sense of what the variables look like:

Check the first 5 rows

print(head(full_titanic,5))
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
##                                                  Name    Sex Age SibSp
## 1                             Braund, Mr. Owen Harris   male  22     1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
## 3                              Heikkinen, Miss. Laina female  26     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1
## 5                            Allen, Mr. William Henry   male  35     0
##   Parch           Ticket    Fare Cabin Embarked
## 1     0        A/5 21171  7.2500              S
## 2     0         PC 17599 71.2833   C85        C
## 3     0 STON/O2. 3101282  7.9250              S
## 4     0           113803 53.1000  C123        S
## 5     0           373450  8.0500              S

We can see that this dataset is consists of numeric columns and columns with text data.

Let’s look at a statistical summary of the variables with summary

After getting a sense of the data’s structure, it is a good idea to look at a statistical summary of the variables with df.describe():

summary(full_titanic)
##   PassengerId      Survived          Pclass          Name          
##  Min.   :   1   Min.   :0.0000   Min.   :1.000   Length:1309       
##  1st Qu.: 328   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median : 655   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   : 655   Mean   :0.3838   Mean   :2.295                     
##  3rd Qu.: 982   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :1309   Max.   :1.0000   Max.   :3.000                     
##                 NA's   :418                                        
##      Sex                 Age            SibSp            Parch      
##  Length:1309        Min.   : 0.17   Min.   :0.0000   Min.   :0.000  
##  Class :character   1st Qu.:21.00   1st Qu.:0.0000   1st Qu.:0.000  
##  Mode  :character   Median :28.00   Median :0.0000   Median :0.000  
##                     Mean   :29.88   Mean   :0.4989   Mean   :0.385  
##                     3rd Qu.:39.00   3rd Qu.:1.0000   3rd Qu.:0.000  
##                     Max.   :80.00   Max.   :8.0000   Max.   :9.000  
##                     NA's   :263                                     
##     Ticket               Fare            Cabin          
##  Length:1309        Min.   :  0.000   Length:1309       
##  Class :character   1st Qu.:  7.896   Class :character  
##  Mode  :character   Median : 14.454   Mode  :character  
##                     Mean   : 33.295                     
##                     3rd Qu.: 31.275                     
##                     Max.   :512.329                     
##                     NA's   :1                           
##    Embarked        
##  Length:1309       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
glimpse(full_titanic)
## Observations: 1,309
## Variables: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0,...
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3,...
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bra...
## $ Sex         <chr> "male", "female", "female", "female", "male", "mal...
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, ...
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4,...
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1,...
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "1138...
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, ...
## $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", ...
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", ...

Missing value imputation

We can see from summary that Age, Fare, and Embarked have missing values, and that there is a large range in Fare. Naturally, Survived is missing for all test data rows.

###is there any Missing obesrvation
colSums(is.na(full_titanic))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0         418           0           0           0         263 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           1           0           0
####Empty data
colSums(full_titanic=='')
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0          NA           0           0           0          NA 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0          NA        1014           2
names <- full_titanic$Name
title <-  gsub("^.*, (.*?)\\..*$", "\\1", names)

full_titanic$title <- title

table(title) 
## title
##         Capt          Col          Don         Dona           Dr 
##            1            4            1            1            8 
##     Jonkheer         Lady        Major       Master         Miss 
##            1            1            2           61          260 
##         Mlle          Mme           Mr          Mrs           Ms 
##            2            1          757          197            2 
##          Rev          Sir the Countess 
##            8            1            1
###MISS, Mrs, Master and Mr are taking more numbers

###Better to group Other titles into bigger basket by checking gender and survival rate to aviod any overfitting


full_titanic$title[full_titanic$title == 'Mlle']        <- 'Miss' 
full_titanic$title[full_titanic$title == 'Ms']          <- 'Miss'
full_titanic$title[full_titanic$title == 'Mme']         <- 'Mrs' 
full_titanic$title[full_titanic$title == 'Lady']          <- 'Miss'
full_titanic$title[full_titanic$title == 'Dona']          <- 'Miss'

## I am afraid creating a new varible with small data can causes a overfit
## However, My thinking is that combining below feauter into original variable may loss some predictive power as they are all army folks, doctor and nobel peoples 

full_titanic$title[full_titanic$title == 'Capt']        <- 'Officer' 
full_titanic$title[full_titanic$title == 'Col']        <- 'Officer' 
full_titanic$title[full_titanic$title == 'Major']   <- 'Officer'
full_titanic$title[full_titanic$title == 'Dr']   <- 'Officer'
full_titanic$title[full_titanic$title == 'Rev']   <- 'Officer'
full_titanic$title[full_titanic$title == 'Don']   <- 'Officer'
full_titanic$title[full_titanic$title == 'Sir']   <- 'Officer'
full_titanic$title[full_titanic$title == 'the Countess']   <- 'Officer'
full_titanic$title[full_titanic$title == 'Jonkheer']   <- 'Officer'


# Lets check who among Mr, Master, Miss having a better survival rate
 ggplot(full_titanic[1:891,],aes(x = title,fill=factor(Survived))) +
  geom_bar() +
  ggtitle("Title V/S Survival rate")+
  xlab("Title") +
  ylab("Total Count") +
  labs(fill = "Survived") 

# Lets create a Family size using Sibsp and Parch

full_titanic$FamilySize <-full_titanic$SibSp + full_titanic$Parch + 1

full_titanic$FamilySized[full_titanic$FamilySize == 1]   <- 'Single'
full_titanic$FamilySized[full_titanic$FamilySize < 5 & full_titanic$FamilySize >= 2]   <- 'Small'
full_titanic$FamilySized[full_titanic$FamilySize >= 5]   <- 'Big'

full_titanic$FamilySized=as.factor(full_titanic$FamilySized)


###Lets Visualize the Survival rate by Family size 
ggplot(full_titanic[1:891,],aes(x = FamilySized,fill=factor(Survived))) +
  geom_bar() +
  ggtitle("Family Size V/S Survival Rate") +
  xlab("FamilySize") +
  ylab("Total Count") +
  labs(fill = "Survived")

Exploratory Analysis on Embarked

###is there any association between Survial rate and where he get into the Ship.   
 ggplot(full_titanic[1:891,],aes(x = Embarked,fill=factor(Survived))) +
  geom_bar() +
  ggtitle("Embarked vs Survival") +
  xlab("Embarked") +
  ylab("Total Count") +
  labs(fill = "Survived") 

Age vs Survived

ggplot(full_titanic, aes(Age, fill = factor(Survived))) + 
  geom_histogram(bins=30) + 
  theme_few() +
  xlab("Age") +
  scale_fill_discrete(name = "Survived") + 
  ggtitle("Age vs Survived")
## Warning: Removed 263 rows containing non-finite values (stat_bin).

Sex vs Survived

# Sex vs Survived
ggplot(full_titanic, aes(Sex, fill = factor(Survived))) + 
  geom_bar(stat = "count", position = 'dodge')+
  theme_few() +
  xlab("Sex") +
  ylab("Count") +
  scale_fill_discrete(name = "Survived") + 
  ggtitle("Sex vs Survived")

tapply(full_titanic$Survived,full_titanic$Sex,mean)
## female   male 
##     NA     NA

Age vs Sex vs Survived

#Sex vs Survived vs Age 
ggplot(full_titanic, aes(Age, fill = factor(Survived))) + 
  geom_histogram(bins=30) + 
  theme_few() +
  xlab("Age") +
  ylab("Count") +
  facet_grid(.~Sex)+
  scale_fill_discrete(name = "Survived") + 
  theme_few()+
  ggtitle("Age vs Sex vs Survived")
## Warning: Removed 263 rows containing non-finite values (stat_bin).

References

https://www.kaggle.com/vincentlugat/titanic-data-analysis-rf-prediction-0-81818/notebook

https://www.kaggle.com/swamysm/beginners-titanic