Introduction

This project, by R_Junkies, is about past airplane crashes, from 1908 to 2017. The project is prepared specifically for BDA 503, 2017, MEF University; course given by Berk Orbay.

About Group Members

We call ourselves: R_Junkies. It represents our humble interest in Data Science. It’s a life-style.

Group members are below (Ladies first):

About R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

Gathering Data

Here is the list of datasets and sources we will be using in this project:

  • Airplane Crashes Since 1908 - Kaggle Link
  • Air transport, passengers carried (1970-2016) - World Bank Link
  • Air transport, registered carrier departures worldwide (1970-2016) - World Bank Link
  • Airplane Crashes Since 2009 - Plane Crash Info Link

Our dataset taken from Kaggle contains crashes from 1908 to mid 2009. In order to analyze data more accurately, we crawled dataset from http://www.planecrashinfo.com/. With the following Python code, we created a new file contains only data from 2009 to 2017.


from bs4 import BeautifulSoup
import re
from urllib import request
import datetime
import csv

# Request URL and return response as BeautifulSoup object
def makeBeautifulSoupObject(url):
  requestConn = request.urlopen(url)
responseHTML = requestConn.read()
requestConn.close()
soup = BeautifulSoup(responseHTML, "lxml")
soup.decode(eventual_encoding="UTF-8")
return soup

def parseHTML(table_):
  record = {}
table = BeautifulSoup(str(table_[0]), 'html.parser')

for tr in table.find_all("tr")[1:]:
  tds = tr.find_all("td")
# encoding the 'value' string to utf-8 and removing any non-breaking space (HTML Character)
tmp_str = tds[1].string  # .string.encode('utf-8').replace(" ", "")
value = str(tmp_str)  # this is the value- In Column #2 of the HTML table
key = tds[0].string  # this is the key- In Column #1 of the HTML table
# print(tds[0])
if key == "Date:":
  dat = str(value).replace(',', '')
date = datetime.datetime.strptime(dat, '%B %d %Y')
record["date"] = date
elif key == "Time:":
  if not value == '?':
  time = re.sub("[^0-9]", "", value)
record["time"] = time[0:2] + ":" + time[2:4]
else:
  record["time"] = ''
elif key == "Location:":
  if not value == '?':
  record["loc"] = str(value)
else:
  record["loc"] = ''
elif key == "Operator:":
  if not value == '?':
  record["op"] = str(value)
else:
  record["op"] = ''
elif key == "Flight #:":
  if not value == '?':
  record["flight"] = str(value)
else:
  record["flight"] = ''
elif key == "Route:":
  if not value == '?':
  record["route"] = str(value)
else:
  record["route"] = ''
elif key == "Registration:":
  if not value == '?':
  record["reg"] = str(value).encode("utf-8")
else:
  record["reg"] = ''
elif key == "cn / ln:":
  if not value == '?':
  record["cnln"] = str(value)
else:
  record["cnln"] = ''
elif key == "Aboard:":
  if not value == '?':
  s = ' '.join(value.split())
aboard_ = s.replace('(', '').replace(')', '').split(' ')
if aboard_[0] != '?':
  record["aboard_total"] = aboard_[0]
else:
  record["aboard_total"] = 'NULL'

passengers = aboard_[1].replace("passengers:", "")
if passengers != '?':
  record["aboard_passengers"] = passengers
else:
  record["aboard_passengers"] = 'NULL'

crew = aboard_[2].replace("crew:", "")
if crew != '?':
  record["aboard_crew"] = crew
else:
  record["aboard_crew"] = 'NULL'
else:
  record["aboard_total"] = 'NULL'
record["aboard_passengers"] = 'NULL'
record["aboard_crew"] = 'NULL'
elif key == "Fatalities:":
  if not value == '?':
  s = ' '.join(value.split())
fatalities_ = s.replace('(', '').replace(')', '').split(' ')

if fatalities_[0] != '?':
  record["fatalities_total"] = fatalities_[0]
else:
  record["fatalities_total"] = 'NULL'

passengers = fatalities_[1].replace("passengers:", "")
if passengers != '?':
  record["fatalities_passengers"] = passengers
else:
  record["fatalities_passengers"] = 'NULL'

crew = fatalities_[2].replace("crew:", "")
if crew != '?':
  record["fatalities_crew"] = crew
else:
  record["fatalities_crew"] = 'NULL'
else:
  record["aboard_total"] = 'NULL'
record["aboard_passengers"] = 'NULL'
record["aboard_crew"] = 'NULL'
elif key == "Ground:":
  if not value == '?':
  record["ground"] = str(value)
else:
  record["ground"] = 'NULL'
elif key == "Summary:":
  if not value == '?':
  record["summary"] = str(value).replace('\n','')
else:
  record["summary"] = ''
elif key == "AC Type:":
  if not value == '?':
  record["actype"] = str(value)
else:
  record["actype"] = ''
else:
  st1 = ''.join(tds[0].string.split()).lower()
st1 = st1.replace(':', '')
if not value == '?':
  record[st1] = str(value)
else:
  record[st1] = "NULL"
return record

### Main Program ###

# Root URL
rooturl = "http://www.planecrashinfo.com"
# From which year data is taken
start_year = 2009
# to which year data is taken
end_year = 2009

# Create a new file
f = csv.writer(open("Airplane_Crashes_and_Fatalities_Since_2009.csv", "w", newline=''), delimiter=',')
# Header info
f.writerow(["Date", "Time", "Location", "Operator", "Flight #", "Route", "Type", "Registration", "cn/In", "Aboard", "Fatalities", "Ground", "Summary"])

for i in range(start_year, end_year + 1, 1):
  year_start = datetime.datetime.utcnow()
# appending the path (year) to the url hostname
# Sample page : http://www.planecrashinfo.com/2008/2008.htm
newurl = rooturl + "/" + str(i) + "/" + str(i) + ".htm"
#Get the main page for each year
soup = makeBeautifulSoupObject(newurl)
tables = soup.find_all('table')

#Each page contains the sumamry crash info
for table in tables:
  # finding the no. of records for the given year
  number_of_rows = len(table.findAll(lambda tag: tag.name == 'tr' and tag.findParent('table') == table))
number_of_rows = 3

for j in range(1, number_of_rows, 1):
  # appending the row number to sub-path of the url, and building the final url that will be used for sending http request
  #Sample : http://www.planecrashinfo.com/2008/2008-1.htm
  accident_url = newurl.replace(".htm", "") + "-" + str(j) + ".htm"
web_record = makeBeautifulSoupObject(accident_url)
# removing all the boilerplate html code except the data table
table_details = web_record.find_all('table')
#Parse table and return crash record
crashRecord = parseHTML(table_details)
#Write crash record to file
f.writerow([crashRecord["date"].strftime('%m/%d/%Y'),
            crashRecord["time"],
            crashRecord["loc"],
            crashRecord["op"],
            crashRecord["flight"],
            crashRecord["route"],
            crashRecord["actype"],
            crashRecord["reg"],
            crashRecord["cnln"],
            crashRecord["aboard_total"],
            # record["aboard_passengers"],
            # record["aboard_crew"],
            crashRecord["fatalities_total"],
            # record["fatalities_passengers"],
            # record["fatalities_crew"],
            crashRecord["ground"],
            crashRecord["summary"]
            ])

Airplane Crashes

# Original Kaggle file
apc_raw_to_2009 <- read.csv(file="https://raw.githubusercontent.com/MEF-BDA503/gpj-rjunkies/master/files/project_data/Airplane_Crashes_and_Fatalities_Since_1908.csv", header=TRUE, sep=",")

# Data we crawled from PlaneCrashInfo.com
apc_raw_from_2009 <- read.csv(file="https://raw.githubusercontent.com/MEF-BDA503/gpj-rjunkies/master/files/project_data/Airplane_Crashes_and_Fatalities_Since_2009.csv", header=TRUE, sep=",")

Passengers Carried

air_psgr <- read.csv(file="https://raw.githubusercontent.com/MEF-BDA503/gpj-rjunkies/master/files/project_data/is_air_psgr_melt.csv", header=TRUE, sep=",", quote="\"", na.strings="NA")

Registered Carrier Departures Worldwide

air_dprt <- read.csv(file="https://raw.githubusercontent.com/MEF-BDA503/gpj-rjunkies/master/files/project_data/is_air_dprt_melt.csv",header=TRUE, sep=",", quote="\"", na.strings="NA")


Data Preprocessing & Cleaning

Here are the necessary libraries. We checked if necessary libraries are installed properly.

necessary_lib <- c("stringr","tidyverse","ggthemes","grid","gridExtra","scales","tm","SnowballC","wordcloud","RColorBrewer","plotly","rworldmap")

# Lets check these necessary libraries against already installed packages.
install_lib <- necessary_lib[!(necessary_lib %in% installed.packages())]

if(length(install_lib)!=0){
  install.packages(install_lib)
}

Load necessary libraries

lapply(necessary_lib, require, character.only = TRUE)

Next, we merge the dataset from Kaggle and the dataset we crawled from PlaneCrashInfo.
Kaggle dataset contains duplicate values for 2009 crashes, so we omit them.

apc_raw_to_2009 <-  apc_raw_to_2009 %>%
                    mutate(year=format(as.Date(apc_raw_to_2009$Date, format="%m/%d/%Y"),"%Y")) %>%
                    filter(as.numeric(year) < 2009) %>%
                    select(-year)

# Binding datasets
apc_raw <- rbind(apc_raw_to_2009, apc_raw_from_2009)

We first create a new data frame called “apc_clean” which we will manipulate for our analysis.

# New data frame
apc_clean <- apc_raw

# Dimensions
dim(apc_clean)
## [1] 5543   13
# Get an idea of apc_clean columns
str(apc_clean)
## 'data.frame':    5543 obs. of  13 variables:
##  $ Date        : Factor w/ 5014 levels "01/01/1966","01/01/1970",..: 3297 2372 2699 3184 3682 852 3099 2585 3387 3469 ...
##  $ Time        : Factor w/ 1035 levels "","00:00","00:01",..: 702 200 1 757 384 36 609 1 36 985 ...
##  $ Location    : Factor w/ 4530 levels "","100 miles SW of Kuujjuaq, Quebec, Canada",..: 849 123 4179 3388 2291 4058 3113 2285 283 3550 ...
##  $ Operator    : Factor w/ 2708 levels "","Aaxico Airlines",..: 1577 1590 1825 1466 1466 1466 1466 1465 1466 1466 ...
##  $ Flight..    : Factor w/ 779 levels "","-","002","004",..: 1 1 2 1 1 1 1 1 1 1 ...
##  $ Route       : Factor w/ 3470 levels "","Abaco - Miami",..: 830 2982 1 1 1 1 1 1 1 1 ...
##  $ Type        : Factor w/ 2614 levels "","AAC-1 Toucan",..: 2419 1145 1024 2434 2437 2446 2433 2148 2439 2438 ...
##  $ Registration: Factor w/ 5184 levels ""," / ","01 / 02 / 03",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ cn.In       : Factor w/ 3913 levels "","0001","0001A",..: 170 1 1 1 1 1 1 1 1 1 ...
##  $ Aboard      : int  2 5 1 20 30 41 19 20 22 19 ...
##  $ Fatalities  : int  1 5 1 14 30 21 19 20 22 19 ...
##  $ Ground      : chr  "0" "0" "0" "0" ...
##  $ Summary     : Factor w/ 4959 levels "","31-7952246The pilot reported that he had a rough running engine and was making an emergency landing at Charlo A"| __truncated__,..: 1544 1676 3566 3207 1820 1175 1637 1191 2228 2266 ...

There are columns which we can’t use for any analysis. These are: “Flight..”, “Registration”, “cn.In”, “Ground

# Delete "FLight.." column
apc_clean$Flight.. <- NULL

# Delete "Registration" column
apc_clean$Registration <- NULL

# Delete "cn.In" column
apc_clean$cn.In <- NULL

# Delete "Ground" column
apc_clean$Ground <- NULL

Beautiful. We dropped unnecessary columns.
There are some non-ASCII characters in some cells. We should remove them.

nonutfeight <- apc_clean$Operator
nonutfeight[5315]
## [1] A\xe9ropro
## 2708 Levels:  Aaxico Airlines Ababeel Aviaition ... Yemenia Airway
apc_clean$Operator  <- gsub("[^\x20-\x7E]", "", nonutfeight)
apc_clean$Operator[5315]
## [1] "Aropro"
nonutfeight <- NULL

nonutfeight <- apc_clean$Summary
nonutfeight[5249]
## [1] The plane was being used as an air taxi to ferry passengers between cities when it crashed in the Manacapuru River, 50 miles from Manus. The plane took off under warning of strong winds and rain. The crew asked permission to return shortly after takeoff because they lost an engine.\xa0The plane disappeared from radar and lost contact with ATC.\xa0The plane crashed into the river when the crew tried to land the plane on an abandoned runway in the town of Manacapuru.
## 4959 Levels:  ...
apc_clean$Summary  <- gsub("[^\x20-\x7E]", "", nonutfeight)
apc_clean$Summary[5249]
## [1] "The plane was being used as an air taxi to ferry passengers between cities when it crashed in the Manacapuru River, 50 miles from Manus. The plane took off under warning of strong winds and rain. The crew asked permission to return shortly after takeoff because they lost an engine.The plane disappeared from radar and lost contact with ATC.The plane crashed into the river when the crew tried to land the plane on an abandoned runway in the town of Manacapuru."

Now it’s time to reshape time-related columns.

# Change date format
apc_clean$Date <- as.Date(apc_clean$Date, format = "%m/%d/%Y")
typeof(apc_clean$Date)
## [1] "double"
# Change & clean time format
apc_clean$Time <- gsub('c:', '', apc_clean$Time)
apc_clean$Time <- gsub('c', '', apc_clean$Time)
apc_clean$Time <- factor(as.character(substr(apc_clean$Time, 1, 2)))
typeof(apc_clean$Time)
## [1] "integer"
# New columns for month & year
apc_clean$Year = factor(format(as.Date(apc_clean$Date, format="%Y/%m/%d"),"%Y"))
apc_clean$Month = factor(format(as.Date(apc_clean$Date, format="%Y/%m/%d"),"%m"))

head(apc_clean$Year)
## [1] 1908 1912 1913 1913 1913 1915
## 106 Levels: 1908 1912 1913 1915 1916 1917 1918 1919 1920 1921 1922 ... 2017
head(apc_clean$Month)
## [1] 09 07 08 09 10 03
## Levels: 01 02 03 04 05 06 07 08 09 10 11 12

We can collect state & city information from “Location” column.

# We add new "State" & "City" column
apc_clean$State <- sapply(apc_clean$Location, as.character)
apc_clean$City <- sapply(apc_clean$Location, as.character)

# Seperate State & City
apc_clean$State <- str_trim(gsub(".*,", "", apc_clean$State))
apc_clean$City <- str_trim(gsub(",.*", "", apc_clean$City))

head(apc_clean$State)
## [1] "Virginia"           "New Jersey"         "Canada"            
## [4] "Over the North Sea" "Germany"            "Belgium"
head(apc_clean$City)
## [1] "Fort Myer"          "AtlantiCity"        "Victoria"          
## [4] "Over the North Sea" "Near Johannisthal"  "Tienen"
# Delete unnecessary "Location" column
apc_clean$Location <- NULL

Very nice. Now we are going to label flights as “Civilian” or “Military”. If “IsMilitary” equals to 1, that means that flight operator is a Military institution, otherwise it is a Civilian flight.

military_keywords <- c("Military", "Army", "Navy")
apc_clean$IsMilitary <- ifelse(grepl(paste(military_keywords, collapse = "|"), apc_clean$Operator),1,0)

# Number of Military Flights
sum(apc_clean$IsMilitary)
## [1] 814

Wonderful. Let’s find out how many passengers survived for each crash.

apc_clean$Survived <- apc_clean$Aboard - apc_clean$Fatalities
head(apc_clean$Survived)
## [1]  1  0  0  6  0 20

We can determine Source, Destination and number of stops from “Route” column.

apc_clean$Source <- gsub(" -.*", "", apc_clean$Route)
apc_clean$Destination <- gsub(".* -", "", apc_clean$Route)
apc_clean$Stops <- str_count(apc_clean$Route,"-")-1

# Number of flights with no Stops
length(which(apc_clean$Stops==0))
## [1] 3262
# Number of flights with no Route information
length(which(apc_clean$Stops<0))
## [1] 1954
# Number of flights with Stops
length(which(apc_clean$Stops>0))
## [1] 327

Lastly, we reordered the columns in apc_clean data frame.

apc_clean <- apc_clean[,c(1,2,9,10,11,12,3,13,4,15,16,17,5,6,7,14,8)]

# Latest structure of apc_clean
str(apc_clean)
## 'data.frame':    5543 obs. of  17 variables:
##  $ Date       : Date, format: "1908-09-17" "1912-07-12" ...
##  $ Time       : Factor w/ 32 levels "","00","01","02",..: 21 8 1 22 14 3 19 1 3 29 ...
##  $ Year       : Factor w/ 106 levels "1908","1912",..: 1 2 3 3 3 4 4 5 5 5 ...
##  $ Month      : Factor w/ 12 levels "01","02","03",..: 9 7 8 9 10 3 9 7 9 10 ...
##  $ State      : chr  "Virginia" "New Jersey" "Canada" "Over the North Sea" ...
##  $ City       : chr  "Fort Myer" "AtlantiCity" "Victoria" "Over the North Sea" ...
##  $ Operator   : chr  "Military - U.S. Army" "Military - U.S. Navy" "Private" "Military - German Navy" ...
##  $ IsMilitary : num  1 1 0 1 1 1 1 1 1 1 ...
##  $ Route      : Factor w/ 3470 levels "","Abaco - Miami",..: 830 2982 1 1 1 1 1 1 1 1 ...
##  $ Source     : chr  "Demonstration" "Test flight" "" "" ...
##  $ Destination: chr  "Demonstration" "Test flight" "" "" ...
##  $ Stops      : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ Type       : Factor w/ 2614 levels "","AAC-1 Toucan",..: 2419 1145 1024 2434 2437 2446 2433 2148 2439 2438 ...
##  $ Aboard     : int  2 5 1 20 30 41 19 20 22 19 ...
##  $ Fatalities : int  1 5 1 14 30 21 19 20 22 19 ...
##  $ Survived   : int  1 0 0 6 0 20 0 0 0 0 ...
##  $ Summary    : chr  "During a demonstration flight, a U.S. Army flyer flown by Orville Wright nose-dived into the ground from a heig"| __truncated__ "First U.S. dirigible Akron exploded just offshore at an altitude of 1,000 ft. during a test flight." "The first fatal airplane accident in Canada occurred when American barnstormer, John M. Bryant, California aviator was killed." "The airship flew into a thunderstorm and encountered a severe downdraft crashing 20 miles north of Helgoland Is"| __truncated__ ...

You can find references for preprocessing from here.

Exploratory Analytics of APC

Explaining the Dataset & Variables

Dataset: Airplane crashes from 1908 to 2017. Contains 17 variables, with 5.543 observations.

Here are the variables in our Airplane Crashes dataset and their explanations:

  • Date: The date of airplane crash.
  • Time: The time of airplane crash in hh format.
  • Year: The year of airplane crash.
  • Month: The month of airplane crash.
  • State: This field shows the country in which airplane crash happened.
  • City: This field shows the city in which airplane crash occured.
  • Operator: This field contains the airline information.
  • IsMilitary: This field shows whether an operator is a military institution or civilian. It conveys binary results (1=Military, 0=Civilian).
  • Route: This field contains the departure, arrival and transfer location of the flight. There are both direct and connecting flights.
  • Source: This field shows the city where the airplane took-off.
  • Destination: This field shows the destination city.
  • Stops: This field demonstrates the number of transfer cities for connecting flights.
  • Type: This field contains the information of the airplane type.
  • Aboard: This field shows the number of passengers at the time of departure.
  • Fatalities: This field shows the number of passengers died after airplane crash.
  • Survived: This field shows the number of passengers survived after airplane crash.
  • Summary: This field contains information about the airplane crash and possible reasons.

Objectives

  • Understanding statistics and characteristics of past airplane crashes

  • Visualizing results of exploratory analysis

  • Demonstrating geolocations of airplane crashes on a static map

  • Developing and demonstrating R & RMarkdown skills of R_Junkies

Questions

We studied most of the kernels on Kaggle. We hope that at the end of our research, we can share our findings on Kaggle. :)

  • Question 1: How many passengers died in airplane crashes since 1908?
  • Question 2: Which airplanes have most records of crashes so far?
  • Question 3: How many accidents occured, compared to number of departures over years?
  • Question 4: Which civil operators have the most records of accidents and fatalities?
  • Question 5: What is the crash history of Turkish Airlines?
  • Question 6: Have airplane flights gotten more secure over time?
  • Question 7: Has survival rate of a single airplane crash increased over time?
  • Question 8: What are the causes of airplane crashes?
  • Question 9: In which countries the most airplane accidents occured?

EDA

Total Fatalities Over Years

Let’s find out how many passengers died over years.

total_fatalities <- apc_clean %>% 
  filter(IsMilitary==0) %>% 
  group_by(Year) %>%
  summarize(total_fat=sum(Fatalities)) %>%
  mutate(total_fat = ifelse(is.na(total_fat), 0, total_fat))


ggplot(total_fatalities, aes(x=Year,y=total_fat,group = 1)) +
  geom_line(aes(y=total_fat), size=0.5, alpha=1, colour="#7a1c1c") +
  geom_area(fill = "lightblue") +
  ggtitle("Fatalities of Civil Flights Over Years") +
  geom_text(aes(label = paste(as.character(round(total_fat/1000,1)), "K")), check_overlap = TRUE, size=2.7, colour = "black", angle=30,nudge_y = -40) +
  scale_x_discrete(breaks = levels(total_fatalities$Year)[c(T, rep(F, 5))]) +
  theme (axis.text.x=element_text (angle=60,vjust=1, hjust=1)) +
  theme(plot.title = element_text(size=11),
         axis.text=element_text(size=8)) +
  labs(x = "Years", y = "Fatalities")

As we can see, until 90’s the trend increases, but later it seems to decline: Less passengers died in airplane crashes.

Top 10 Airplane Types & Crashes

top_ten_plane <- apc_clean %>%
                  group_by(Type) %>%
                  summarise(Freq = n()) %>%
                  arrange(desc(Freq)) %>% 
                  top_n(10)
## Selecting by Freq
top_ten_plane$Type <- substr(top_ten_plane$Type,1,19)

top_ten_plane
## # A tibble: 10 x 2
##    Type                 Freq
##    <chr>               <int>
##  1 Douglas DC-3          334
##  2 de Havilland Canada    86
##  3 Douglas C-47A          74
##  4 Douglas C-47           62
##  5 Douglas DC-4           40
##  6 Yakovlev YAK-40        37
##  7 Antonov AN-26          36
##  8 Junkers JU-52/3m       32
##  9 Douglas C-47B          29
## 10 De Havilland DH-4      28
top_ten_mil_plane <- apc_clean %>%
                  filter(IsMilitary==1) %>%
                  group_by(Type) %>%
                  summarise(Freq = n()) %>%
                  arrange(desc(Freq)) %>% 
                  top_n(10)
## Selecting by Freq
# Slicing long strings
top_ten_mil_plane$Type <- substr(top_ten_mil_plane$Type,1,19)

top_ten_mil_plane
## # A tibble: 10 x 2
##    Type                 Freq
##    <chr>               <int>
##  1 Douglas C-47           27
##  2 Lockheed C-130H        16
##  3 Antonov AN-26          15
##  4 Boeing KC-135A         15
##  5 Douglas C-47B          14
##  6 Lockheed C-130E Her    14
##  7 Lockheed C-130H Her    13
##  8 Douglas C-47A          12
##  9 Antonov AN-12          10
## 10 Mil Mi-8 (helicopte     9
top_ten_civ_plane <- apc_clean %>%
                  filter(IsMilitary==0) %>%
                  group_by(Type) %>%
                  summarise(Freq = n()) %>%
                  arrange(desc(Freq)) %>% 
                  top_n(10)
## Selecting by Freq
# Slicing long strings
top_ten_civ_plane$Type <- substr(top_ten_civ_plane$Type,1,19)

top_ten_civ_plane
## # A tibble: 10 x 2
##    Type                 Freq
##    <chr>               <int>
##  1 Douglas DC-3          333
##  2 de Havilland Canada    82
##  3 Douglas C-47A          62
##  4 Douglas DC-4           39
##  5 Yakovlev YAK-40        36
##  6 Douglas C-47           35
##  7 Junkers JU-52/3m       32
##  8 De Havilland DH-4      28
##  9 Cessna 208B Grand C    27
## 10 Douglas DC-6B          26
ggplot(top_ten_plane, aes(x=reorder(Type, -Freq), y=Freq)) +
  geom_bar(stat = "identity", fill = "#2780E3") +
  labs(title="Top 10 Airplanes by Crash Count",x="Airplanes",y="Frequency",fill="") +
  theme (axis.text.x=element_text (angle=60,vjust=1, hjust=1))

top_ten_mil_plane_plot <- ggplot(top_ten_mil_plane, aes(x=reorder(Type, -Freq), y=Freq)) +
                            geom_bar(stat = "identity", fill = "#2780E3") +
                            labs(title="Top 10 Military Airplanes by Crash Count",x="Airplanes",y="Frequency",fill="") +
                            theme (axis.text.x=element_text (angle=60,vjust=1, hjust=1)) +
                            theme(plot.title = element_text(size=11),
                                  axis.text=element_text(size=8))


top_ten_civ_plane_plot <- ggplot(top_ten_civ_plane, aes(x=reorder(Type, -Freq), y=Freq)) +
                            geom_bar(stat = "identity", fill = "#2780E3") +
                            labs(title="Top 10 Civilian Airplanes by Crash Count",x="Airplanes",y="Frequency",fill="") +
                            theme (axis.text.x=element_text (angle=60,vjust=1, hjust=1)) +
                            theme(plot.title = element_text(size=11),
                                  axis.text=element_text(size=8))

grid.arrange(top_ten_mil_plane_plot, top_ten_civ_plane_plot, ncol=2)

Comparing # of Accidents & # of Departures

# Total departures in Million

total_dprt <- 
  air_dprt %>%
  filter(Time >= 1970 & Time <= 2016) %>%
  mutate(dep_val = ifelse(is.na(Value), 0, Value)) %>% 
  group_by(Time) %>% summarize(total_dep=sum(dep_val)/1000000)

# # of Accidents and Fatalities

Acc_fat_data <- apc_clean %>% filter(as.numeric(as.character(Year)) >= 1970 & as.numeric(as.character(Year)) <= 2016) %>%  group_by(Year)%>% summarise(n=sum(ifelse(is.na(Aboard), 0, Aboard)),f=sum(ifelse(is.na(Fatalities), 0, Fatalities)))

# Combining Departures and Accident-Fatality Data

dep_acc_fat <- data.frame(year=total_dprt$Time,
                Departure = total_dprt$total_dep,
                Accident = Acc_fat_data$n,
                Fatality = Acc_fat_data$f,
                Rate = Acc_fat_data$n/total_dprt$total_dep)

# Normalizer for plot

normalizer <- max(dep_acc_fat$Accident)/max(dep_acc_fat$Departure)

ggplot(dep_acc_fat, aes(y=Accident/normalizer, x=year)) +  
  geom_col(aes(y = Departure), fill = "#2780E3") + 
  geom_line(size=1, alpha=1, colour = "#7a1c1c") +
  ggtitle("Departures [Bar] & Accidents [Line]") +
  scale_y_continuous(sec.axis = sec_axis(trans= ~.*normalizer, name = 'Departures')) +
  theme(axis.text.y = element_text(colour="black", size=12), 
    axis.text.x=element_text(angle=60, vjust=1, hjust=1, size=8),
    axis.title=element_text(colour="black", size=12),
    legend.position = "left" ) +
  labs(y="Accidents", x="Year")

Top Civil Operators’ Crashes & Fatalities Over Years

top_six_civ_op_crash <- apc_clean %>%
                        filter(IsMilitary==0) %>%
                        group_by(Operator) %>%
                        summarise(SumCrashes= n()) %>%
                        arrange(desc(SumCrashes))  %>%
                        top_n(6)
## Selecting by SumCrashes
top_six_civ_op_fat <- apc_clean %>%
                        filter(IsMilitary==0) %>%
                        group_by(Operator) %>%
                        summarise(SumFats= sum(Fatalities)) %>%
                        arrange(desc(SumFats))  %>%
                        top_n(6)
## Selecting by SumFats
top_six_civ_op_crash_oy <- apc_clean %>%
                            filter(Operator %in% top_six_civ_op_crash$Operator)  %>%
                            group_by(Operator, Year) %>%
                            summarise(SumFatalities= sum(Fatalities)) %>%
                            arrange(desc(Year), desc(SumFatalities))

top_six_civ_op_fat_oy <- apc_clean %>%
                            filter(Operator %in% top_six_civ_op_fat$Operator)  %>%
                            group_by(Operator, Year) %>%
                            summarise(SumFatalities= sum(Fatalities)) %>%
                            arrange(desc(Year), desc(SumFatalities))

top_six_civ_op_crash 
## # A tibble: 6 x 2
##   Operator                            SumCrashes
##   <chr>                                    <int>
## 1 Aeroflot                                   179
## 2 Air France                                  70
## 3 Deutsche Lufthansa                          65
## 4 Air Taxi                                    44
## 5 China National Aviation Corporation         44
## 6 United Air Lines                            44
top_six_civ_op_fat
## # A tibble: 6 x 2
##   Operator                   SumFats
##   <chr>                        <int>
## 1 Aeroflot                      7156
## 2 American Airlines             1421
## 3 Pan American World Airways    1302
## 4 United Air Lines              1021
## 5 AVIANCA                        941
## 6 Turkish Airlines (THY)         891
top_six_civ_op_crash_oy
## # A tibble: 157 x 3
## # Groups:   Operator [6]
##    Operator         Year  SumFatalities
##    <chr>            <fct>         <int>
##  1 Air France       2009            228
##  2 Aeroflot         2008             88
##  3 Air France       2005              0
##  4 Air Taxi         2004              6
##  5 Air Taxi         2002              4
##  6 United Air Lines 2001            109
##  7 Air Taxi         2001              6
##  8 Air France       2000            109
##  9 Air Taxi         2000              3
## 10 Air France       1998             53
## # ... with 147 more rows
top_six_civ_op_fat_oy
## # A tibble: 151 x 3
## # Groups:   Operator [6]
##    Operator               Year  SumFatalities
##    <chr>                  <fct>         <int>
##  1 Aeroflot               2008             88
##  2 Turkish Airlines (THY) 2003             75
##  3 American Airlines      2001            416
##  4 United Air Lines       2001            109
##  5 American Airlines      2000              1
##  6 American Airlines      1999             11
##  7 Turkish Airlines (THY) 1999              6
##  8 United Air Lines       1997              1
##  9 Aeroflot               1996              2
## 10 American Airlines      1995            160
## # ... with 141 more rows
ggplot(top_six_civ_op_crash_oy, aes(x=Year)) +
  geom_line(aes(y=SumFatalities, group = top_six_civ_op_crash_oy$Operator, colour=top_six_civ_op_crash_oy$Operator), size=0.5, alpha=1) +
  ggtitle("Fatalities of Top 6 Operators with Most Crashes Over Years") +
  scale_x_discrete(breaks = levels(top_six_civ_op_crash_oy$Year)[c(T, rep(F, 3))]) +
  theme (axis.text.x=element_text (angle=60,vjust=1, hjust=1)) +
  theme(plot.title = element_text(size=11),
         axis.text=element_text(size=8),
        legend.position="bottom") +
  labs(x = "Years", y = "Fatalities", colour = "Operators")

ggplot(top_six_civ_op_fat_oy, aes(x=Year)) +
  geom_line(aes(y=SumFatalities, group = top_six_civ_op_fat_oy$Operator, colour=top_six_civ_op_fat_oy$Operator), size=0.5, alpha=1) +
  ggtitle("Fatalities of Top 6 Operators with Most Fatalities Over Years") +
  scale_x_discrete(breaks = levels(top_six_civ_op_fat_oy$Year)[c(T, rep(F, 3))]) +
  theme (axis.text.x=element_text (angle=60,vjust=1, hjust=1)) +
  theme(plot.title = element_text(size=11),
         axis.text=element_text(size=8),
        legend.position="bottom") +
  labs(x = "Years", y = "Fatalities", colour = "Operators")


Update

One of the feedbacks we had is that the line graphs “Fatalities of Top 6 Operators with Most Crashes/Fatalities Over Years” are not so easy to interpret. For that reason, we decided to convert them to a new heat map and show fatalities of top 10 civil operators with most fatalities over decades. We will generate a new data frame from our clean data using dplyr, and visualise it with ggplot2.

Now we generate a new data frame, ‘hm_df’ from apc_clean. hm_df will contain Operator, YearRange, Fatalities. And we need complete cases for visualization.

hm_df <- apc_clean %>%
          select(Operator, Year, Fatalities, IsMilitary) %>%
          filter(IsMilitary == 0) %>%
          select(Operator, Year, Fatalities) %>%
          filter(Operator != "")

hm_df <- hm_df[complete.cases(hm_df), ]

Now we need to generate a new column, ‘YearRange’ in order to find decades. We will use group_by.

hm_df <- hm_df %>%
    mutate(Year = as.numeric(as.character(Year))) %>%
    mutate(YearRange = paste( as.character( Year - (Year %% 10)),
                              as.character(Year - (Year %% 10) + 9),
                              sep = " - ")) %>%
    select(-Year) %>%
    group_by(Operator,YearRange) %>%
    summarise(Fatalities = sum(Fatalities)) %>%
    filter(Fatalities > 0) %>%
    arrange(Operator, YearRange)

We want to show only top 10 operators with most fatalities. So we are going to find them. We will name it as civ_op_topten_fat.

civ_op_topten_fat <- apc_clean %>%
                      filter(IsMilitary == 0) %>%
                      select(Operator, Fatalities) %>%
                      group_by(Operator) %>%
                      summarise(Fatalities = sum(Fatalities)) %>%
                      arrange(desc(Fatalities)) %>%
                      slice(1:10) %>%
                      select(Operator)

Beautiful. Let’s join hm_df with civ_op_topten_fat!

hm_df <- hm_df %>% inner_join(., civ_op_topten_fat, by='Operator')
head(hm_df)
## # A tibble: 6 x 3
## # Groups:   Operator [1]
##   Operator YearRange   Fatalities
##   <chr>    <chr>            <int>
## 1 Aeroflot 1940 - 1949         24
## 2 Aeroflot 1950 - 1959        414
## 3 Aeroflot 1960 - 1969        951
## 4 Aeroflot 1970 - 1979       3512
## 5 Aeroflot 1980 - 1989       1814
## 6 Aeroflot 1990 - 1999        353

Since our data frame is ready for visualization, we can start using ggplot2.

ggplot(hm_df, aes(YearRange, Operator)) +
  theme_classic() +
  geom_tile(aes(fill = Fatalities), colour = "black") +
  scale_fill_gradient(low = "#d0dee8", high = "#152a3a") +
  theme(panel.background = element_rect(fill = "#e3e9ed", colour = "black")) +
  labs(title='Fatalities of Top 10 Civil Operators Over Decades [With Aeroflot]', x='Decades', y='Operators') +
  theme(plot.title = element_text(hjust = 0.0, size=11)) +
  theme(axis.text.x=element_text(angle = 45, hjust = 1, size = 8))

Not so good! The variation between fatalities can’t be observed. When we searched for a reason, we figured out that there is an outlier: Aeroflot! In 70s they got more than 3K fatalities! It is very sad story for Aeroflot. Anyways, we decided to remove Aeroflot from visualization and generated a new heatmap.

hm_df <- hm_df %>% filter(Operator != 'Aeroflot')

ggplot(hm_df, aes(YearRange, Operator)) +
  theme_classic() +
  geom_tile(aes(fill = Fatalities), colour = "black") +
  scale_fill_gradient(low = "#d0dee8", high = "#152a3a") +
  theme(panel.background = element_rect(fill = "#e3e9ed", colour = "black")) +
  labs(title='Fatalities of Top 10 Civil Operators Over Decades [Without Aeroflot]', x='Decades', y='Operators') +
  theme(plot.title = element_text(hjust = 0.0, size=11)) +
  theme(axis.text.x=element_text(angle = 45, hjust = 1, size = 8))

The heatmap is now better than the line graph we used early. As we have discussed in class, between 70s and 90s the world experienced lots of airplane crashes with lots of fatalities! But now this trend reverted to a decline. (Don’t overlook THY in 70s! There is a terrible crash in 1974.)

Crash History of Turkish Airlines

We wondered Turkish Airlines’ accidents and fatalities over time.

thy <- apc_clean %>%
        filter(grepl("Turkish",Operator) & IsMilitary ==0) %>%
        group_by(Operator, Year, Month) %>%
        summarize(TotalAboard = sum(Aboard), TotalFatalities = sum(Fatalities)) %>%
        arrange(Year, Month) %>%
        mutate(MonthName = month.abb[Month]) %>%
        select(-Month)

p <- plot_ly(thy, x = ~Year, y = ~TotalAboard, type = 'scatter', mode = 'markers', size = ~TotalFatalities, color = ~MonthName, colors = 'Paired',
        marker = list(opacity = 1, sizemode = 'diameter')) %>%
  layout(title = 'Turkish Airlines Stats Over Years',
         xaxis = list(showgrid = FALSE),
         yaxis = list(showgrid = FALSE),
         showlegend = TRUE)

p

Fatalities Over Total Passengers By Years

The ratio of fatalities over total passengers is important. We want to see the relation between passengers carried and passengers died over the time.

Total_Passenger_Byyears <- air_psgr %>% 
                   filter(!is.na(Value)) %>% 
                   group_by(Time) %>% 
                   summarize(Value=sum(Value)) %>%
                  mutate(Year = as.character(Time),Passengers = Value) %>%
                  select(-Time,-Value)

Total_Fatalities_Byyears <- apc_clean %>% 
            filter(IsMilitary==0) %>% 
            group_by(Year) %>%
            summarize(TotalFatalities = sum(Fatalities)) %>%
            mutate(TotalFatalities = ifelse(is.na(TotalFatalities), 0, TotalFatalities))

Total_Fatalities_Byyears$Year = as.character(Total_Fatalities_Byyears$Year)


death <- inner_join(Total_Passenger_Byyears,Total_Fatalities_Byyears, by = "Year")

death <-death %>%
            mutate(DRate = TotalFatalities/Passengers*1000000)


ggplot(death, aes(x=Year,y=DRate,group = 1)) +
  geom_line(aes(y=DRate), size=0.5, alpha=1, colour="#7a1c1c") +
  ggtitle("Fatalities Over Total Passengers By Years") +
  scale_x_discrete(breaks = levels(total_fatalities$Year)[c(T, rep(F, 5))]) +
  theme (axis.text.x=element_text (angle=60,vjust=1, hjust=1)) +
  theme(plot.title = element_text(size=11),
         axis.text=element_text(size=8)) +
  labs(x = "Years", y = "Rate")

It is very clear that more passengers were carried, but less passengers died over years. It’s a good sign of increase in flight security.

Has Survival Rate Increased Over Time?

# Survival Rate
dataSurvival <- apc_clean %>% 
   filter(as.numeric(as.character(Year)) >= 1900 & Aboard > 0) %>%
   group_by(Year) %>%
   summarize(TotalFatalities=sum(Fatalities), 
             TotalAboard=sum(Aboard),
             SurvivalRate=round(100*(TotalAboard-TotalFatalities)/TotalAboard,2),
             Survival = (TotalAboard-TotalFatalities),
             TotalAccident = n())
 
# Normalizer for dual-y axis
n <- max(dataSurvival$TotalAccident) /max(dataSurvival$SurvivalRate)
 
ggplot(dataSurvival, aes(y=SurvivalRate, x=Year)) +  
   geom_col(aes(y=TotalAccident/n), fill = "#2780E3") +
   geom_line(aes(group = 1), size=0.7, alpha=1, colour = "#7a1c1c") +
   geom_smooth(aes(group = 1), colour = "black") +
   scale_y_continuous(sec.axis = sec_axis(trans= ~.*n, name = 'Total Accidents')) + 
   scale_x_discrete(breaks = levels(dataSurvival$Year)[c(T, rep(F, 3))]) +
   
   theme(axis.text.x=element_text(angle=60, vjust=1, hjust=1, size=8))
## `geom_smooth()` using method = 'loess'

We observed that “Survival Rate” was very low in early 1900’s and increased over time. The trend reverts after beginning of millenium.

Possible Causes of Crashes - Word Cloud

For each accident, there is an explanation in “Summary” column. By looking at this summary information, we tried to visuailize a word cloud.

 text <- apc_clean$Summary
 
 docs <- Corpus(VectorSource(text))
 
 toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
 docs <- tm_map(docs, toSpace, "/")
 docs <- tm_map(docs, toSpace, "@")
 docs <- tm_map(docs, toSpace, "\\|")
 
 docs <- tm_map(docs, PlainTextDocument)
 
 # Convert text to lower case
 docs <- tm_map(docs, content_transformer(tolower))
 
 # Remove numbers
 docs <- tm_map(docs, removeNumbers)
 
 # Remove english common stopwords
 docs <- tm_map(docs, removeWords, stopwords("english"))
 
 # Remove punctuations
 docs <- tm_map(docs, removePunctuation)
 
 # Eliminate extra white spaces
 docs <- tm_map(docs, stripWhitespace)
 
 # Text stemming
 docs <- tm_map(docs, stemDocument)
 
 # Remove possible words that high frequency in text that no meaning for crash reason
 # (After first try in word cloud, we realize that these words are the most frequent one)
 docs <- tm_map(docs, removeWords, c("aircraft", "plane","flight", "crash","crashed")) 
 
 dtm <- TermDocumentMatrix(docs)
 
 ##Remove sparse terms
 removeSparseTerms(dtm, 0.95)
## <<TermDocumentMatrix (terms: 38, documents: 5543)>>
## Non-/sparse entries: 20279/190355
## Sparsity           : 90%
## Maximal term length: 8
## Weighting          : term frequency (tf)
 m <- as.matrix(dtm)
 v <- sort(rowSums(m),decreasing=TRUE)
 d <- data.frame(word = names(v),freq=v)
 #head(d, 10)
 
 set.seed(1234)
 wordcloud(words = d$word, freq = d$freq, min.freq = 1,
           max.words=200, random.order=FALSE, rot.per=0.35, 
           colors=brewer.pal(8, "Dark2"))

Most Frequent 10 Words

Here are the most frequent 10 words, which may show the reason why the accident happened.

#Top 10 Word frequencies
ggplot(d[1:10,], aes(x=reorder(word, -freq), y=freq)) +
  geom_bar(stat = "identity", fill = "#2780E3") +
  labs(title="Top 10 Word frequencies",x="Words",y="Frequency",fill="") +
  theme (axis.text.x=element_text (angle=60,vjust=1, hjust=1))

#How words associated with others. so we can understand causes 
i <- 1

while (i < 6) {
  print (findAssocs(dtm, terms = as.character(d[i,]$word), corlimit = 0.3))
  i = i+1
}
## $pilot
## numeric(0)
## 
## $land
## attempt    gear 
##    0.49    0.36 
## 
## $engin
## failur  power 
##   0.34   0.31 
## 
## $approach
## final 
##  0.33 
## 
## $runway
##     end overran 
##    0.31    0.30

Where do accidents occur?

We prepare a world map that shows dangerous countries for flights.

# Crashes
apc_civilcrashes <- apc_clean %>%
                    select(State) %>%
                    filter(apc_clean$IsMilitary==0) %>%
                    group_by(State) %>%
                    summarise(Totalcrash=n())

apc_civilcrashes$State <- ifelse(apc_civilcrashes$State %in% state.name,"United States",apc_civilcrashes$State)

apc_civilcrashes <- apc_civilcrashes %>%
                    group_by(State) %>%
                    summarise(Totalcrash=sum(Totalcrash))

sPDF <- joinCountryData2Map(apc_civilcrashes, joinCode="NAME", nameJoinColumn="State")
## 175 codes from your data successfully matched countries in the map
## 226 codes from your data failed to match with a country code in the map
## 71 codes from the map weren't represented in your data
mapParams<-mapCountryData(sPDF
                          , nameColumnToPlot = "Totalcrash"
                          , mapTitle= "Most Dangerous Countries by Crashes"
                          , colourPalette=brewer.pal(9, "YlOrRd")
                          , missingCountryCol = "white"
                          , oceanCol = "lightsteelblue2"
                          , numCats=9
                          , catMethod="fixedWidth"
)
do.call(addMapLegend, c(mapParams
                        , legendLabels="all"
                        , legendIntervals="page"
))

# Fatalities
apc_civilfat <- apc_clean %>%
                  select(State,Fatalities) %>%
                  filter(apc_clean$IsMilitary==0) %>%
                  group_by(State) %>%
                  summarise(TotalFat=sum(Fatalities))

apc_civilfat$State <- ifelse(apc_civilfat$State %in% state.name,"United States",apc_civilfat$State)

apc_civilfat <- apc_civilfat %>%
                group_by(State) %>%
                summarise(TotalFat=sum(TotalFat))

sPDF <- joinCountryData2Map(apc_civilfat, joinCode="NAME", nameJoinColumn="State")
## 175 codes from your data successfully matched countries in the map
## 226 codes from your data failed to match with a country code in the map
## 71 codes from the map weren't represented in your data
mapParams<-mapCountryData(sPDF
                          , nameColumnToPlot = "TotalFat"
                          , mapTitle= "Most Dangerous Countries by Fatalities"
                          , colourPalette=brewer.pal(9, "YlOrRd")
                          , missingCountryCol = "white"
                          , oceanCol = "lightsteelblue2"
                          , numCats=9
                          , catMethod="fixedWidth"
)
do.call(addMapLegend, c(mapParams
                        , legendLabels="all"
                        , legendIntervals="page"
))

USA, Russia and Brazil seem to be the most dangerous places not to travel via plane!

Conclusion

We tried to answer 9 questions prepared for airplane crashes dataset. It seems that there are less airplane crashes nowadays. We hope that they will never happen in near & far future. :)

You are welcome to ask any questions or give suggestions. Thanks to Berk Orbay for his contributions and his instant replies. He enabled us to develop our coding skills.

Future Research Suggestions

  • More cleaning of “Route”, “Source” & “Destination” variables, using advanced string manipulation, since there are some edge cases
  • More cleaning of “State” variable, since we can’t show all crashes on world map
  • Adding “Ground” variable (Non-passenger people died on the ground) to analysis
  • Dynamic Map of Airplane Crashes, using Shiny
  • Most dangerous routes as lines on world-map
  • Predicting crash possibility of a flight using Machine Learning tools

References

Data Preprocessing & Cleaning

We understood how to differentiate whether an operator is Military institution or a Civilian operator from adhok93’s kernel.

We got a lot of help from danielviray’s kernel in order to accomplish data preprocessing, but we used a different approach and generated our own columns.