This project, by R_Junkies, is about past airplane crashes, from 1908 to 2017. The project is prepared specifically for BDA 503, 2017, MEF University; course given by Berk Orbay.
We call ourselves: R_Junkies. It represents our humble interest in Data Science. It’s a life-style.
Group members are below (Ladies first):
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
Here is the list of datasets and sources we will be using in this project:
Our dataset taken from Kaggle contains crashes from 1908 to mid 2009. In order to analyze data more accurately, we crawled dataset from http://www.planecrashinfo.com/. With the following Python code, we created a new file contains only data from 2009 to 2017.
from bs4 import BeautifulSoup
import re
from urllib import request
import datetime
import csv
# Request URL and return response as BeautifulSoup object
def makeBeautifulSoupObject(url):
requestConn = request.urlopen(url)
responseHTML = requestConn.read()
requestConn.close()
soup = BeautifulSoup(responseHTML, "lxml")
soup.decode(eventual_encoding="UTF-8")
return soup
def parseHTML(table_):
record = {}
table = BeautifulSoup(str(table_[0]), 'html.parser')
for tr in table.find_all("tr")[1:]:
tds = tr.find_all("td")
# encoding the 'value' string to utf-8 and removing any non-breaking space (HTML Character)
tmp_str = tds[1].string # .string.encode('utf-8').replace(" ", "")
value = str(tmp_str) # this is the value- In Column #2 of the HTML table
key = tds[0].string # this is the key- In Column #1 of the HTML table
# print(tds[0])
if key == "Date:":
dat = str(value).replace(',', '')
date = datetime.datetime.strptime(dat, '%B %d %Y')
record["date"] = date
elif key == "Time:":
if not value == '?':
time = re.sub("[^0-9]", "", value)
record["time"] = time[0:2] + ":" + time[2:4]
else:
record["time"] = ''
elif key == "Location:":
if not value == '?':
record["loc"] = str(value)
else:
record["loc"] = ''
elif key == "Operator:":
if not value == '?':
record["op"] = str(value)
else:
record["op"] = ''
elif key == "Flight #:":
if not value == '?':
record["flight"] = str(value)
else:
record["flight"] = ''
elif key == "Route:":
if not value == '?':
record["route"] = str(value)
else:
record["route"] = ''
elif key == "Registration:":
if not value == '?':
record["reg"] = str(value).encode("utf-8")
else:
record["reg"] = ''
elif key == "cn / ln:":
if not value == '?':
record["cnln"] = str(value)
else:
record["cnln"] = ''
elif key == "Aboard:":
if not value == '?':
s = ' '.join(value.split())
aboard_ = s.replace('(', '').replace(')', '').split(' ')
if aboard_[0] != '?':
record["aboard_total"] = aboard_[0]
else:
record["aboard_total"] = 'NULL'
passengers = aboard_[1].replace("passengers:", "")
if passengers != '?':
record["aboard_passengers"] = passengers
else:
record["aboard_passengers"] = 'NULL'
crew = aboard_[2].replace("crew:", "")
if crew != '?':
record["aboard_crew"] = crew
else:
record["aboard_crew"] = 'NULL'
else:
record["aboard_total"] = 'NULL'
record["aboard_passengers"] = 'NULL'
record["aboard_crew"] = 'NULL'
elif key == "Fatalities:":
if not value == '?':
s = ' '.join(value.split())
fatalities_ = s.replace('(', '').replace(')', '').split(' ')
if fatalities_[0] != '?':
record["fatalities_total"] = fatalities_[0]
else:
record["fatalities_total"] = 'NULL'
passengers = fatalities_[1].replace("passengers:", "")
if passengers != '?':
record["fatalities_passengers"] = passengers
else:
record["fatalities_passengers"] = 'NULL'
crew = fatalities_[2].replace("crew:", "")
if crew != '?':
record["fatalities_crew"] = crew
else:
record["fatalities_crew"] = 'NULL'
else:
record["aboard_total"] = 'NULL'
record["aboard_passengers"] = 'NULL'
record["aboard_crew"] = 'NULL'
elif key == "Ground:":
if not value == '?':
record["ground"] = str(value)
else:
record["ground"] = 'NULL'
elif key == "Summary:":
if not value == '?':
record["summary"] = str(value).replace('\n','')
else:
record["summary"] = ''
elif key == "AC Type:":
if not value == '?':
record["actype"] = str(value)
else:
record["actype"] = ''
else:
st1 = ''.join(tds[0].string.split()).lower()
st1 = st1.replace(':', '')
if not value == '?':
record[st1] = str(value)
else:
record[st1] = "NULL"
return record
### Main Program ###
# Root URL
rooturl = "http://www.planecrashinfo.com"
# From which year data is taken
start_year = 2009
# to which year data is taken
end_year = 2009
# Create a new file
f = csv.writer(open("Airplane_Crashes_and_Fatalities_Since_2009.csv", "w", newline=''), delimiter=',')
# Header info
f.writerow(["Date", "Time", "Location", "Operator", "Flight #", "Route", "Type", "Registration", "cn/In", "Aboard", "Fatalities", "Ground", "Summary"])
for i in range(start_year, end_year + 1, 1):
year_start = datetime.datetime.utcnow()
# appending the path (year) to the url hostname
# Sample page : http://www.planecrashinfo.com/2008/2008.htm
newurl = rooturl + "/" + str(i) + "/" + str(i) + ".htm"
#Get the main page for each year
soup = makeBeautifulSoupObject(newurl)
tables = soup.find_all('table')
#Each page contains the sumamry crash info
for table in tables:
# finding the no. of records for the given year
number_of_rows = len(table.findAll(lambda tag: tag.name == 'tr' and tag.findParent('table') == table))
number_of_rows = 3
for j in range(1, number_of_rows, 1):
# appending the row number to sub-path of the url, and building the final url that will be used for sending http request
#Sample : http://www.planecrashinfo.com/2008/2008-1.htm
accident_url = newurl.replace(".htm", "") + "-" + str(j) + ".htm"
web_record = makeBeautifulSoupObject(accident_url)
# removing all the boilerplate html code except the data table
table_details = web_record.find_all('table')
#Parse table and return crash record
crashRecord = parseHTML(table_details)
#Write crash record to file
f.writerow([crashRecord["date"].strftime('%m/%d/%Y'),
crashRecord["time"],
crashRecord["loc"],
crashRecord["op"],
crashRecord["flight"],
crashRecord["route"],
crashRecord["actype"],
crashRecord["reg"],
crashRecord["cnln"],
crashRecord["aboard_total"],
# record["aboard_passengers"],
# record["aboard_crew"],
crashRecord["fatalities_total"],
# record["fatalities_passengers"],
# record["fatalities_crew"],
crashRecord["ground"],
crashRecord["summary"]
])
# Original Kaggle file
apc_raw_to_2009 <- read.csv(file="https://raw.githubusercontent.com/MEF-BDA503/gpj-rjunkies/master/files/project_data/Airplane_Crashes_and_Fatalities_Since_1908.csv", header=TRUE, sep=",")
# Data we crawled from PlaneCrashInfo.com
apc_raw_from_2009 <- read.csv(file="https://raw.githubusercontent.com/MEF-BDA503/gpj-rjunkies/master/files/project_data/Airplane_Crashes_and_Fatalities_Since_2009.csv", header=TRUE, sep=",")
air_psgr <- read.csv(file="https://raw.githubusercontent.com/MEF-BDA503/gpj-rjunkies/master/files/project_data/is_air_psgr_melt.csv", header=TRUE, sep=",", quote="\"", na.strings="NA")
air_dprt <- read.csv(file="https://raw.githubusercontent.com/MEF-BDA503/gpj-rjunkies/master/files/project_data/is_air_dprt_melt.csv",header=TRUE, sep=",", quote="\"", na.strings="NA")
Here are the necessary libraries. We checked if necessary libraries are installed properly.
necessary_lib <- c("stringr","tidyverse","ggthemes","grid","gridExtra","scales","tm","SnowballC","wordcloud","RColorBrewer","plotly","rworldmap")
# Lets check these necessary libraries against already installed packages.
install_lib <- necessary_lib[!(necessary_lib %in% installed.packages())]
if(length(install_lib)!=0){
install.packages(install_lib)
}
Load necessary libraries
lapply(necessary_lib, require, character.only = TRUE)
Next, we merge the dataset from Kaggle and the dataset we crawled from PlaneCrashInfo.
Kaggle dataset contains duplicate values for 2009 crashes, so we omit them.
apc_raw_to_2009 <- apc_raw_to_2009 %>%
mutate(year=format(as.Date(apc_raw_to_2009$Date, format="%m/%d/%Y"),"%Y")) %>%
filter(as.numeric(year) < 2009) %>%
select(-year)
# Binding datasets
apc_raw <- rbind(apc_raw_to_2009, apc_raw_from_2009)
We first create a new data frame called “apc_clean” which we will manipulate for our analysis.
# New data frame
apc_clean <- apc_raw
# Dimensions
dim(apc_clean)
## [1] 5543 13
# Get an idea of apc_clean columns
str(apc_clean)
## 'data.frame': 5543 obs. of 13 variables:
## $ Date : Factor w/ 5014 levels "01/01/1966","01/01/1970",..: 3297 2372 2699 3184 3682 852 3099 2585 3387 3469 ...
## $ Time : Factor w/ 1035 levels "","00:00","00:01",..: 702 200 1 757 384 36 609 1 36 985 ...
## $ Location : Factor w/ 4530 levels "","100 miles SW of Kuujjuaq, Quebec, Canada",..: 849 123 4179 3388 2291 4058 3113 2285 283 3550 ...
## $ Operator : Factor w/ 2708 levels "","Aaxico Airlines",..: 1577 1590 1825 1466 1466 1466 1466 1465 1466 1466 ...
## $ Flight.. : Factor w/ 779 levels "","-","002","004",..: 1 1 2 1 1 1 1 1 1 1 ...
## $ Route : Factor w/ 3470 levels "","Abaco - Miami",..: 830 2982 1 1 1 1 1 1 1 1 ...
## $ Type : Factor w/ 2614 levels "","AAC-1 Toucan",..: 2419 1145 1024 2434 2437 2446 2433 2148 2439 2438 ...
## $ Registration: Factor w/ 5184 levels ""," / ","01 / 02 / 03",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ cn.In : Factor w/ 3913 levels "","0001","0001A",..: 170 1 1 1 1 1 1 1 1 1 ...
## $ Aboard : int 2 5 1 20 30 41 19 20 22 19 ...
## $ Fatalities : int 1 5 1 14 30 21 19 20 22 19 ...
## $ Ground : chr "0" "0" "0" "0" ...
## $ Summary : Factor w/ 4959 levels "","31-7952246The pilot reported that he had a rough running engine and was making an emergency landing at Charlo A"| __truncated__,..: 1544 1676 3566 3207 1820 1175 1637 1191 2228 2266 ...
There are columns which we can’t use for any analysis. These are: “Flight..”, “Registration”, “cn.In”, “Ground”
# Delete "FLight.." column
apc_clean$Flight.. <- NULL
# Delete "Registration" column
apc_clean$Registration <- NULL
# Delete "cn.In" column
apc_clean$cn.In <- NULL
# Delete "Ground" column
apc_clean$Ground <- NULL
Beautiful. We dropped unnecessary columns.
There are some non-ASCII characters in some cells. We should remove them.
nonutfeight <- apc_clean$Operator
nonutfeight[5315]
## [1] A\xe9ropro
## 2708 Levels: Aaxico Airlines Ababeel Aviaition ... Yemenia Airway
apc_clean$Operator <- gsub("[^\x20-\x7E]", "", nonutfeight)
apc_clean$Operator[5315]
## [1] "Aropro"
nonutfeight <- NULL
nonutfeight <- apc_clean$Summary
nonutfeight[5249]
## [1] The plane was being used as an air taxi to ferry passengers between cities when it crashed in the Manacapuru River, 50 miles from Manus. The plane took off under warning of strong winds and rain. The crew asked permission to return shortly after takeoff because they lost an engine.\xa0The plane disappeared from radar and lost contact with ATC.\xa0The plane crashed into the river when the crew tried to land the plane on an abandoned runway in the town of Manacapuru.
## 4959 Levels: ...
apc_clean$Summary <- gsub("[^\x20-\x7E]", "", nonutfeight)
apc_clean$Summary[5249]
## [1] "The plane was being used as an air taxi to ferry passengers between cities when it crashed in the Manacapuru River, 50 miles from Manus. The plane took off under warning of strong winds and rain. The crew asked permission to return shortly after takeoff because they lost an engine.The plane disappeared from radar and lost contact with ATC.The plane crashed into the river when the crew tried to land the plane on an abandoned runway in the town of Manacapuru."
Now it’s time to reshape time-related columns.
# Change date format
apc_clean$Date <- as.Date(apc_clean$Date, format = "%m/%d/%Y")
typeof(apc_clean$Date)
## [1] "double"
# Change & clean time format
apc_clean$Time <- gsub('c:', '', apc_clean$Time)
apc_clean$Time <- gsub('c', '', apc_clean$Time)
apc_clean$Time <- factor(as.character(substr(apc_clean$Time, 1, 2)))
typeof(apc_clean$Time)
## [1] "integer"
# New columns for month & year
apc_clean$Year = factor(format(as.Date(apc_clean$Date, format="%Y/%m/%d"),"%Y"))
apc_clean$Month = factor(format(as.Date(apc_clean$Date, format="%Y/%m/%d"),"%m"))
head(apc_clean$Year)
## [1] 1908 1912 1913 1913 1913 1915
## 106 Levels: 1908 1912 1913 1915 1916 1917 1918 1919 1920 1921 1922 ... 2017
head(apc_clean$Month)
## [1] 09 07 08 09 10 03
## Levels: 01 02 03 04 05 06 07 08 09 10 11 12
We can collect state & city information from “Location” column.
# We add new "State" & "City" column
apc_clean$State <- sapply(apc_clean$Location, as.character)
apc_clean$City <- sapply(apc_clean$Location, as.character)
# Seperate State & City
apc_clean$State <- str_trim(gsub(".*,", "", apc_clean$State))
apc_clean$City <- str_trim(gsub(",.*", "", apc_clean$City))
head(apc_clean$State)
## [1] "Virginia" "New Jersey" "Canada"
## [4] "Over the North Sea" "Germany" "Belgium"
head(apc_clean$City)
## [1] "Fort Myer" "AtlantiCity" "Victoria"
## [4] "Over the North Sea" "Near Johannisthal" "Tienen"
# Delete unnecessary "Location" column
apc_clean$Location <- NULL
Very nice. Now we are going to label flights as “Civilian” or “Military”. If “IsMilitary” equals to 1, that means that flight operator is a Military institution, otherwise it is a Civilian flight.
military_keywords <- c("Military", "Army", "Navy")
apc_clean$IsMilitary <- ifelse(grepl(paste(military_keywords, collapse = "|"), apc_clean$Operator),1,0)
# Number of Military Flights
sum(apc_clean$IsMilitary)
## [1] 814
Wonderful. Let’s find out how many passengers survived for each crash.
apc_clean$Survived <- apc_clean$Aboard - apc_clean$Fatalities
head(apc_clean$Survived)
## [1] 1 0 0 6 0 20
We can determine Source, Destination and number of stops from “Route” column.
apc_clean$Source <- gsub(" -.*", "", apc_clean$Route)
apc_clean$Destination <- gsub(".* -", "", apc_clean$Route)
apc_clean$Stops <- str_count(apc_clean$Route,"-")-1
# Number of flights with no Stops
length(which(apc_clean$Stops==0))
## [1] 3262
# Number of flights with no Route information
length(which(apc_clean$Stops<0))
## [1] 1954
# Number of flights with Stops
length(which(apc_clean$Stops>0))
## [1] 327
Lastly, we reordered the columns in apc_clean data frame.
apc_clean <- apc_clean[,c(1,2,9,10,11,12,3,13,4,15,16,17,5,6,7,14,8)]
# Latest structure of apc_clean
str(apc_clean)
## 'data.frame': 5543 obs. of 17 variables:
## $ Date : Date, format: "1908-09-17" "1912-07-12" ...
## $ Time : Factor w/ 32 levels "","00","01","02",..: 21 8 1 22 14 3 19 1 3 29 ...
## $ Year : Factor w/ 106 levels "1908","1912",..: 1 2 3 3 3 4 4 5 5 5 ...
## $ Month : Factor w/ 12 levels "01","02","03",..: 9 7 8 9 10 3 9 7 9 10 ...
## $ State : chr "Virginia" "New Jersey" "Canada" "Over the North Sea" ...
## $ City : chr "Fort Myer" "AtlantiCity" "Victoria" "Over the North Sea" ...
## $ Operator : chr "Military - U.S. Army" "Military - U.S. Navy" "Private" "Military - German Navy" ...
## $ IsMilitary : num 1 1 0 1 1 1 1 1 1 1 ...
## $ Route : Factor w/ 3470 levels "","Abaco - Miami",..: 830 2982 1 1 1 1 1 1 1 1 ...
## $ Source : chr "Demonstration" "Test flight" "" "" ...
## $ Destination: chr "Demonstration" "Test flight" "" "" ...
## $ Stops : num -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ Type : Factor w/ 2614 levels "","AAC-1 Toucan",..: 2419 1145 1024 2434 2437 2446 2433 2148 2439 2438 ...
## $ Aboard : int 2 5 1 20 30 41 19 20 22 19 ...
## $ Fatalities : int 1 5 1 14 30 21 19 20 22 19 ...
## $ Survived : int 1 0 0 6 0 20 0 0 0 0 ...
## $ Summary : chr "During a demonstration flight, a U.S. Army flyer flown by Orville Wright nose-dived into the ground from a heig"| __truncated__ "First U.S. dirigible Akron exploded just offshore at an altitude of 1,000 ft. during a test flight." "The first fatal airplane accident in Canada occurred when American barnstormer, John M. Bryant, California aviator was killed." "The airship flew into a thunderstorm and encountered a severe downdraft crashing 20 miles north of Helgoland Is"| __truncated__ ...
You can find references for preprocessing from here.
Dataset: Airplane crashes from 1908 to 2017. Contains 17 variables, with 5.543 observations.
Here are the variables in our Airplane Crashes dataset and their explanations:
Understanding statistics and characteristics of past airplane crashes
Visualizing results of exploratory analysis
Demonstrating geolocations of airplane crashes on a static map
Developing and demonstrating R & RMarkdown skills of R_Junkies
We studied most of the kernels on Kaggle. We hope that at the end of our research, we can share our findings on Kaggle. :)
Let’s find out how many passengers died over years.
total_fatalities <- apc_clean %>%
filter(IsMilitary==0) %>%
group_by(Year) %>%
summarize(total_fat=sum(Fatalities)) %>%
mutate(total_fat = ifelse(is.na(total_fat), 0, total_fat))
ggplot(total_fatalities, aes(x=Year,y=total_fat,group = 1)) +
geom_line(aes(y=total_fat), size=0.5, alpha=1, colour="#7a1c1c") +
geom_area(fill = "lightblue") +
ggtitle("Fatalities of Civil Flights Over Years") +
geom_text(aes(label = paste(as.character(round(total_fat/1000,1)), "K")), check_overlap = TRUE, size=2.7, colour = "black", angle=30,nudge_y = -40) +
scale_x_discrete(breaks = levels(total_fatalities$Year)[c(T, rep(F, 5))]) +
theme (axis.text.x=element_text (angle=60,vjust=1, hjust=1)) +
theme(plot.title = element_text(size=11),
axis.text=element_text(size=8)) +
labs(x = "Years", y = "Fatalities")
As we can see, until 90’s the trend increases, but later it seems to decline: Less passengers died in airplane crashes.
top_ten_plane <- apc_clean %>%
group_by(Type) %>%
summarise(Freq = n()) %>%
arrange(desc(Freq)) %>%
top_n(10)
## Selecting by Freq
top_ten_plane$Type <- substr(top_ten_plane$Type,1,19)
top_ten_plane
## # A tibble: 10 x 2
## Type Freq
## <chr> <int>
## 1 Douglas DC-3 334
## 2 de Havilland Canada 86
## 3 Douglas C-47A 74
## 4 Douglas C-47 62
## 5 Douglas DC-4 40
## 6 Yakovlev YAK-40 37
## 7 Antonov AN-26 36
## 8 Junkers JU-52/3m 32
## 9 Douglas C-47B 29
## 10 De Havilland DH-4 28
top_ten_mil_plane <- apc_clean %>%
filter(IsMilitary==1) %>%
group_by(Type) %>%
summarise(Freq = n()) %>%
arrange(desc(Freq)) %>%
top_n(10)
## Selecting by Freq
# Slicing long strings
top_ten_mil_plane$Type <- substr(top_ten_mil_plane$Type,1,19)
top_ten_mil_plane
## # A tibble: 10 x 2
## Type Freq
## <chr> <int>
## 1 Douglas C-47 27
## 2 Lockheed C-130H 16
## 3 Antonov AN-26 15
## 4 Boeing KC-135A 15
## 5 Douglas C-47B 14
## 6 Lockheed C-130E Her 14
## 7 Lockheed C-130H Her 13
## 8 Douglas C-47A 12
## 9 Antonov AN-12 10
## 10 Mil Mi-8 (helicopte 9
top_ten_civ_plane <- apc_clean %>%
filter(IsMilitary==0) %>%
group_by(Type) %>%
summarise(Freq = n()) %>%
arrange(desc(Freq)) %>%
top_n(10)
## Selecting by Freq
# Slicing long strings
top_ten_civ_plane$Type <- substr(top_ten_civ_plane$Type,1,19)
top_ten_civ_plane
## # A tibble: 10 x 2
## Type Freq
## <chr> <int>
## 1 Douglas DC-3 333
## 2 de Havilland Canada 82
## 3 Douglas C-47A 62
## 4 Douglas DC-4 39
## 5 Yakovlev YAK-40 36
## 6 Douglas C-47 35
## 7 Junkers JU-52/3m 32
## 8 De Havilland DH-4 28
## 9 Cessna 208B Grand C 27
## 10 Douglas DC-6B 26
ggplot(top_ten_plane, aes(x=reorder(Type, -Freq), y=Freq)) +
geom_bar(stat = "identity", fill = "#2780E3") +
labs(title="Top 10 Airplanes by Crash Count",x="Airplanes",y="Frequency",fill="") +
theme (axis.text.x=element_text (angle=60,vjust=1, hjust=1))
top_ten_mil_plane_plot <- ggplot(top_ten_mil_plane, aes(x=reorder(Type, -Freq), y=Freq)) +
geom_bar(stat = "identity", fill = "#2780E3") +
labs(title="Top 10 Military Airplanes by Crash Count",x="Airplanes",y="Frequency",fill="") +
theme (axis.text.x=element_text (angle=60,vjust=1, hjust=1)) +
theme(plot.title = element_text(size=11),
axis.text=element_text(size=8))
top_ten_civ_plane_plot <- ggplot(top_ten_civ_plane, aes(x=reorder(Type, -Freq), y=Freq)) +
geom_bar(stat = "identity", fill = "#2780E3") +
labs(title="Top 10 Civilian Airplanes by Crash Count",x="Airplanes",y="Frequency",fill="") +
theme (axis.text.x=element_text (angle=60,vjust=1, hjust=1)) +
theme(plot.title = element_text(size=11),
axis.text=element_text(size=8))
grid.arrange(top_ten_mil_plane_plot, top_ten_civ_plane_plot, ncol=2)
# Total departures in Million
total_dprt <-
air_dprt %>%
filter(Time >= 1970 & Time <= 2016) %>%
mutate(dep_val = ifelse(is.na(Value), 0, Value)) %>%
group_by(Time) %>% summarize(total_dep=sum(dep_val)/1000000)
# # of Accidents and Fatalities
Acc_fat_data <- apc_clean %>% filter(as.numeric(as.character(Year)) >= 1970 & as.numeric(as.character(Year)) <= 2016) %>% group_by(Year)%>% summarise(n=sum(ifelse(is.na(Aboard), 0, Aboard)),f=sum(ifelse(is.na(Fatalities), 0, Fatalities)))
# Combining Departures and Accident-Fatality Data
dep_acc_fat <- data.frame(year=total_dprt$Time,
Departure = total_dprt$total_dep,
Accident = Acc_fat_data$n,
Fatality = Acc_fat_data$f,
Rate = Acc_fat_data$n/total_dprt$total_dep)
# Normalizer for plot
normalizer <- max(dep_acc_fat$Accident)/max(dep_acc_fat$Departure)
ggplot(dep_acc_fat, aes(y=Accident/normalizer, x=year)) +
geom_col(aes(y = Departure), fill = "#2780E3") +
geom_line(size=1, alpha=1, colour = "#7a1c1c") +
ggtitle("Departures [Bar] & Accidents [Line]") +
scale_y_continuous(sec.axis = sec_axis(trans= ~.*normalizer, name = 'Departures')) +
theme(axis.text.y = element_text(colour="black", size=12),
axis.text.x=element_text(angle=60, vjust=1, hjust=1, size=8),
axis.title=element_text(colour="black", size=12),
legend.position = "left" ) +
labs(y="Accidents", x="Year")
top_six_civ_op_crash <- apc_clean %>%
filter(IsMilitary==0) %>%
group_by(Operator) %>%
summarise(SumCrashes= n()) %>%
arrange(desc(SumCrashes)) %>%
top_n(6)
## Selecting by SumCrashes
top_six_civ_op_fat <- apc_clean %>%
filter(IsMilitary==0) %>%
group_by(Operator) %>%
summarise(SumFats= sum(Fatalities)) %>%
arrange(desc(SumFats)) %>%
top_n(6)
## Selecting by SumFats
top_six_civ_op_crash_oy <- apc_clean %>%
filter(Operator %in% top_six_civ_op_crash$Operator) %>%
group_by(Operator, Year) %>%
summarise(SumFatalities= sum(Fatalities)) %>%
arrange(desc(Year), desc(SumFatalities))
top_six_civ_op_fat_oy <- apc_clean %>%
filter(Operator %in% top_six_civ_op_fat$Operator) %>%
group_by(Operator, Year) %>%
summarise(SumFatalities= sum(Fatalities)) %>%
arrange(desc(Year), desc(SumFatalities))
top_six_civ_op_crash
## # A tibble: 6 x 2
## Operator SumCrashes
## <chr> <int>
## 1 Aeroflot 179
## 2 Air France 70
## 3 Deutsche Lufthansa 65
## 4 Air Taxi 44
## 5 China National Aviation Corporation 44
## 6 United Air Lines 44
top_six_civ_op_fat
## # A tibble: 6 x 2
## Operator SumFats
## <chr> <int>
## 1 Aeroflot 7156
## 2 American Airlines 1421
## 3 Pan American World Airways 1302
## 4 United Air Lines 1021
## 5 AVIANCA 941
## 6 Turkish Airlines (THY) 891
top_six_civ_op_crash_oy
## # A tibble: 157 x 3
## # Groups: Operator [6]
## Operator Year SumFatalities
## <chr> <fct> <int>
## 1 Air France 2009 228
## 2 Aeroflot 2008 88
## 3 Air France 2005 0
## 4 Air Taxi 2004 6
## 5 Air Taxi 2002 4
## 6 United Air Lines 2001 109
## 7 Air Taxi 2001 6
## 8 Air France 2000 109
## 9 Air Taxi 2000 3
## 10 Air France 1998 53
## # ... with 147 more rows
top_six_civ_op_fat_oy
## # A tibble: 151 x 3
## # Groups: Operator [6]
## Operator Year SumFatalities
## <chr> <fct> <int>
## 1 Aeroflot 2008 88
## 2 Turkish Airlines (THY) 2003 75
## 3 American Airlines 2001 416
## 4 United Air Lines 2001 109
## 5 American Airlines 2000 1
## 6 American Airlines 1999 11
## 7 Turkish Airlines (THY) 1999 6
## 8 United Air Lines 1997 1
## 9 Aeroflot 1996 2
## 10 American Airlines 1995 160
## # ... with 141 more rows
ggplot(top_six_civ_op_crash_oy, aes(x=Year)) +
geom_line(aes(y=SumFatalities, group = top_six_civ_op_crash_oy$Operator, colour=top_six_civ_op_crash_oy$Operator), size=0.5, alpha=1) +
ggtitle("Fatalities of Top 6 Operators with Most Crashes Over Years") +
scale_x_discrete(breaks = levels(top_six_civ_op_crash_oy$Year)[c(T, rep(F, 3))]) +
theme (axis.text.x=element_text (angle=60,vjust=1, hjust=1)) +
theme(plot.title = element_text(size=11),
axis.text=element_text(size=8),
legend.position="bottom") +
labs(x = "Years", y = "Fatalities", colour = "Operators")
ggplot(top_six_civ_op_fat_oy, aes(x=Year)) +
geom_line(aes(y=SumFatalities, group = top_six_civ_op_fat_oy$Operator, colour=top_six_civ_op_fat_oy$Operator), size=0.5, alpha=1) +
ggtitle("Fatalities of Top 6 Operators with Most Fatalities Over Years") +
scale_x_discrete(breaks = levels(top_six_civ_op_fat_oy$Year)[c(T, rep(F, 3))]) +
theme (axis.text.x=element_text (angle=60,vjust=1, hjust=1)) +
theme(plot.title = element_text(size=11),
axis.text=element_text(size=8),
legend.position="bottom") +
labs(x = "Years", y = "Fatalities", colour = "Operators")
One of the feedbacks we had is that the line graphs “Fatalities of Top 6 Operators with Most Crashes/Fatalities Over Years” are not so easy to interpret. For that reason, we decided to convert them to a new heat map and show fatalities of top 10 civil operators with most fatalities over decades. We will generate a new data frame from our clean data using dplyr, and visualise it with ggplot2.
Now we generate a new data frame, ‘hm_df’ from apc_clean. hm_df will contain Operator, YearRange, Fatalities. And we need complete cases for visualization.
hm_df <- apc_clean %>%
select(Operator, Year, Fatalities, IsMilitary) %>%
filter(IsMilitary == 0) %>%
select(Operator, Year, Fatalities) %>%
filter(Operator != "")
hm_df <- hm_df[complete.cases(hm_df), ]
Now we need to generate a new column, ‘YearRange’ in order to find decades. We will use group_by.
hm_df <- hm_df %>%
mutate(Year = as.numeric(as.character(Year))) %>%
mutate(YearRange = paste( as.character( Year - (Year %% 10)),
as.character(Year - (Year %% 10) + 9),
sep = " - ")) %>%
select(-Year) %>%
group_by(Operator,YearRange) %>%
summarise(Fatalities = sum(Fatalities)) %>%
filter(Fatalities > 0) %>%
arrange(Operator, YearRange)
We want to show only top 10 operators with most fatalities. So we are going to find them. We will name it as civ_op_topten_fat.
civ_op_topten_fat <- apc_clean %>%
filter(IsMilitary == 0) %>%
select(Operator, Fatalities) %>%
group_by(Operator) %>%
summarise(Fatalities = sum(Fatalities)) %>%
arrange(desc(Fatalities)) %>%
slice(1:10) %>%
select(Operator)
Beautiful. Let’s join hm_df with civ_op_topten_fat!
hm_df <- hm_df %>% inner_join(., civ_op_topten_fat, by='Operator')
head(hm_df)
## # A tibble: 6 x 3
## # Groups: Operator [1]
## Operator YearRange Fatalities
## <chr> <chr> <int>
## 1 Aeroflot 1940 - 1949 24
## 2 Aeroflot 1950 - 1959 414
## 3 Aeroflot 1960 - 1969 951
## 4 Aeroflot 1970 - 1979 3512
## 5 Aeroflot 1980 - 1989 1814
## 6 Aeroflot 1990 - 1999 353
Since our data frame is ready for visualization, we can start using ggplot2.
ggplot(hm_df, aes(YearRange, Operator)) +
theme_classic() +
geom_tile(aes(fill = Fatalities), colour = "black") +
scale_fill_gradient(low = "#d0dee8", high = "#152a3a") +
theme(panel.background = element_rect(fill = "#e3e9ed", colour = "black")) +
labs(title='Fatalities of Top 10 Civil Operators Over Decades [With Aeroflot]', x='Decades', y='Operators') +
theme(plot.title = element_text(hjust = 0.0, size=11)) +
theme(axis.text.x=element_text(angle = 45, hjust = 1, size = 8))
Not so good! The variation between fatalities can’t be observed. When we searched for a reason, we figured out that there is an outlier: Aeroflot! In 70s they got more than 3K fatalities! It is very sad story for Aeroflot. Anyways, we decided to remove Aeroflot from visualization and generated a new heatmap.
hm_df <- hm_df %>% filter(Operator != 'Aeroflot')
ggplot(hm_df, aes(YearRange, Operator)) +
theme_classic() +
geom_tile(aes(fill = Fatalities), colour = "black") +
scale_fill_gradient(low = "#d0dee8", high = "#152a3a") +
theme(panel.background = element_rect(fill = "#e3e9ed", colour = "black")) +
labs(title='Fatalities of Top 10 Civil Operators Over Decades [Without Aeroflot]', x='Decades', y='Operators') +
theme(plot.title = element_text(hjust = 0.0, size=11)) +
theme(axis.text.x=element_text(angle = 45, hjust = 1, size = 8))
The heatmap is now better than the line graph we used early. As we have discussed in class, between 70s and 90s the world experienced lots of airplane crashes with lots of fatalities! But now this trend reverted to a decline. (Don’t overlook THY in 70s! There is a terrible crash in 1974.)
We wondered Turkish Airlines’ accidents and fatalities over time.
thy <- apc_clean %>%
filter(grepl("Turkish",Operator) & IsMilitary ==0) %>%
group_by(Operator, Year, Month) %>%
summarize(TotalAboard = sum(Aboard), TotalFatalities = sum(Fatalities)) %>%
arrange(Year, Month) %>%
mutate(MonthName = month.abb[Month]) %>%
select(-Month)
p <- plot_ly(thy, x = ~Year, y = ~TotalAboard, type = 'scatter', mode = 'markers', size = ~TotalFatalities, color = ~MonthName, colors = 'Paired',
marker = list(opacity = 1, sizemode = 'diameter')) %>%
layout(title = 'Turkish Airlines Stats Over Years',
xaxis = list(showgrid = FALSE),
yaxis = list(showgrid = FALSE),
showlegend = TRUE)
p
The ratio of fatalities over total passengers is important. We want to see the relation between passengers carried and passengers died over the time.
Total_Passenger_Byyears <- air_psgr %>%
filter(!is.na(Value)) %>%
group_by(Time) %>%
summarize(Value=sum(Value)) %>%
mutate(Year = as.character(Time),Passengers = Value) %>%
select(-Time,-Value)
Total_Fatalities_Byyears <- apc_clean %>%
filter(IsMilitary==0) %>%
group_by(Year) %>%
summarize(TotalFatalities = sum(Fatalities)) %>%
mutate(TotalFatalities = ifelse(is.na(TotalFatalities), 0, TotalFatalities))
Total_Fatalities_Byyears$Year = as.character(Total_Fatalities_Byyears$Year)
death <- inner_join(Total_Passenger_Byyears,Total_Fatalities_Byyears, by = "Year")
death <-death %>%
mutate(DRate = TotalFatalities/Passengers*1000000)
ggplot(death, aes(x=Year,y=DRate,group = 1)) +
geom_line(aes(y=DRate), size=0.5, alpha=1, colour="#7a1c1c") +
ggtitle("Fatalities Over Total Passengers By Years") +
scale_x_discrete(breaks = levels(total_fatalities$Year)[c(T, rep(F, 5))]) +
theme (axis.text.x=element_text (angle=60,vjust=1, hjust=1)) +
theme(plot.title = element_text(size=11),
axis.text=element_text(size=8)) +
labs(x = "Years", y = "Rate")
It is very clear that more passengers were carried, but less passengers died over years. It’s a good sign of increase in flight security.
# Survival Rate
dataSurvival <- apc_clean %>%
filter(as.numeric(as.character(Year)) >= 1900 & Aboard > 0) %>%
group_by(Year) %>%
summarize(TotalFatalities=sum(Fatalities),
TotalAboard=sum(Aboard),
SurvivalRate=round(100*(TotalAboard-TotalFatalities)/TotalAboard,2),
Survival = (TotalAboard-TotalFatalities),
TotalAccident = n())
# Normalizer for dual-y axis
n <- max(dataSurvival$TotalAccident) /max(dataSurvival$SurvivalRate)
ggplot(dataSurvival, aes(y=SurvivalRate, x=Year)) +
geom_col(aes(y=TotalAccident/n), fill = "#2780E3") +
geom_line(aes(group = 1), size=0.7, alpha=1, colour = "#7a1c1c") +
geom_smooth(aes(group = 1), colour = "black") +
scale_y_continuous(sec.axis = sec_axis(trans= ~.*n, name = 'Total Accidents')) +
scale_x_discrete(breaks = levels(dataSurvival$Year)[c(T, rep(F, 3))]) +
theme(axis.text.x=element_text(angle=60, vjust=1, hjust=1, size=8))
## `geom_smooth()` using method = 'loess'
We observed that “Survival Rate” was very low in early 1900’s and increased over time. The trend reverts after beginning of millenium.
For each accident, there is an explanation in “Summary” column. By looking at this summary information, we tried to visuailize a word cloud.
text <- apc_clean$Summary
docs <- Corpus(VectorSource(text))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, PlainTextDocument)
# Convert text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove english common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove punctuations
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# Text stemming
docs <- tm_map(docs, stemDocument)
# Remove possible words that high frequency in text that no meaning for crash reason
# (After first try in word cloud, we realize that these words are the most frequent one)
docs <- tm_map(docs, removeWords, c("aircraft", "plane","flight", "crash","crashed"))
dtm <- TermDocumentMatrix(docs)
##Remove sparse terms
removeSparseTerms(dtm, 0.95)
## <<TermDocumentMatrix (terms: 38, documents: 5543)>>
## Non-/sparse entries: 20279/190355
## Sparsity : 90%
## Maximal term length: 8
## Weighting : term frequency (tf)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
#head(d, 10)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
Here are the most frequent 10 words, which may show the reason why the accident happened.
#Top 10 Word frequencies
ggplot(d[1:10,], aes(x=reorder(word, -freq), y=freq)) +
geom_bar(stat = "identity", fill = "#2780E3") +
labs(title="Top 10 Word frequencies",x="Words",y="Frequency",fill="") +
theme (axis.text.x=element_text (angle=60,vjust=1, hjust=1))
#How words associated with others. so we can understand causes
i <- 1
while (i < 6) {
print (findAssocs(dtm, terms = as.character(d[i,]$word), corlimit = 0.3))
i = i+1
}
## $pilot
## numeric(0)
##
## $land
## attempt gear
## 0.49 0.36
##
## $engin
## failur power
## 0.34 0.31
##
## $approach
## final
## 0.33
##
## $runway
## end overran
## 0.31 0.30
We prepare a world map that shows dangerous countries for flights.
# Crashes
apc_civilcrashes <- apc_clean %>%
select(State) %>%
filter(apc_clean$IsMilitary==0) %>%
group_by(State) %>%
summarise(Totalcrash=n())
apc_civilcrashes$State <- ifelse(apc_civilcrashes$State %in% state.name,"United States",apc_civilcrashes$State)
apc_civilcrashes <- apc_civilcrashes %>%
group_by(State) %>%
summarise(Totalcrash=sum(Totalcrash))
sPDF <- joinCountryData2Map(apc_civilcrashes, joinCode="NAME", nameJoinColumn="State")
## 175 codes from your data successfully matched countries in the map
## 226 codes from your data failed to match with a country code in the map
## 71 codes from the map weren't represented in your data
mapParams<-mapCountryData(sPDF
, nameColumnToPlot = "Totalcrash"
, mapTitle= "Most Dangerous Countries by Crashes"
, colourPalette=brewer.pal(9, "YlOrRd")
, missingCountryCol = "white"
, oceanCol = "lightsteelblue2"
, numCats=9
, catMethod="fixedWidth"
)
do.call(addMapLegend, c(mapParams
, legendLabels="all"
, legendIntervals="page"
))
# Fatalities
apc_civilfat <- apc_clean %>%
select(State,Fatalities) %>%
filter(apc_clean$IsMilitary==0) %>%
group_by(State) %>%
summarise(TotalFat=sum(Fatalities))
apc_civilfat$State <- ifelse(apc_civilfat$State %in% state.name,"United States",apc_civilfat$State)
apc_civilfat <- apc_civilfat %>%
group_by(State) %>%
summarise(TotalFat=sum(TotalFat))
sPDF <- joinCountryData2Map(apc_civilfat, joinCode="NAME", nameJoinColumn="State")
## 175 codes from your data successfully matched countries in the map
## 226 codes from your data failed to match with a country code in the map
## 71 codes from the map weren't represented in your data
mapParams<-mapCountryData(sPDF
, nameColumnToPlot = "TotalFat"
, mapTitle= "Most Dangerous Countries by Fatalities"
, colourPalette=brewer.pal(9, "YlOrRd")
, missingCountryCol = "white"
, oceanCol = "lightsteelblue2"
, numCats=9
, catMethod="fixedWidth"
)
do.call(addMapLegend, c(mapParams
, legendLabels="all"
, legendIntervals="page"
))
USA, Russia and Brazil seem to be the most dangerous places not to travel via plane!
We tried to answer 9 questions prepared for airplane crashes dataset. It seems that there are less airplane crashes nowadays. We hope that they will never happen in near & far future. :)
You are welcome to ask any questions or give suggestions. Thanks to Berk Orbay for his contributions and his instant replies. He enabled us to develop our coding skills.
We understood how to differentiate whether an operator is Military institution or a Civilian operator from adhok93’s kernel.
We got a lot of help from danielviray’s kernel in order to accomplish data preprocessing, but we used a different approach and generated our own columns.