In my own project, I am going to use the “Aircraft Crash Data” dataset, which can be found at https://vincentarelbundock.github.io/Rdatasets/datasets.html
Number of rows:
nrow(air_crash)
## [1] 5666
Number of columns:
ncol(air_crash)
## [1] 8
The structure of the data is:
str(air_crash)
## 'data.frame': 5666 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Date : Factor w/ 5100 levels "1908-09-17","1912-07-12",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ location : Factor w/ 4622 levels "?","1,200 miles off Dakar, Atlantic Ocean",..: 888 136 4486 3650 2426 4360 3359 2420 301 3818 ...
## $ operator : Factor w/ 2729 levels " Airlines PNG",..: 1733 1745 2022 1623 1623 1623 1623 1622 1623 1623 ...
## $ planeType: Factor w/ 2663 levels "?","A-7D Corsair",..: 2633 1232 1110 2648 2650 2661 2649 2332 2654 2653 ...
## $ Dead : int 1 5 1 14 30 21 19 20 22 19 ...
## $ Aboard : int 2 5 1 20 30 41 19 20 22 19 ...
## $ Ground : int 0 0 0 0 0 0 0 0 0 0 ...
A summary of the data is:
summary(air_crash)
## X Date location
## Min. : 1 1960-02-26: 4 Moscow, Russia : 18
## 1st Qu.:1417 1973-02-28: 4 Rio de Janeiro, Brazil: 16
## Median :2834 1976-08-28: 4 Sao Paulo, Brazil : 15
## Mean :2834 1988-08-31: 4 Manila, Philippines : 14
## 3rd Qu.:4250 1992-08-27: 4 Anchorage, Alaska : 13
## Max. :5666 2001-09-11: 4 Bogota, Colombia : 13
## (Other) :5642 (Other) :5577
## operator
## Aeroflot : 260
## Military - U.S. Air Force : 177
## Air France : 72
## Deutsche Lufthansa : 65
## China National Aviation Corporation: 44
## United Air Lines : 44
## (Other) :5004
## planeType Dead
## Douglas DC-3 : 340 Min. : 0.00
## de Havilland Canada DHC-6 Twin Otter 300: 85 1st Qu.: 3.00
## Douglas C-47A : 74 Median : 9.00
## Douglas C-47 : 67 Mean : 19.81
## Douglas DC-4 : 41 3rd Qu.: 22.00
## Yakovlev YAK-40 : 37 Max. :583.00
## (Other) :5022 NA's :11
## Aboard Ground
## Min. : 0.00 Min. : 0.000
## 1st Qu.: 5.00 1st Qu.: 0.000
## Median : 13.00 Median : 0.000
## Mean : 27.38 Mean : 1.544
## 3rd Qu.: 30.00 3rd Qu.: 0.000
## Max. :644.00 Max. :2750.000
## NA's :41 NA's :74
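The NA's rows in the summary show that Dead, Aboard and Ground contain missing values (11, 41 and 74 respectively), which matters for any later sum. As a minimal sketch, the counts can also be obtained directly with base R. The `toy` data frame and its values are made up for illustration and only mimic the numeric columns of `air_crash`.

```r
# Hypothetical toy data with the same numeric columns as air_crash
# (the values are invented for illustration)
toy <- data.frame(
  Dead   = c(1L, NA, 14L, 30L),
  Aboard = c(2L, 5L, NA, 30L),
  Ground = c(0L, NA, NA, 0L)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# sum() propagates NA unless na.rm = TRUE is given
total_with_na    <- sum(toy$Dead)                # NA
total_without_na <- sum(toy$Dead, na.rm = TRUE)  # 1 + 14 + 30 = 45
```

Either dropping the NA rows before summing or passing `na.rm = TRUE` avoids an NA total. The two differ only for a group whose values are all missing: the first drops the group, the second reports 0.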
I want to create a bar chart that shows the number of deaths in each year. To do this, I grouped the accidents by year, extracting the year from each accident date with the “substr” function and aggregating with the dplyr package. Rows with a missing death count are dropped first. The code and the result are below:
subset(air_crash, !is.na(Dead)) %>%
  group_by(Year = substr(Date, 1, 4)) %>%
  summarise(Dead = sum(Dead))
## # A tibble: 103 x 2
## Year Dead
## <chr> <int>
## 1 1908 1
## 2 1912 5
## 3 1913 45
## 4 1915 40
## 5 1916 108
## 6 1917 138
## 7 1918 65
## 8 1919 5
## 9 1920 24
## 10 1921 68
## # ... with 93 more rows
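To make the grouping step easier to follow, here is a minimal sketch on a small data frame. The `toy` data and its values are hypothetical. Only the column names `Date` and `Dead` and the `substr(Date, 1, 4)` idea come from the code above. Note that `substr` converts the factor to character before taking the first four characters, which is why `Year` comes out as `<chr>`.

```r
library(dplyr)

# Hypothetical toy data standing in for air_crash (values are invented)
toy <- data.frame(
  Date = factor(c("1908-09-17", "1912-07-12", "1912-08-01", "1913-03-05")),
  Dead = c(1L, 5L, NA, 45L)
)

deaths_by_year <- toy %>%
  filter(!is.na(Dead)) %>%              # drop rows with a missing death count
  group_by(Year = substr(Date, 1, 4)) %>%  # first four characters = year
  summarise(Dead = sum(Dead))           # total deaths per year

print(deaths_by_year)
```

Here the second 1912 row has a missing death count, so it is removed before grouping and 1912 sums to 5.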
Then I use the ggplot2 package to draw the bar chart. The code and the graphic are below:
ggplot(
  data = subset(air_crash, !is.na(Dead)) %>%
    group_by(Year = substr(Date, 1, 4)) %>%
    summarise(Dead = sum(Dead)),
  aes(x = Year, y = Dead)
) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90))
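One possible variation, sketched here under assumptions rather than taken from the analysis above: because `Year` is a character column, ggplot2 treats it as discrete and draws one tick label per year. Converting it to an integer gives a continuous axis with fewer labels. The `deaths_by_year` data frame below is hypothetical and only mimics the shape of the yearly summary. `geom_col()` is the ggplot2 shorthand for `geom_bar(stat = "identity")`.

```r
library(ggplot2)

# Hypothetical yearly totals in the same shape as the yearly summary
# (values are invented for illustration)
deaths_by_year <- data.frame(
  Year = c("1908", "1912", "1913", "1915"),
  Dead = c(1L, 5L, 45L, 40L)
)

# Convert the year from character to integer so the x-axis is continuous
deaths_by_year$Year <- as.integer(deaths_by_year$Year)

p <- ggplot(deaths_by_year, aes(x = Year, y = Dead)) +
  geom_col() +                      # bar heights taken directly from Dead
  labs(x = "Year", y = "Deaths")

print(p)
```

With a continuous axis, years that have no recorded accidents appear as gaps instead of being skipped, which may or may not be what is wanted.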