For my own project, I am going to use the “Aircraft Crash Data” dataset, which can be found at https://vincentarelbundock.github.io/Rdatasets/datasets.html
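The post does not show how `air_crash` was created, so here is a minimal sketch of one way to load it. The CSV address and the helper name `load_air_crash` are my assumptions (I believe the dataset is listed on that site as `airAccs` from the `gamclass` package, but please check the index page). `stringsAsFactors = TRUE` is set so that text columns become factors, as in the `str()` output further down.

```r
# Assumed CSV location of the dataset on the Rdatasets site (not stated in the post)
csv_url <- "https://vincentarelbundock.github.io/Rdatasets/csv/gamclass/airAccs.csv"

# Hypothetical helper: read the CSV and turn text columns into factors,
# which was the read.csv() default before R 4.0
load_air_crash <- function(path = csv_url) {
  read.csv(path, stringsAsFactors = TRUE)
}

# Usage (needs an internet connection):
# air_crash <- load_air_crash()
```

The unnamed first column of the CSV is the row index, which `read.csv` names `X`.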


Number of rows:

nrow(air_crash)
## [1] 5666

Number of columns:

ncol(air_crash)
## [1] 8

The structure of the data is:

str(air_crash)
## 'data.frame':    5666 obs. of  8 variables:
##  $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Date     : Factor w/ 5100 levels "1908-09-17","1912-07-12",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ location : Factor w/ 4622 levels "?","1,200 miles off Dakar, Atlantic Ocean",..: 888 136 4486 3650 2426 4360 3359 2420 301 3818 ...
##  $ operator : Factor w/ 2729 levels " Airlines  PNG",..: 1733 1745 2022 1623 1623 1623 1623 1622 1623 1623 ...
##  $ planeType: Factor w/ 2663 levels "?","A-7D Corsair",..: 2633 1232 1110 2648 2650 2661 2649 2332 2654 2653 ...
##  $ Dead     : int  1 5 1 14 30 21 19 20 22 19 ...
##  $ Aboard   : int  2 5 1 20 30 41 19 20 22 19 ...
##  $ Ground   : int  0 0 0 0 0 0 0 0 0 0 ...

A summary of the data is:

summary(air_crash)
##        X                Date                        location   
##  Min.   :   1   1960-02-26:   4   Moscow, Russia        :  18  
##  1st Qu.:1417   1973-02-28:   4   Rio de Janeiro, Brazil:  16  
##  Median :2834   1976-08-28:   4   Sao Paulo, Brazil     :  15  
##  Mean   :2834   1988-08-31:   4   Manila, Philippines   :  14  
##  3rd Qu.:4250   1992-08-27:   4   Anchorage, Alaska     :  13  
##  Max.   :5666   2001-09-11:   4   Bogota, Colombia      :  13  
##                 (Other)   :5642   (Other)               :5577  
##                                 operator   
##  Aeroflot                           : 260  
##  Military - U.S. Air Force          : 177  
##  Air France                         :  72  
##  Deutsche Lufthansa                 :  65  
##  China National Aviation Corporation:  44  
##  United Air Lines                   :  44  
##  (Other)                            :5004  
##                                     planeType         Dead       
##  Douglas DC-3                            : 340   Min.   :  0.00  
##  de Havilland Canada DHC-6 Twin Otter 300:  85   1st Qu.:  3.00  
##  Douglas C-47A                           :  74   Median :  9.00  
##  Douglas C-47                            :  67   Mean   : 19.81  
##  Douglas DC-4                            :  41   3rd Qu.: 22.00  
##  Yakovlev YAK-40                         :  37   Max.   :583.00  
##  (Other)                                 :5022   NA's   :11      
##      Aboard           Ground        
##  Min.   :  0.00   Min.   :   0.000  
##  1st Qu.:  5.00   1st Qu.:   0.000  
##  Median : 13.00   Median :   0.000  
##  Mean   : 27.38   Mean   :   1.544  
##  3rd Qu.: 30.00   3rd Qu.:   0.000  
##  Max.   :644.00   Max.   :2750.000  
##  NA's   :41       NA's   :74

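I want to create a bar chart that shows the number of deaths in each year. To do this, I first dropped the rows with a missing death count, then grouped the accidents by year, taking the year from the first four characters of each accident date with the “substr” function and summing the deaths with the dplyr package. The code and the result are below: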

library(dplyr)
subset(air_crash,!is.na(Dead)) %>% group_by(Year=substr(Date,1,4)) %>% summarise(Dead=sum(Dead))
## # A tibble: 103 x 2
##     Year  Dead
##    <chr> <int>
##  1  1908     1
##  2  1912     5
##  3  1913    45
##  4  1915    40
##  5  1916   108
##  6  1917   138
##  7  1918    65
##  8  1919     5
##  9  1920    24
## 10  1921    68
## # ... with 93 more rows
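To make the grouping step easier to follow, here is a small sketch of the same idea on a toy data frame, written in base R so it runs without extra packages. The values and the names `toy`, `known` and `deaths_by_year` are made up for illustration; `aggregate()` stands in for `group_by()` plus `summarise()`.

```r
# Toy data frame mimicking the Date and Dead columns of air_crash (values are made up)
toy <- data.frame(
  Date = factor(c("1908-09-17", "1912-07-12", "1912-10-01", "1913-08-06", "1913-09-09")),
  Dead = c(1L, 5L, NA, 14L, 30L)
)

# Keep only rows with a known death count, as subset(..., !is.na(Dead)) does
known <- toy[!is.na(toy$Dead), ]

# substr() converts the factor to character first,
# so the first four characters of each date are the year
known$Year <- substr(known$Date, 1, 4)

# Sum the deaths within each year
deaths_by_year <- aggregate(Dead ~ Year, data = known, FUN = sum)
print(deaths_by_year)
```

The row with the missing count is removed before summing. Without that step, `sum()` would return `NA` for any year that contains a missing value.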

Then I use the ggplot2 package to draw the bar chart. The code and the graphic are below:

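library(ggplot2)
# Bars show total deaths per year; x-axis labels are rotated 90 degrees so they do not overlap
ggplot(data=subset(air_crash,!is.na(Dead)) %>% group_by(Year=substr(Date,1,4)) %>% summarise(Dead=sum(Dead)),aes(x=Year,y=Dead)) +
  geom_bar(stat="identity") +
  theme(axis.text.x=element_text(angle=90))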
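One thing to keep in mind: because Year is a character column, years without any recorded crash (for example 1909 to 1911 in the table above) do not appear on the axis at all. A possible way to make those gaps visible is to convert Year to a number and fill the missing years with zero. This is only a sketch in base R on the first four rows of the summary. The names `deaths`, `all_years` and `filled` are mine.

```r
# Per-year totals copied from the first rows of the summary table above
deaths <- data.frame(Year = c("1908", "1912", "1913", "1915"),
                     Dead = c(1L, 5L, 45L, 40L))

# Convert the character year to an integer so it can be treated as a number
deaths$Year <- as.integer(deaths$Year)

# Build every year in the observed range, then left-join the totals onto it
all_years <- data.frame(Year = seq(min(deaths$Year), max(deaths$Year)))
filled <- merge(all_years, deaths, by = "Year", all.x = TRUE)

# Years without any recorded crash get a total of zero instead of NA
filled$Dead[is.na(filled$Dead)] <- 0L

print(filled)
```

A data frame like `filled` could then be given to the same ggplot call, where the numeric Year would be drawn on a continuous axis.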