dplyr

Quick Introduction

Berk Orbay

MEF BDA 503

What is dplyr?

dplyr is an R package specialized on data manipulation. You can simply

  • select columns
  • filter rows according to criteria
  • arrange and sort rows
  • generate derivative columns with mutate and transmute
  • create pivot tables with group_by and summarise

which are essentially required to perform data manipulation tasks. For more complex tasks, dplyr also offers more commands.

 

Official page: https://dplyr.tidyverse.org/

Why dplyr?

dplyr is one of the most versatile data manipulation libraries ever built in any language.

  • Inspired by many best practices and learned from past mistakes. See here and here for more information.
  • It is a one-stop shop for most data manipulation processes such as pandas in Python.
  • Familiarization with SQL-like syntax.
  • Backend is written in C++. Thus, high speed computations.
  • Backed by a comprehensive set of complementary libraries (i.e, tidyverse) and used by many other libraries (e.g, dbplyr for SQL operations).

Who develops dplyr?

dplyr is created by Hadley Wickham in 2014. See more information about Hadley on his webpage https://hadley.nz/.

  • Since then a lot of developers contributed to the core dplyr. You can see a full list here.
  • dplyr and many associated libraries under tidyverse are being developed and supported by Posit (formerly known as RStudio).

See more information on its official webpage https://dplyr.tidyverse.org/.

Fundamentals

Important Full notes are on Book of EDA


##  If you have never installed dplyr before
##  install.packages("dplyr")

##  First load the library
library(dplyr)

## We are going to use starwars datasets to show examples
starwars
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

A data frame in dplyr is called a tibble. tibble has more and different functionality than a regular data.frame.

select

select simply displays columns

starwars %>% select(name, homeworld)
# A tibble: 87 × 2
   name               homeworld
   <chr>              <chr>    
 1 Luke Skywalker     Tatooine 
 2 C-3PO              Tatooine 
 3 R2-D2              Naboo    
 4 Darth Vader        Tatooine 
 5 Leia Organa        Alderaan 
 6 Owen Lars          Tatooine 
 7 Beru Whitesun lars Tatooine 
 8 R5-D4              Tatooine 
 9 Biggs Darklighter  Tatooine 
10 Obi-Wan Kenobi     Stewjon  
# ℹ 77 more rows

Notice the pipe operator (%>%)? It is used to chain dplyr (and other) commands.

See also rename on notes.

arrange

arrange simply sorts values (A-Z, 0-9 or reverse)

starwars %>%
    arrange(species, desc(height)) %>%
    select(name, species, height)
# A tibble: 87 × 3
   name            species  height
   <chr>           <chr>     <int>
 1 Ratts Tyerell   Aleena       79
 2 Dexter Jettster Besalisk    198
 3 Ki-Adi-Mundi    Cerean      198
 4 Mas Amedda      Chagrian    196
 5 Zam Wesell      Clawdite    168
 6 IG-88           Droid       200
 7 C-3PO           Droid       167
 8 R5-D4           Droid        97
 9 R2-D2           Droid        96
10 R4-P17          Droid        96
# ℹ 77 more rows

filter

filter show rows according to specified criteria

starwars %>%
    filter(species == "Droid") %>%
    arrange(desc(mass)) %>%
    select(name, species, homeworld, mass)
# A tibble: 6 × 4
  name   species homeworld  mass
  <chr>  <chr>   <chr>     <dbl>
1 IG-88  Droid   <NA>        140
2 C-3PO  Droid   Tatooine     75
3 R2-D2  Droid   Naboo        32
4 R5-D4  Droid   Tatooine     32
5 R4-P17 Droid   <NA>         NA
6 BB8    Droid   <NA>         NA

mutate

You can do operations with mutate and transmute (see on full notes)

starwars %>%
    select(name, height, mass) %>%
    filter(complete.cases(.)) %>%
    mutate(bmi = mass / ((height / 100)^2)) %>% ## body mass index
    arrange(desc(bmi))
# A tibble: 59 × 4
   name                  height  mass   bmi
   <chr>                  <int> <dbl> <dbl>
 1 Jabba Desilijic Tiure    175  1358 443. 
 2 Dud Bolt                  94    45  50.9
 3 Yoda                      66    17  39.0
 4 Owen Lars                178   120  37.9
 5 IG-88                    200   140  35  
 6 R2-D2                     96    32  34.7
 7 Grievous                 216   159  34.1
 8 R5-D4                     97    32  34.0
 9 Jek Tono Porkins         180   110  34.0
10 Darth Vader              202   136  33.3
# ℹ 49 more rows

group_by and summarise

group_by and summarise allows you to create summary (pivot) tables.

starwars %>%
    group_by(species) %>%
    summarise(mean_mass = mean(mass, na.rm = TRUE)) %>%
    ungroup() %>%
    arrange(desc(mean_mass))
# A tibble: 38 × 2
   species      mean_mass
   <chr>            <dbl>
 1 Hutt            1358  
 2 Kaleesh          159  
 3 Wookiee          124  
 4 Trandoshan       113  
 5 Besalisk         102  
 6 Neimodian         90  
 7 Kaminoan          88  
 8 Nautolan          87  
 9 Mon Calamari      83  
10 Human             82.8
# ℹ 28 more rows

Advanced dplyr

There is always more to learn. Here are some examples.

  • lead, lag to offset rows
  • pivot_wider, pivot_longer to transform to wide and long formats, respectively
  • separate to separate rows in a column according to separation rules (similar to Text to Columns in Excel)
  • *_join to join to data tables

See extra resources section on Book of EDA.

Thanks!

Course webpage https://mef-bda503.github.io/.