Tutorial: Ibis for dplyr users

R users familiar with dplyr, tidyr, and other packages in the Tidyverse are likely to find Ibis familiar.

In fact, some Ibis features were even inspired by similar features in the Tidyverse.

However, due to differences between Python and R and the design and goals of Ibis itself, you may notice some big differences right away:

  • No pipe: The handy magrittr pipe (%>%) or R’s newer native pipe (|>) don’t exist in Python so you instead have to chain sequences of operations together with a period (.). The . in Python is analogous to R’s $ which lets you access attributes and methods on objects.
  • No unquoted column names: Non-standard evaluation is common in R but not present in Python. To reference a column in Ibis, you can pass a string, property on a table (e.g., tbl.some_column), or you can make use of selectors.
  • Ibis is lazy by default: Similar to dbplyr and its collect() method, Ibis does not evaluate our queries until we call .to_pandas(). For the purposes of this document, we set ibis.options.interactive = True which limits results to 10 rows, executes automatically, and prints a nicely-formatted table.

Using the same example data and similar operations as in Introduction to dplyr, below you will find some examples of the more common dplyr and tidyr operations and their Ibis equivalents.

Loading Ibis

To start using dplyr in R we would run:

library(dplyr)

To load Ibis:

import ibis

And then also load and alias some helpers to make our code more concise:

import ibis.selectors as s
from ibis import _

Last, as mentioned above, to get Ibis to automatically execute our queries and show the results in a nicely-formatted table, we run:

ibis.options.interactive = True

Loading example data

In R, datasets are typically lazily loaded with packages. For instance, the starwars dataset is packaged with dplyr, but is not loaded in memory before you start using it. Ibis provides many datasets in the ibis.examples module. So to be able to use the starwars dataset, you can use:

starwars = ibis.examples.starwars.fetch()

Similar to dplyr, if we evaluate the name of a table, we get a nicely-formatted table:

starwars
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━┓
┃ name               ┃ height ┃ mass    ┃ hair_color    ┃ skin_color  ┃ … ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━┩
│ string             │ int64  │ float64 │ string        │ string      │ … │
├────────────────────┼────────┼─────────┼───────────────┼─────────────┼───┤
│ Luke Skywalker     │    172 │    77.0 │ blond         │ fair        │ … │
│ C-3PO              │    167 │    75.0 │ NULL          │ gold        │ … │
│ R2-D2              │     96 │    32.0 │ NULL          │ white, blue │ … │
│ Darth Vader        │    202 │   136.0 │ none          │ white       │ … │
│ Leia Organa        │    150 │    49.0 │ brown         │ light       │ … │
│ Owen Lars          │    178 │   120.0 │ brown, grey   │ light       │ … │
│ Beru Whitesun lars │    165 │    75.0 │ brown         │ light       │ … │
│ R5-D4              │     97 │    32.0 │ NULL          │ white, red  │ … │
│ Biggs Darklighter  │    183 │    84.0 │ black         │ light       │ … │
│ Obi-Wan Kenobi     │    182 │    77.0 │ auburn, white │ fair        │ … │
│ …                  │      … │       … │ …             │ …           │ … │
└────────────────────┴────────┴─────────┴───────────────┴─────────────┴───┘

In addition to printing a nicely-formatted table and automatically executing, setting ibis.options.interactive to True also causes our query to be limited to 10 rows. To get Ibis to give us all rows, we can directly call to_pandas and save the result as a pandas DataFrame:

starwars_df = starwars.to_pandas()

Which then gives us all of the data as a pandas DataFrame:

starwars_df
              name  height   mass hair_color  ... species films  vehicles starships
0   Luke Skywalker   172.0   77.0      blond  ...   Human  None      None      None
1            C-3PO   167.0   75.0       None  ...   Droid  None      None      None
2            R2-D2    96.0   32.0       None  ...   Droid  None      None      None
3      Darth Vader   202.0  136.0       none  ...   Human  None      None      None
4      Leia Organa   150.0   49.0      brown  ...   Human  None      None      None
..             ...     ...    ...        ...  ...     ...   ...       ...       ...
82             Rey     NaN    NaN      brown  ...   Human  None      None      None
83     Poe Dameron     NaN    NaN      brown  ...   Human  None      None      None
84             BB8     NaN    NaN       none  ...   Droid  None      None      None
85  Captain Phasma     NaN    NaN    unknown  ...    None  None      None      None
86   Padmé Amidala   165.0   45.0      brown  ...   Human  None      None      None

[87 rows x 14 columns]

Directly calling to_pandas and saving the result to a variable is useful for passing the results of Ibis table expressions to other packages (e.g., matplotlib.

Inspecting the dataset with head()

Just like in R, you can use head() to inspect the beginning of a dataset. You can also specify the number of rows you want to get back by using the parameter n (default n = 5).

In R:

head(starwars) # or starwars |> head()
# A tibble: 6 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu…
3 R2-D2         96    32 <NA>       white, bl… red             33   none  mascu…
4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
5 Leia Org…    150    49 brown      light      brown           19   fema… femin…
6 Owen Lars    178   120 brown, gr… light      blue            52   male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

With Ibis:

starwars.head(6)
┏━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━┓
┃ name           ┃ height ┃ mass    ┃ hair_color  ┃ skin_color  ┃ … ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━┩
│ string         │ int64  │ float64 │ string      │ string      │ … │
├────────────────┼────────┼─────────┼─────────────┼─────────────┼───┤
│ Luke Skywalker │    172 │    77.0 │ blond       │ fair        │ … │
│ C-3PO          │    167 │    75.0 │ NULL        │ gold        │ … │
│ R2-D2          │     96 │    32.0 │ NULL        │ white, blue │ … │
│ Darth Vader    │    202 │   136.0 │ none        │ white       │ … │
│ Leia Organa    │    150 │    49.0 │ brown       │ light       │ … │
│ Owen Lars      │    178 │   120.0 │ brown, grey │ light       │ … │
└────────────────┴────────┴─────────┴─────────────┴─────────────┴───┘

There is no tail() in Ibis because most databases do not support this operation.

Another method you can use to limit the number of rows returned by a query is limit() which also takes the n parameter.

starwars.limit(3)
┏━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━┓
┃ name           ┃ height ┃ mass    ┃ hair_color ┃ skin_color  ┃ eye_color ┃ … ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━┩
│ string         │ int64  │ float64 │ string     │ string      │ string    │ … │
├────────────────┼────────┼─────────┼────────────┼─────────────┼───────────┼───┤
│ Luke Skywalker │    172 │    77.0 │ blond      │ fair        │ blue      │ … │
│ C-3PO          │    167 │    75.0 │ NULL       │ gold        │ yellow    │ … │
│ R2-D2          │     96 │    32.0 │ NULL       │ white, blue │ red       │ … │
└────────────────┴────────┴─────────┴────────────┴─────────────┴───────────┴───┘

Filtering rows with filter()

Ibis, like dplyr, has filter to select rows based on conditions.

With dplyr:

starwars |>
  filter(skin_color == "light")
# A tibble: 11 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Leia Or…    150    49 brown      light      brown             19 fema… femin…
 2 Owen La…    178   120 brown, gr… light      blue              52 male  mascu…
 3 Beru Wh…    165    75 brown      light      blue              47 fema… femin…
 4 Biggs D…    183    84 black      light      brown             24 male  mascu…
 5 Lobot       175    79 none       light      blue              37 male  mascu…
 6 Cordé       157    NA brown      light      brown             NA fema… femin…
 7 Dormé       165    NA brown      light      brown             NA fema… femin…
 8 Raymus …    188    79 brown      light      brown             NA male  mascu…
 9 Rey          NA    NA brown      light      hazel             NA fema… femin…
10 Poe Dam…     NA    NA brown      light      brown             NA male  mascu…
11 Padmé A…    165    45 brown      light      brown             46 fema… femin…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

In Ibis:

starwars.filter(_.skin_color == "light")
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━┓
┃ name               ┃ height ┃ mass    ┃ hair_color  ┃ skin_color ┃ … ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━┩
│ string             │ int64  │ float64 │ string      │ string     │ … │
├────────────────────┼────────┼─────────┼─────────────┼────────────┼───┤
│ Leia Organa        │    150 │    49.0 │ brown       │ light      │ … │
│ Owen Lars          │    178 │   120.0 │ brown, grey │ light      │ … │
│ Beru Whitesun lars │    165 │    75.0 │ brown       │ light      │ … │
│ Biggs Darklighter  │    183 │    84.0 │ black       │ light      │ … │
│ Lobot              │    175 │    79.0 │ none        │ light      │ … │
│ Cordé              │    157 │     nan │ brown       │ light      │ … │
│ Dormé              │    165 │     nan │ brown       │ light      │ … │
│ Raymus Antilles    │    188 │    79.0 │ brown       │ light      │ … │
│ Rey                │   NULL │     nan │ brown       │ light      │ … │
│ Poe Dameron        │   NULL │     nan │ brown       │ light      │ … │
│ …                  │      … │       … │ …           │ …          │ … │
└────────────────────┴────────┴─────────┴─────────────┴────────────┴───┘

In dplyr, you can specify multiple conditions separated with , that are then combined with the & operator:

starwars |>
  filter(skin_color == "light", eye_color == "brown")
# A tibble: 7 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Leia Org…    150    49 brown      light      brown             19 fema… femin…
2 Biggs Da…    183    84 black      light      brown             24 male  mascu…
3 Cordé        157    NA brown      light      brown             NA fema… femin…
4 Dormé        165    NA brown      light      brown             NA fema… femin…
5 Raymus A…    188    79 brown      light      brown             NA male  mascu…
6 Poe Dame…     NA    NA brown      light      brown             NA male  mascu…
7 Padmé Am…    165    45 brown      light      brown             46 fema… femin…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

In Ibis, you can do the same by putting multiple conditions in a list:

starwars.filter([_.skin_color == "light", _.eye_color == "brown"])
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━┓
┃ name              ┃ height ┃ mass    ┃ hair_color ┃ skin_color ┃ … ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━┩
│ string            │ int64  │ float64 │ string     │ string     │ … │
├───────────────────┼────────┼─────────┼────────────┼────────────┼───┤
│ Leia Organa       │    150 │    49.0 │ brown      │ light      │ … │
│ Biggs Darklighter │    183 │    84.0 │ black      │ light      │ … │
│ Cordé             │    157 │     nan │ brown      │ light      │ … │
│ Dormé             │    165 │     nan │ brown      │ light      │ … │
│ Raymus Antilles   │    188 │    79.0 │ brown      │ light      │ … │
│ Poe Dameron       │   NULL │     nan │ brown      │ light      │ … │
│ Padmé Amidala     │    165 │    45.0 │ brown      │ light      │ … │
└───────────────────┴────────┴─────────┴────────────┴────────────┴───┘

In previous code, we used the _ helper we imported earlier. The _ is shorthand for the table returned by the previous step in the chained sequence of operations (in this case, starwars). We could have also written the more verbose form,

starwars.filter([starwars.skin_color == "light", starwars.eye_color == "brown"])
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━┓
┃ name              ┃ height ┃ mass    ┃ hair_color ┃ skin_color ┃ … ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━┩
│ string            │ int64  │ float64 │ string     │ string     │ … │
├───────────────────┼────────┼─────────┼────────────┼────────────┼───┤
│ Leia Organa       │    150 │    49.0 │ brown      │ light      │ … │
│ Biggs Darklighter │    183 │    84.0 │ black      │ light      │ … │
│ Cordé             │    157 │     nan │ brown      │ light      │ … │
│ Dormé             │    165 │     nan │ brown      │ light      │ … │
│ Raymus Antilles   │    188 │    79.0 │ brown      │ light      │ … │
│ Poe Dameron       │   NULL │     nan │ brown      │ light      │ … │
│ Padmé Amidala     │    165 │    45.0 │ brown      │ light      │ … │
└───────────────────┴────────┴─────────┴────────────┴────────────┴───┘

If you want to combine multiple conditions, in dplyr, you could do:

starwars |>
  filter(
      (skin_color == "light" & eye_color == "brown") |
       species == "Droid"
  )
# A tibble: 13 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 C-3PO       167    75 <NA>       gold       yellow           112 none  mascu…
 2 R2-D2        96    32 <NA>       white, bl… red               33 none  mascu…
 3 Leia Or…    150    49 brown      light      brown             19 fema… femin…
 4 R5-D4        97    32 <NA>       white, red red               NA none  mascu…
 5 Biggs D…    183    84 black      light      brown             24 male  mascu…
 6 IG-88       200   140 none       metal      red               15 none  mascu…
 7 Cordé       157    NA brown      light      brown             NA fema… femin…
 8 Dormé       165    NA brown      light      brown             NA fema… femin…
 9 R4-P17       96    NA none       silver, r… red, blue         NA none  femin…
10 Raymus …    188    79 brown      light      brown             NA male  mascu…
11 Poe Dam…     NA    NA brown      light      brown             NA male  mascu…
12 BB8          NA    NA none       none       black             NA none  mascu…
13 Padmé A…    165    45 brown      light      brown             46 fema… femin…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

In Ibis, this would be:

starwars.filter(
    ((_.skin_color == "light") & (_.eye_color == "brown")) |
    (_.species == "Droid")
)
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━┓
┃ name              ┃ height ┃ mass    ┃ hair_color ┃ skin_color  ┃ … ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━┩
│ string            │ int64  │ float64 │ string     │ string      │ … │
├───────────────────┼────────┼─────────┼────────────┼─────────────┼───┤
│ C-3PO             │    167 │    75.0 │ NULL       │ gold        │ … │
│ R2-D2             │     96 │    32.0 │ NULL       │ white, blue │ … │
│ R5-D4             │     97 │    32.0 │ NULL       │ white, red  │ … │
│ IG-88             │    200 │   140.0 │ none       │ metal       │ … │
│ Leia Organa       │    150 │    49.0 │ brown      │ light       │ … │
│ Biggs Darklighter │    183 │    84.0 │ black      │ light       │ … │
│ Cordé             │    157 │     nan │ brown      │ light       │ … │
│ Dormé             │    165 │     nan │ brown      │ light       │ … │
│ R4-P17            │     96 │     nan │ none       │ silver, red │ … │
│ BB8               │   NULL │     nan │ none       │ none        │ … │
│ …                 │      … │       … │ …          │ …           │ … │
└───────────────────┴────────┴─────────┴────────────┴─────────────┴───┘

Sorting your data with order_by()

To sort a column, dplyr has the verb arrange. For instance, to sort the column height using dplyr:

starwars |>
   arrange(height)
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Yoda         66    17 white      green      brown            896 male  mascu…
 2 Ratts T…     79    15 none       grey, blue unknown           NA male  mascu…
 3 Wicket …     88    20 brown      brown      brown              8 male  mascu…
 4 Dud Bolt     94    45 none       blue, grey yellow            NA male  mascu…
 5 R2-D2        96    32 <NA>       white, bl… red               33 none  mascu…
 6 R4-P17       96    NA none       silver, r… red, blue         NA none  femin…
 7 R5-D4        97    32 <NA>       white, red red               NA none  mascu…
 8 Sebulba     112    40 none       grey, red  orange            NA male  mascu…
 9 Gasgano     122    NA none       white, bl… black             NA male  mascu…
10 Watto       137    NA black      blue, grey yellow            NA male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Ibis has the order_by method, so to perform the same operation:

starwars.order_by(_.height)
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━┓
┃ name                  ┃ height ┃ mass    ┃ hair_color ┃ skin_color  ┃ … ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━┩
│ string                │ int64  │ float64 │ string     │ string      │ … │
├───────────────────────┼────────┼─────────┼────────────┼─────────────┼───┤
│ Yoda                  │     66 │    17.0 │ white      │ green       │ … │
│ Ratts Tyerell         │     79 │    15.0 │ none       │ grey, blue  │ … │
│ Wicket Systri Warrick │     88 │    20.0 │ brown      │ brown       │ … │
│ Dud Bolt              │     94 │    45.0 │ none       │ blue, grey  │ … │
│ R2-D2                 │     96 │    32.0 │ NULL       │ white, blue │ … │
│ R4-P17                │     96 │     nan │ none       │ silver, red │ … │
│ R5-D4                 │     97 │    32.0 │ NULL       │ white, red  │ … │
│ Sebulba               │    112 │    40.0 │ none       │ grey, red   │ … │
│ Gasgano               │    122 │     nan │ none       │ white, blue │ … │
│ Watto                 │    137 │     nan │ black      │ blue, grey  │ … │
│ …                     │      … │       … │ …          │ …           │ … │
└───────────────────────┴────────┴─────────┴────────────┴─────────────┴───┘

You might notice that while dplyr puts missing values at the end, Ibis places them at the top. This behavior can actually vary from backend to backend and is something to be aware of when using Ibis.

If you want to order using multiple variables, you can pass them as a list:

starwars.order_by([_.height, _.mass]) # or starwars.order_by(["height", "mass"])
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━┓
┃ name                  ┃ height ┃ mass    ┃ hair_color ┃ skin_color  ┃ … ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━┩
│ string                │ int64  │ float64 │ string     │ string      │ … │
├───────────────────────┼────────┼─────────┼────────────┼─────────────┼───┤
│ Yoda                  │     66 │    17.0 │ white      │ green       │ … │
│ Ratts Tyerell         │     79 │    15.0 │ none       │ grey, blue  │ … │
│ Wicket Systri Warrick │     88 │    20.0 │ brown      │ brown       │ … │
│ Dud Bolt              │     94 │    45.0 │ none       │ blue, grey  │ … │
│ R2-D2                 │     96 │    32.0 │ NULL       │ white, blue │ … │
│ R4-P17                │     96 │     nan │ none       │ silver, red │ … │
│ R5-D4                 │     97 │    32.0 │ NULL       │ white, red  │ … │
│ Sebulba               │    112 │    40.0 │ none       │ grey, red   │ … │
│ Gasgano               │    122 │     nan │ none       │ white, blue │ … │
│ Watto                 │    137 │     nan │ black      │ blue, grey  │ … │
│ …                     │      … │       … │ …          │ …           │ … │
└───────────────────────┴────────┴─────────┴────────────┴─────────────┴───┘

To order a column in descending order, there are two ways to do it. Note that missing values remain at the top.

starwars.order_by(_.height.desc()) # or: starwars.order_by(ibis.desc("height"))
┏━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━┓
┃ name         ┃ height ┃ mass    ┃ hair_color ┃ skin_color   ┃ … ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━┩
│ string       │ int64  │ float64 │ string     │ string       │ … │
├──────────────┼────────┼─────────┼────────────┼──────────────┼───┤
│ Yarael Poof  │    264 │     nan │ none       │ white        │ … │
│ Tarfful      │    234 │   136.0 │ brown      │ brown        │ … │
│ Lama Su      │    229 │    88.0 │ none       │ grey         │ … │
│ Chewbacca    │    228 │   112.0 │ brown      │ unknown      │ … │
│ Roos Tarpals │    224 │    82.0 │ none       │ grey         │ … │
│ Grievous     │    216 │   159.0 │ none       │ brown, white │ … │
│ Taun We      │    213 │     nan │ none       │ grey         │ … │
│ Rugor Nass   │    206 │     nan │ none       │ green        │ … │
│ Tion Medon   │    206 │    80.0 │ none       │ grey         │ … │
│ Darth Vader  │    202 │   136.0 │ none       │ white        │ … │
│ …            │      … │       … │ …          │ …            │ … │
└──────────────┴────────┴─────────┴────────────┴──────────────┴───┘

Selecting columns with select()

Ibis, like dplyr, has a select method to include or exclude columns:

With dplyr:

starwars |>
    select(hair_color)
# A tibble: 87 × 1
   hair_color   
   <chr>        
 1 blond        
 2 <NA>         
 3 <NA>         
 4 none         
 5 brown        
 6 brown, grey  
 7 brown        
 8 <NA>         
 9 black        
10 auburn, white
# ℹ 77 more rows

In Ibis:

starwars.select(_.hair_color)
┏━━━━━━━━━━━━━━━┓
┃ hair_color    ┃
┡━━━━━━━━━━━━━━━┩
│ string        │
├───────────────┤
│ blond         │
│ NULL          │
│ NULL          │
│ none          │
│ brown         │
│ brown, grey   │
│ brown         │
│ NULL          │
│ black         │
│ auburn, white │
│ …             │
└───────────────┘

Note: A common pitfall to be aware of when referencing column names in Ibis is when column names collide with built-in methods on the Ibis Table object, such as count. In this situation, you will have to reference count like table["count"] or _["count"].

dplyr also allows selecting more than one column at a time:

starwars |>
    select(hair_color, skin_color, eye_color)
# A tibble: 87 × 3
   hair_color    skin_color  eye_color
   <chr>         <chr>       <chr>    
 1 blond         fair        blue     
 2 <NA>          gold        yellow   
 3 <NA>          white, blue red      
 4 none          white       yellow   
 5 brown         light       brown    
 6 brown, grey   light       blue     
 7 brown         light       blue     
 8 <NA>          white, red  red      
 9 black         light       brown    
10 auburn, white fair        blue-gray
# ℹ 77 more rows

In Ibis, we can either quote the names:

starwars.select("hair_color", "skin_color", "eye_color")
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ hair_color    ┃ skin_color  ┃ eye_color ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ string        │ string      │ string    │
├───────────────┼─────────────┼───────────┤
│ blond         │ fair        │ blue      │
│ NULL          │ gold        │ yellow    │
│ NULL          │ white, blue │ red       │
│ none          │ white       │ yellow    │
│ brown         │ light       │ brown     │
│ brown, grey   │ light       │ blue      │
│ brown         │ light       │ blue      │
│ NULL          │ white, red  │ red       │
│ black         │ light       │ brown     │
│ auburn, white │ fair        │ blue-gray │
│ …             │ …           │ …         │
└───────────────┴─────────────┴───────────┘

Or use the _ helper:

starwars.select(_.hair_color, _.skin_color, _.eye_color)
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ hair_color    ┃ skin_color  ┃ eye_color ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ string        │ string      │ string    │
├───────────────┼─────────────┼───────────┤
│ blond         │ fair        │ blue      │
│ NULL          │ gold        │ yellow    │
│ NULL          │ white, blue │ red       │
│ none          │ white       │ yellow    │
│ brown         │ light       │ brown     │
│ brown, grey   │ light       │ blue      │
│ brown         │ light       │ blue      │
│ NULL          │ white, red  │ red       │
│ black         │ light       │ brown     │
│ auburn, white │ fair        │ blue-gray │
│ …             │ …           │ …         │
└───────────────┴─────────────┴───────────┘

To select columns by name based on a condition, dplyr has helpers such as:

  • starts_with(): Starts with a prefix.
  • ends_with(): Ends with a suffix.
  • contains(): Contains a literal string.

These and many more selectors are available in Ibis as well, with slightly different names:

starwars.select(s.startswith("h"))
┏━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ height ┃ hair_color    ┃ homeworld ┃
┡━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ int64  │ string        │ string    │
├────────┼───────────────┼───────────┤
│    172 │ blond         │ Tatooine  │
│    167 │ NULL          │ Tatooine  │
│     96 │ NULL          │ Naboo     │
│    202 │ none          │ Tatooine  │
│    150 │ brown         │ Alderaan  │
│    178 │ brown, grey   │ Tatooine  │
│    165 │ brown         │ Tatooine  │
│     97 │ NULL          │ Tatooine  │
│    183 │ black         │ Tatooine  │
│    182 │ auburn, white │ Stewjon   │
│      … │ …             │ …         │
└────────┴───────────────┴───────────┘
starwars.select(s.endswith("color"))
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ hair_color    ┃ skin_color  ┃ eye_color ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ string        │ string      │ string    │
├───────────────┼─────────────┼───────────┤
│ blond         │ fair        │ blue      │
│ NULL          │ gold        │ yellow    │
│ NULL          │ white, blue │ red       │
│ none          │ white       │ yellow    │
│ brown         │ light       │ brown     │
│ brown, grey   │ light       │ blue      │
│ brown         │ light       │ blue      │
│ NULL          │ white, red  │ red       │
│ black         │ light       │ brown     │
│ auburn, white │ fair        │ blue-gray │
│ …             │ …           │ …         │
└───────────────┴─────────────┴───────────┘
starwars.select(s.contains("world"))
┏━━━━━━━━━━━┓
┃ homeworld ┃
┡━━━━━━━━━━━┩
│ string    │
├───────────┤
│ Tatooine  │
│ Tatooine  │
│ Naboo     │
│ Tatooine  │
│ Alderaan  │
│ Tatooine  │
│ Tatooine  │
│ Tatooine  │
│ Tatooine  │
│ Stewjon   │
│ …         │
└───────────┘

See the Ibis Column Selectors documentation for the full list of selectors in Ibis.

Renaming columns with rename()

Ibis allows you to rename columns using rename() which provides similar functionality to rename() in dplyr.

In dplyr:

starwars |>
    rename("home_world" = "homeworld")
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: home_world <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

In Ibis, use rename and pass a dict of name mappings:

starwars.rename(home_world="homeworld")
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━┓
┃ name               ┃ height ┃ mass    ┃ hair_color    ┃ skin_color  ┃ … ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━┩
│ string             │ int64  │ float64 │ string        │ string      │ … │
├────────────────────┼────────┼─────────┼───────────────┼─────────────┼───┤
│ Luke Skywalker     │    172 │    77.0 │ blond         │ fair        │ … │
│ C-3PO              │    167 │    75.0 │ NULL          │ gold        │ … │
│ R2-D2              │     96 │    32.0 │ NULL          │ white, blue │ … │
│ Darth Vader        │    202 │   136.0 │ none          │ white       │ … │
│ Leia Organa        │    150 │    49.0 │ brown         │ light       │ … │
│ Owen Lars          │    178 │   120.0 │ brown, grey   │ light       │ … │
│ Beru Whitesun lars │    165 │    75.0 │ brown         │ light       │ … │
│ R5-D4              │     97 │    32.0 │ NULL          │ white, red  │ … │
│ Biggs Darklighter  │    183 │    84.0 │ black         │ light       │ … │
│ Obi-Wan Kenobi     │    182 │    77.0 │ auburn, white │ fair        │ … │
│ …                  │      … │       … │ …             │ …           │ … │
└────────────────────┴────────┴─────────┴───────────────┴─────────────┴───┘

Add new columns with mutate()

Ibis, like dplyr, uses the mutate verb to add columns.

In dplyr,

starwars |>
    mutate(height_m = height / 100) |>
    select(name, height_m)
# A tibble: 87 × 2
   name               height_m
   <chr>                 <dbl>
 1 Luke Skywalker         1.72
 2 C-3PO                  1.67
 3 R2-D2                  0.96
 4 Darth Vader            2.02
 5 Leia Organa            1.5 
 6 Owen Lars              1.78
 7 Beru Whitesun lars     1.65
 8 R5-D4                  0.97
 9 Biggs Darklighter      1.83
10 Obi-Wan Kenobi         1.82
# ℹ 77 more rows

In Ibis:

(
    starwars
        .mutate(height_m = _.height / 100)
        .select("name", "height_m")
)
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ name               ┃ height_m ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ string             │ float64  │
├────────────────────┼──────────┤
│ Luke Skywalker     │     1.72 │
│ C-3PO              │     1.67 │
│ R2-D2              │     0.96 │
│ Darth Vader        │     2.02 │
│ Leia Organa        │     1.50 │
│ Owen Lars          │     1.78 │
│ Beru Whitesun lars │     1.65 │
│ R5-D4              │     0.97 │
│ Biggs Darklighter  │     1.83 │
│ Obi-Wan Kenobi     │     1.82 │
│ …                  │        … │
└────────────────────┴──────────┘

A big difference between dplyr’s mutate and Ibis’ mutate is that, in Ibis, you have to chain separate mutate calls together when you reference newly-created columns in the same mutate whereas in dplyr, you can put them all in the same call. This makes Ibis’ mutate more similar to transform in base R.

In dplyr, we only need one mutate call:

starwars %>%
  mutate(
    height_m = height / 100,
    BMI = mass / (height_m^2)
  ) %>%
  select(BMI, everything())
# A tibble: 87 × 16
     BMI name      height  mass hair_color skin_color eye_color birth_year sex  
   <dbl> <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1  26.0 Luke Sky…    172    77 blond      fair       blue            19   male 
 2  26.9 C-3PO        167    75 <NA>       gold       yellow         112   none 
 3  34.7 R2-D2         96    32 <NA>       white, bl… red             33   none 
 4  33.3 Darth Va…    202   136 none       white      yellow          41.9 male 
 5  21.8 Leia Org…    150    49 brown      light      brown           19   fema…
 6  37.9 Owen Lars    178   120 brown, gr… light      blue            52   male 
 7  27.5 Beru Whi…    165    75 brown      light      blue            47   fema…
 8  34.0 R5-D4         97    32 <NA>       white, red red             NA   none 
 9  25.1 Biggs Da…    183    84 black      light      brown           24   male 
10  23.2 Obi-Wan …    182    77 auburn, w… fair       blue-gray       57   male 
# ℹ 77 more rows
# ℹ 7 more variables: gender <chr>, homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>, height_m <dbl>

In Ibis, for BMI to reference height_m, it needs to be in a separate mutate call:

(starwars
    .mutate(
        height_m = _.height / 100
    )
    .mutate(
        BMI = _.mass / (_.height_m**2)
    )
    .select("BMI", ~s.matches("BMI"))
)
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━┓
┃ BMI       ┃ name               ┃ height ┃ mass    ┃ hair_color    ┃ … ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━┩
│ float64   │ string             │ int64  │ float64 │ string        │ … │
├───────────┼────────────────────┼────────┼─────────┼───────────────┼───┤
│ 26.027582 │ Luke Skywalker     │    172 │    77.0 │ blond         │ … │
│ 26.892323 │ C-3PO              │    167 │    75.0 │ NULL          │ … │
│ 34.722222 │ R2-D2              │     96 │    32.0 │ NULL          │ … │
│ 33.330066 │ Darth Vader        │    202 │   136.0 │ none          │ … │
│ 21.777778 │ Leia Organa        │    150 │    49.0 │ brown         │ … │
│ 37.874006 │ Owen Lars          │    178 │   120.0 │ brown, grey   │ … │
│ 27.548209 │ Beru Whitesun lars │    165 │    75.0 │ brown         │ … │
│ 34.009990 │ R5-D4              │     97 │    32.0 │ NULL          │ … │
│ 25.082863 │ Biggs Darklighter  │    183 │    84.0 │ black         │ … │
│ 23.245985 │ Obi-Wan Kenobi     │    182 │    77.0 │ auburn, white │ … │
│         … │ …                  │      … │       … │ …             │ … │
└───────────┴────────────────────┴────────┴─────────┴───────────────┴───┘

Summarize values with aggregate()

To summarize tables, dplyr has the verbs summarise/summarize:

In dplyr:

starwars %>%
    summarise(height = mean(height, na.rm = TRUE))
# A tibble: 1 × 1
  height
   <dbl>
1   174.

In Ibis, the corresponding verb is aggregate:

starwars.aggregate(height = _.height.mean())
┏━━━━━━━━━━━━┓
┃ height     ┃
┡━━━━━━━━━━━━┩
│ float64    │
├────────────┤
│ 174.358025 │
└────────────┘

Note: Throughout this guide, where dplyr uses R generics, Ibis uses Python methods. In the previous code cell, aggregate is a method on a table and mean is a method on a column. If you want to perform aggregations on multiple columns, you can call the method that you want on the column you want to apply it to.

Join tables with left_join()

To demonstrate how to do joins with Ibis, we’ll load two more example datasets that also come from the example datasets included in dplyr:

band_members = ibis.examples.band_members.fetch()
band_instruments = ibis.examples.band_instruments.fetch()

In dplyr, we can perform a left join of these two tables like:

band_members |>
    left_join(band_instruments)
# A tibble: 3 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass  

In Ibis:

band_members.left_join(band_instruments, "name")
┏━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┓
┃ name   ┃ band    ┃ name_right ┃ plays  ┃
┡━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━┩
│ string │ string  │ string     │ string │
├────────┼─────────┼────────────┼────────┤
│ John   │ Beatles │ John       │ guitar │
│ Paul   │ Beatles │ Paul       │ bass   │
│ Mick   │ Stones  │ NULL       │ NULL   │
└────────┴─────────┴────────────┴────────┘

There are two main differences between Ibis and dplyr here:

  • Ibis requires us to explicitly specify our join key (“name”, in this example) whereas in dplyr, if the join key is missing, we get the natural join of the two tables which joins across all shared column names
  • Ibis keeps columns for join keys from each table whereas dplyr does not by default

To replicate the result we’d get by default in dplyr but using Ibis, we need to incorporate two other verbs we’ve already seen in this tutorial:

(
    band_members
        .left_join(band_instruments, "name")
        .select(~s.contains("_right"))
)
┏━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
┃ name   ┃ band    ┃ plays  ┃
┡━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
│ string │ string  │ string │
├────────┼─────────┼────────┤
│ John   │ Beatles │ guitar │
│ Paul   │ Beatles │ bass   │
│ Mick   │ Stones  │ NULL   │
└────────┴─────────┴────────┘

Pivot data with pivot_wider() and pivot_longer()

dplyr users are likely to be familiar with the pivot_wider and pivot_longer functions from the tidyr package which convert tables between wide and long formats, respectively.

pivot_longer in dplyr + tidyr:

library(tidyr)

starwars_colors <-
    starwars |>
        select(name, matches("color")) |>
        pivot_longer(matches("color"), names_to = "attribute", values_to = "color")

In Ibis:

starwars_colors = (
    starwars
        .select("name", s.matches("color"))
        .pivot_longer(s.matches("color"), names_to="attribute", values_to="color")
)

starwars_colors
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ name           ┃ attribute  ┃ color       ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ string         │ string     │ string      │
├────────────────┼────────────┼─────────────┤
│ Luke Skywalker │ hair_color │ blond       │
│ Luke Skywalker │ skin_color │ fair        │
│ Luke Skywalker │ eye_color  │ blue        │
│ C-3PO          │ hair_color │ NULL        │
│ C-3PO          │ skin_color │ gold        │
│ C-3PO          │ eye_color  │ yellow      │
│ R2-D2          │ hair_color │ NULL        │
│ R2-D2          │ skin_color │ white, blue │
│ R2-D2          │ eye_color  │ red         │
│ Darth Vader    │ hair_color │ none        │
│ …              │ …          │ …           │
└────────────────┴────────────┴─────────────┘

And pivot_wider:

starwars_colors |>
    pivot_wider(names_from = "attribute", values_from = "color")
# A tibble: 87 × 4
   name               hair_color    skin_color  eye_color
   <chr>              <chr>         <chr>       <chr>    
 1 Luke Skywalker     blond         fair        blue     
 2 C-3PO              <NA>          gold        yellow   
 3 R2-D2              <NA>          white, blue red      
 4 Darth Vader        none          white       yellow   
 5 Leia Organa        brown         light       brown    
 6 Owen Lars          brown, grey   light       blue     
 7 Beru Whitesun lars brown         light       blue     
 8 R5-D4              <NA>          white, red  red      
 9 Biggs Darklighter  black         light       brown    
10 Obi-Wan Kenobi     auburn, white fair        blue-gray
# ℹ 77 more rows

In Ibis:

(
    starwars_colors.
        pivot_wider(names_from="attribute", values_from="color")
)
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ name               ┃ hair_color    ┃ skin_color  ┃ eye_color ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ string             │ string        │ string      │ string    │
├────────────────────┼───────────────┼─────────────┼───────────┤
│ Luke Skywalker     │ blond         │ fair        │ blue      │
│ C-3PO              │ NULL          │ gold        │ yellow    │
│ R2-D2              │ NULL          │ white, blue │ red       │
│ Darth Vader        │ none          │ white       │ yellow    │
│ Leia Organa        │ brown         │ light       │ brown     │
│ Owen Lars          │ brown, grey   │ light       │ blue      │
│ Beru Whitesun lars │ brown         │ light       │ blue      │
│ R5-D4              │ NULL          │ white, red  │ red       │
│ Biggs Darklighter  │ black         │ light       │ brown     │
│ Obi-Wan Kenobi     │ auburn, white │ fair        │ blue-gray │
│ …                  │ …             │ …           │ …         │
└────────────────────┴───────────────┴─────────────┴───────────┘

Next Steps

Now that you’ve gotten an introduction to the common differences between dplyr and Ibis, head over to Getting started with ibis for a full introduction. If you’re familiar with SQL, check out Ibis for SQL users. If you’re familiar with pandas, take a look at Ibis for pandas users

Back to top