library(dplyr)
Tutorial: Ibis for dplyr users
R users familiar with dplyr, tidyr, and other packages in the Tidyverse are likely to find Ibis familiar.
In fact, some Ibis features were even inspired by similar features in the Tidyverse.
However, due to differences between Python and R and the design and goals of Ibis itself, you may notice some big differences right away:
- No pipe: The handy magrittr pipe (
%>%
) or R’s newer native pipe (|>
) don’t exist in Python so you instead have to chain sequences of operations together with a period (.
). The.
in Python is analogous to R’s$
which lets you access attributes and methods on objects. - No unquoted column names: Non-standard evaluation is common in R but not present in Python. To reference a column in Ibis, you can pass a string, property on a table (e.g.,
tbl.some_column
), or you can make use of selectors. - Ibis is lazy by default: Similar to dbplyr and its
collect()
method, Ibis does not evaluate our queries until we call.to_pandas()
. For the purposes of this document, we setibis.options.interactive = True
which limits results to 10 rows, executes automatically, and prints a nicely-formatted table.
Using the same example data and similar operations as in Introduction to dplyr, below you will find some examples of the more common dplyr and tidyr operations and their Ibis equivalents.
Loading Ibis
To start using dplyr in R we would run:
To load Ibis:
import ibis
And then also load and alias some helpers to make our code more concise:
import ibis.selectors as s
from ibis import _
Last, as mentioned above, to get Ibis to automatically execute our queries and show the results in a nicely-formatted table, we run:
= True ibis.options.interactive
Loading example data
In R, datasets are typically lazily loaded with packages. For instance, the starwars
dataset is packaged with dplyr, but is not loaded in memory before you start using it. Ibis provides many datasets in the ibis.examples
module. So to be able to use the starwars
dataset, you can use:
= ibis.examples.starwars.fetch() starwars
Similar to dplyr, if we evaluate the name of a table, we get a nicely-formatted table:
starwars
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━┓
┃ name ┃ height ┃ mass ┃ hair_color ┃ skin_color ┃ … ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━┩
│ string │ int64 │ float64 │ string │ string │ … │
├────────────────────┼────────┼─────────┼───────────────┼─────────────┼───┤
│ Luke Skywalker │ 172 │ 77.0 │ blond │ fair │ … │
│ C-3PO │ 167 │ 75.0 │ NULL │ gold │ … │
│ R2-D2 │ 96 │ 32.0 │ NULL │ white, blue │ … │
│ Darth Vader │ 202 │ 136.0 │ none │ white │ … │
│ Leia Organa │ 150 │ 49.0 │ brown │ light │ … │
│ Owen Lars │ 178 │ 120.0 │ brown, grey │ light │ … │
│ Beru Whitesun lars │ 165 │ 75.0 │ brown │ light │ … │
│ R5-D4 │ 97 │ 32.0 │ NULL │ white, red │ … │
│ Biggs Darklighter │ 183 │ 84.0 │ black │ light │ … │
│ Obi-Wan Kenobi │ 182 │ 77.0 │ auburn, white │ fair │ … │
│ … │ … │ … │ … │ … │ … │
└────────────────────┴────────┴─────────┴───────────────┴─────────────┴───┘
In addition to printing a nicely-formatted table and automatically executing, setting ibis.options.interactive
to True
also causes our query to be limited to 10 rows. To get Ibis to give us all rows, we can directly call to_pandas
and save the result as a pandas DataFrame:
= starwars.to_pandas() starwars_df
Which then gives us all of the data as a pandas DataFrame:
starwars_df
name height mass hair_color ... species films vehicles starships
0 Luke Skywalker 172.0 77.0 blond ... Human None None None
1 C-3PO 167.0 75.0 None ... Droid None None None
2 R2-D2 96.0 32.0 None ... Droid None None None
3 Darth Vader 202.0 136.0 none ... Human None None None
4 Leia Organa 150.0 49.0 brown ... Human None None None
.. ... ... ... ... ... ... ... ... ...
82 Rey NaN NaN brown ... Human None None None
83 Poe Dameron NaN NaN brown ... Human None None None
84 BB8 NaN NaN none ... Droid None None None
85 Captain Phasma NaN NaN unknown ... None None None None
86 Padmé Amidala 165.0 45.0 brown ... Human None None None
[87 rows x 14 columns]
Directly calling to_pandas
and saving the result to a variable is useful for passing the results of Ibis table expressions to other packages (e.g., matplotlib
.
Inspecting the dataset with head()
Just like in R, you can use head()
to inspect the beginning of a dataset. You can also specify the number of rows you want to get back by using the parameter n
(default n = 5
).
In R:
head(starwars) # or starwars |> head()
# A tibble: 6 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sky… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth Va… 202 136 none white yellow 41.9 male mascu…
5 Leia Org… 150 49 brown light brown 19 fema… femin…
6 Owen Lars 178 120 brown, gr… light blue 52 male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
With Ibis:
6) starwars.head(
┏━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━┓
┃ name ┃ height ┃ mass ┃ hair_color ┃ skin_color ┃ … ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━┩
│ string │ int64 │ float64 │ string │ string │ … │
├────────────────┼────────┼─────────┼─────────────┼─────────────┼───┤
│ Luke Skywalker │ 172 │ 77.0 │ blond │ fair │ … │
│ C-3PO │ 167 │ 75.0 │ NULL │ gold │ … │
│ R2-D2 │ 96 │ 32.0 │ NULL │ white, blue │ … │
│ Darth Vader │ 202 │ 136.0 │ none │ white │ … │
│ Leia Organa │ 150 │ 49.0 │ brown │ light │ … │
│ Owen Lars │ 178 │ 120.0 │ brown, grey │ light │ … │
└────────────────┴────────┴─────────┴─────────────┴─────────────┴───┘
There is no tail()
in Ibis because most databases do not support this operation.
Another method you can use to limit the number of rows returned by a query is limit()
which also takes the n
parameter.
3) starwars.limit(
┏━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━┓
┃ name ┃ height ┃ mass ┃ hair_color ┃ skin_color ┃ eye_color ┃ … ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━┩
│ string │ int64 │ float64 │ string │ string │ string │ … │
├────────────────┼────────┼─────────┼────────────┼─────────────┼───────────┼───┤
│ Luke Skywalker │ 172 │ 77.0 │ blond │ fair │ blue │ … │
│ C-3PO │ 167 │ 75.0 │ NULL │ gold │ yellow │ … │
│ R2-D2 │ 96 │ 32.0 │ NULL │ white, blue │ red │ … │
└────────────────┴────────┴─────────┴────────────┴─────────────┴───────────┴───┘
Filtering rows with filter()
Ibis, like dplyr, has filter
to select rows based on conditions.
With dplyr:
|>
starwars filter(skin_color == "light")
# A tibble: 11 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Leia Or… 150 49 brown light brown 19 fema… femin…
2 Owen La… 178 120 brown, gr… light blue 52 male mascu…
3 Beru Wh… 165 75 brown light blue 47 fema… femin…
4 Biggs D… 183 84 black light brown 24 male mascu…
5 Lobot 175 79 none light blue 37 male mascu…
6 Cordé 157 NA brown light brown NA fema… femin…
7 Dormé 165 NA brown light brown NA fema… femin…
8 Raymus … 188 79 brown light brown NA male mascu…
9 Rey NA NA brown light hazel NA fema… femin…
10 Poe Dam… NA NA brown light brown NA male mascu…
11 Padmé A… 165 45 brown light brown 46 fema… femin…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
In Ibis:
filter(_.skin_color == "light") starwars.
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━┓
┃ name ┃ height ┃ mass ┃ hair_color ┃ skin_color ┃ … ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━┩
│ string │ int64 │ float64 │ string │ string │ … │
├────────────────────┼────────┼─────────┼─────────────┼────────────┼───┤
│ Leia Organa │ 150 │ 49.0 │ brown │ light │ … │
│ Owen Lars │ 178 │ 120.0 │ brown, grey │ light │ … │
│ Beru Whitesun lars │ 165 │ 75.0 │ brown │ light │ … │
│ Biggs Darklighter │ 183 │ 84.0 │ black │ light │ … │
│ Lobot │ 175 │ 79.0 │ none │ light │ … │
│ Cordé │ 157 │ nan │ brown │ light │ … │
│ Dormé │ 165 │ nan │ brown │ light │ … │
│ Raymus Antilles │ 188 │ 79.0 │ brown │ light │ … │
│ Rey │ NULL │ nan │ brown │ light │ … │
│ Poe Dameron │ NULL │ nan │ brown │ light │ … │
│ … │ … │ … │ … │ … │ … │
└────────────────────┴────────┴─────────┴─────────────┴────────────┴───┘
In dplyr, you can specify multiple conditions separated with ,
that are then combined with the &
operator:
|>
starwars filter(skin_color == "light", eye_color == "brown")
# A tibble: 7 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Leia Org… 150 49 brown light brown 19 fema… femin…
2 Biggs Da… 183 84 black light brown 24 male mascu…
3 Cordé 157 NA brown light brown NA fema… femin…
4 Dormé 165 NA brown light brown NA fema… femin…
5 Raymus A… 188 79 brown light brown NA male mascu…
6 Poe Dame… NA NA brown light brown NA male mascu…
7 Padmé Am… 165 45 brown light brown 46 fema… femin…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
In Ibis, you can do the same by putting multiple conditions in a list:
filter([_.skin_color == "light", _.eye_color == "brown"]) starwars.
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━┓
┃ name ┃ height ┃ mass ┃ hair_color ┃ skin_color ┃ … ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━┩
│ string │ int64 │ float64 │ string │ string │ … │
├───────────────────┼────────┼─────────┼────────────┼────────────┼───┤
│ Leia Organa │ 150 │ 49.0 │ brown │ light │ … │
│ Biggs Darklighter │ 183 │ 84.0 │ black │ light │ … │
│ Cordé │ 157 │ nan │ brown │ light │ … │
│ Dormé │ 165 │ nan │ brown │ light │ … │
│ Raymus Antilles │ 188 │ 79.0 │ brown │ light │ … │
│ Poe Dameron │ NULL │ nan │ brown │ light │ … │
│ Padmé Amidala │ 165 │ 45.0 │ brown │ light │ … │
└───────────────────┴────────┴─────────┴────────────┴────────────┴───┘
In previous code, we used the _
helper we imported earlier. The _
is shorthand for the table returned by the previous step in the chained sequence of operations (in this case, starwars
). We could have also written the more verbose form,
filter([starwars.skin_color == "light", starwars.eye_color == "brown"]) starwars.
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━┓
┃ name ┃ height ┃ mass ┃ hair_color ┃ skin_color ┃ … ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━┩
│ string │ int64 │ float64 │ string │ string │ … │
├───────────────────┼────────┼─────────┼────────────┼────────────┼───┤
│ Leia Organa │ 150 │ 49.0 │ brown │ light │ … │
│ Biggs Darklighter │ 183 │ 84.0 │ black │ light │ … │
│ Cordé │ 157 │ nan │ brown │ light │ … │
│ Dormé │ 165 │ nan │ brown │ light │ … │
│ Raymus Antilles │ 188 │ 79.0 │ brown │ light │ … │
│ Poe Dameron │ NULL │ nan │ brown │ light │ … │
│ Padmé Amidala │ 165 │ 45.0 │ brown │ light │ … │
└───────────────────┴────────┴─────────┴────────────┴────────────┴───┘
If you want to combine multiple conditions, in dplyr, you could do:
|>
starwars filter(
== "light" & eye_color == "brown") |
(skin_color == "Droid"
species )
# A tibble: 13 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 C-3PO 167 75 <NA> gold yellow 112 none mascu…
2 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
3 Leia Or… 150 49 brown light brown 19 fema… femin…
4 R5-D4 97 32 <NA> white, red red NA none mascu…
5 Biggs D… 183 84 black light brown 24 male mascu…
6 IG-88 200 140 none metal red 15 none mascu…
7 Cordé 157 NA brown light brown NA fema… femin…
8 Dormé 165 NA brown light brown NA fema… femin…
9 R4-P17 96 NA none silver, r… red, blue NA none femin…
10 Raymus … 188 79 brown light brown NA male mascu…
11 Poe Dam… NA NA brown light brown NA male mascu…
12 BB8 NA NA none none black NA none mascu…
13 Padmé A… 165 45 brown light brown 46 fema… femin…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
In Ibis, this would be:
filter(
starwars.== "light") & (_.eye_color == "brown")) |
((_.skin_color == "Droid")
(_.species )
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━┓
┃ name ┃ height ┃ mass ┃ hair_color ┃ skin_color ┃ … ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━┩
│ string │ int64 │ float64 │ string │ string │ … │
├───────────────────┼────────┼─────────┼────────────┼─────────────┼───┤
│ C-3PO │ 167 │ 75.0 │ NULL │ gold │ … │
│ R2-D2 │ 96 │ 32.0 │ NULL │ white, blue │ … │
│ R5-D4 │ 97 │ 32.0 │ NULL │ white, red │ … │
│ IG-88 │ 200 │ 140.0 │ none │ metal │ … │
│ Leia Organa │ 150 │ 49.0 │ brown │ light │ … │
│ Biggs Darklighter │ 183 │ 84.0 │ black │ light │ … │
│ Cordé │ 157 │ nan │ brown │ light │ … │
│ Dormé │ 165 │ nan │ brown │ light │ … │
│ R4-P17 │ 96 │ nan │ none │ silver, red │ … │
│ BB8 │ NULL │ nan │ none │ none │ … │
│ … │ … │ … │ … │ … │ … │
└───────────────────┴────────┴─────────┴────────────┴─────────────┴───┘
Sorting your data with order_by()
To sort a column, dplyr has the verb arrange
. For instance, to sort the column height
using dplyr:
|>
starwars arrange(height)
# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Yoda 66 17 white green brown 896 male mascu…
2 Ratts T… 79 15 none grey, blue unknown NA male mascu…
3 Wicket … 88 20 brown brown brown 8 male mascu…
4 Dud Bolt 94 45 none blue, grey yellow NA male mascu…
5 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
6 R4-P17 96 NA none silver, r… red, blue NA none femin…
7 R5-D4 97 32 <NA> white, red red NA none mascu…
8 Sebulba 112 40 none grey, red orange NA male mascu…
9 Gasgano 122 NA none white, bl… black NA male mascu…
10 Watto 137 NA black blue, grey yellow NA male mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Ibis has the order_by
method, so to perform the same operation:
starwars.order_by(_.height)
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━┓
┃ name ┃ height ┃ mass ┃ hair_color ┃ skin_color ┃ … ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━┩
│ string │ int64 │ float64 │ string │ string │ … │
├───────────────────────┼────────┼─────────┼────────────┼─────────────┼───┤
│ Yoda │ 66 │ 17.0 │ white │ green │ … │
│ Ratts Tyerell │ 79 │ 15.0 │ none │ grey, blue │ … │
│ Wicket Systri Warrick │ 88 │ 20.0 │ brown │ brown │ … │
│ Dud Bolt │ 94 │ 45.0 │ none │ blue, grey │ … │
│ R2-D2 │ 96 │ 32.0 │ NULL │ white, blue │ … │
│ R4-P17 │ 96 │ nan │ none │ silver, red │ … │
│ R5-D4 │ 97 │ 32.0 │ NULL │ white, red │ … │
│ Sebulba │ 112 │ 40.0 │ none │ grey, red │ … │
│ Gasgano │ 122 │ nan │ none │ white, blue │ … │
│ Watto │ 137 │ nan │ black │ blue, grey │ … │
│ … │ … │ … │ … │ … │ … │
└───────────────────────┴────────┴─────────┴────────────┴─────────────┴───┘
You might notice that while dplyr puts missing values at the end, Ibis places them at the top. This behavior can actually vary from backend to backend and is something to be aware of when using Ibis.
If you want to order using multiple variables, you can pass them as a list:
# or starwars.order_by(["height", "mass"]) starwars.order_by([_.height, _.mass])
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━┓
┃ name ┃ height ┃ mass ┃ hair_color ┃ skin_color ┃ … ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━┩
│ string │ int64 │ float64 │ string │ string │ … │
├───────────────────────┼────────┼─────────┼────────────┼─────────────┼───┤
│ Yoda │ 66 │ 17.0 │ white │ green │ … │
│ Ratts Tyerell │ 79 │ 15.0 │ none │ grey, blue │ … │
│ Wicket Systri Warrick │ 88 │ 20.0 │ brown │ brown │ … │
│ Dud Bolt │ 94 │ 45.0 │ none │ blue, grey │ … │
│ R2-D2 │ 96 │ 32.0 │ NULL │ white, blue │ … │
│ R4-P17 │ 96 │ nan │ none │ silver, red │ … │
│ R5-D4 │ 97 │ 32.0 │ NULL │ white, red │ … │
│ Sebulba │ 112 │ 40.0 │ none │ grey, red │ … │
│ Gasgano │ 122 │ nan │ none │ white, blue │ … │
│ Watto │ 137 │ nan │ black │ blue, grey │ … │
│ … │ … │ … │ … │ … │ … │
└───────────────────────┴────────┴─────────┴────────────┴─────────────┴───┘
To order a column in descending order, there are two ways to do it. Note that missing values remain at the top.
# or: starwars.order_by(ibis.desc("height")) starwars.order_by(_.height.desc())
┏━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━┓
┃ name ┃ height ┃ mass ┃ hair_color ┃ skin_color ┃ … ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━┩
│ string │ int64 │ float64 │ string │ string │ … │
├──────────────┼────────┼─────────┼────────────┼──────────────┼───┤
│ Yarael Poof │ 264 │ nan │ none │ white │ … │
│ Tarfful │ 234 │ 136.0 │ brown │ brown │ … │
│ Lama Su │ 229 │ 88.0 │ none │ grey │ … │
│ Chewbacca │ 228 │ 112.0 │ brown │ unknown │ … │
│ Roos Tarpals │ 224 │ 82.0 │ none │ grey │ … │
│ Grievous │ 216 │ 159.0 │ none │ brown, white │ … │
│ Taun We │ 213 │ nan │ none │ grey │ … │
│ Rugor Nass │ 206 │ nan │ none │ green │ … │
│ Tion Medon │ 206 │ 80.0 │ none │ grey │ … │
│ Darth Vader │ 202 │ 136.0 │ none │ white │ … │
│ … │ … │ … │ … │ … │ … │
└──────────────┴────────┴─────────┴────────────┴──────────────┴───┘
Selecting columns with select()
Ibis, like dplyr, has a select
method to include or exclude columns:
With dplyr:
|>
starwars select(hair_color)
# A tibble: 87 × 1
hair_color
<chr>
1 blond
2 <NA>
3 <NA>
4 none
5 brown
6 brown, grey
7 brown
8 <NA>
9 black
10 auburn, white
# ℹ 77 more rows
In Ibis:
starwars.select(_.hair_color)
┏━━━━━━━━━━━━━━━┓
┃ hair_color ┃
┡━━━━━━━━━━━━━━━┩
│ string │
├───────────────┤
│ blond │
│ NULL │
│ NULL │
│ none │
│ brown │
│ brown, grey │
│ brown │
│ NULL │
│ black │
│ auburn, white │
│ … │
└───────────────┘
Note: A common pitfall to be aware of when referencing column names in Ibis is when column names collide with built-in methods on the Ibis Table object, such as count
. In this situation, you will have to reference count
like table["count"]
or _["count"]
.
dplyr also allows selecting more than one column at a time:
|>
starwars select(hair_color, skin_color, eye_color)
# A tibble: 87 × 3
hair_color skin_color eye_color
<chr> <chr> <chr>
1 blond fair blue
2 <NA> gold yellow
3 <NA> white, blue red
4 none white yellow
5 brown light brown
6 brown, grey light blue
7 brown light blue
8 <NA> white, red red
9 black light brown
10 auburn, white fair blue-gray
# ℹ 77 more rows
In Ibis, we can either quote the names:
"hair_color", "skin_color", "eye_color") starwars.select(
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ hair_color ┃ skin_color ┃ eye_color ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ string │ string │ string │
├───────────────┼─────────────┼───────────┤
│ blond │ fair │ blue │
│ NULL │ gold │ yellow │
│ NULL │ white, blue │ red │
│ none │ white │ yellow │
│ brown │ light │ brown │
│ brown, grey │ light │ blue │
│ brown │ light │ blue │
│ NULL │ white, red │ red │
│ black │ light │ brown │
│ auburn, white │ fair │ blue-gray │
│ … │ … │ … │
└───────────────┴─────────────┴───────────┘
Or use the _
helper:
starwars.select(_.hair_color, _.skin_color, _.eye_color)
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ hair_color ┃ skin_color ┃ eye_color ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ string │ string │ string │
├───────────────┼─────────────┼───────────┤
│ blond │ fair │ blue │
│ NULL │ gold │ yellow │
│ NULL │ white, blue │ red │
│ none │ white │ yellow │
│ brown │ light │ brown │
│ brown, grey │ light │ blue │
│ brown │ light │ blue │
│ NULL │ white, red │ red │
│ black │ light │ brown │
│ auburn, white │ fair │ blue-gray │
│ … │ … │ … │
└───────────────┴─────────────┴───────────┘
To select columns by name based on a condition, dplyr has helpers such as:
starts_with()
: Starts with a prefix.ends_with()
: Ends with a suffix.contains()
: Contains a literal string.
These and many more selectors are available in Ibis as well, with slightly different names:
"h")) starwars.select(s.startswith(
┏━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ height ┃ hair_color ┃ homeworld ┃
┡━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ int64 │ string │ string │
├────────┼───────────────┼───────────┤
│ 172 │ blond │ Tatooine │
│ 167 │ NULL │ Tatooine │
│ 96 │ NULL │ Naboo │
│ 202 │ none │ Tatooine │
│ 150 │ brown │ Alderaan │
│ 178 │ brown, grey │ Tatooine │
│ 165 │ brown │ Tatooine │
│ 97 │ NULL │ Tatooine │
│ 183 │ black │ Tatooine │
│ 182 │ auburn, white │ Stewjon │
│ … │ … │ … │
└────────┴───────────────┴───────────┘
"color")) starwars.select(s.endswith(
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ hair_color ┃ skin_color ┃ eye_color ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ string │ string │ string │
├───────────────┼─────────────┼───────────┤
│ blond │ fair │ blue │
│ NULL │ gold │ yellow │
│ NULL │ white, blue │ red │
│ none │ white │ yellow │
│ brown │ light │ brown │
│ brown, grey │ light │ blue │
│ brown │ light │ blue │
│ NULL │ white, red │ red │
│ black │ light │ brown │
│ auburn, white │ fair │ blue-gray │
│ … │ … │ … │
└───────────────┴─────────────┴───────────┘
"world")) starwars.select(s.contains(
┏━━━━━━━━━━━┓
┃ homeworld ┃
┡━━━━━━━━━━━┩
│ string │
├───────────┤
│ Tatooine │
│ Tatooine │
│ Naboo │
│ Tatooine │
│ Alderaan │
│ Tatooine │
│ Tatooine │
│ Tatooine │
│ Tatooine │
│ Stewjon │
│ … │
└───────────┘
See the Ibis Column Selectors documentation for the full list of selectors in Ibis.
Renaming columns with rename()
Ibis allows you to rename columns using rename()
which provides similar functionality to rename()
in dplyr.
In dplyr:
|>
starwars rename("home_world" = "homeworld")
# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 5 more variables: home_world <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
In Ibis, use rename
and pass a dict
of name mappings:
="homeworld") starwars.rename(home_world
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━┓
┃ name ┃ height ┃ mass ┃ hair_color ┃ skin_color ┃ … ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━┩
│ string │ int64 │ float64 │ string │ string │ … │
├────────────────────┼────────┼─────────┼───────────────┼─────────────┼───┤
│ Luke Skywalker │ 172 │ 77.0 │ blond │ fair │ … │
│ C-3PO │ 167 │ 75.0 │ NULL │ gold │ … │
│ R2-D2 │ 96 │ 32.0 │ NULL │ white, blue │ … │
│ Darth Vader │ 202 │ 136.0 │ none │ white │ … │
│ Leia Organa │ 150 │ 49.0 │ brown │ light │ … │
│ Owen Lars │ 178 │ 120.0 │ brown, grey │ light │ … │
│ Beru Whitesun lars │ 165 │ 75.0 │ brown │ light │ … │
│ R5-D4 │ 97 │ 32.0 │ NULL │ white, red │ … │
│ Biggs Darklighter │ 183 │ 84.0 │ black │ light │ … │
│ Obi-Wan Kenobi │ 182 │ 77.0 │ auburn, white │ fair │ … │
│ … │ … │ … │ … │ … │ … │
└────────────────────┴────────┴─────────┴───────────────┴─────────────┴───┘
Add new columns with mutate()
Ibis, like dplyr, uses the mutate
verb to add columns.
In dplyr,
|>
starwars mutate(height_m = height / 100) |>
select(name, height_m)
# A tibble: 87 × 2
name height_m
<chr> <dbl>
1 Luke Skywalker 1.72
2 C-3PO 1.67
3 R2-D2 0.96
4 Darth Vader 2.02
5 Leia Organa 1.5
6 Owen Lars 1.78
7 Beru Whitesun lars 1.65
8 R5-D4 0.97
9 Biggs Darklighter 1.83
10 Obi-Wan Kenobi 1.82
# ℹ 77 more rows
In Ibis:
(
starwars= _.height / 100)
.mutate(height_m "name", "height_m")
.select( )
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ name ┃ height_m ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ string │ float64 │
├────────────────────┼──────────┤
│ Luke Skywalker │ 1.72 │
│ C-3PO │ 1.67 │
│ R2-D2 │ 0.96 │
│ Darth Vader │ 2.02 │
│ Leia Organa │ 1.50 │
│ Owen Lars │ 1.78 │
│ Beru Whitesun lars │ 1.65 │
│ R5-D4 │ 0.97 │
│ Biggs Darklighter │ 1.83 │
│ Obi-Wan Kenobi │ 1.82 │
│ … │ … │
└────────────────────┴──────────┘
A big difference between dplyr’s mutate
and Ibis’ mutate
is that, in Ibis, you have to chain separate mutate
calls together when you reference newly-created columns in the same mutate
whereas in dplyr, you can put them all in the same call. This makes Ibis’ mutate
more similar to transform
in base R.
In dplyr, we only need one mutate
call:
%>%
starwars mutate(
height_m = height / 100,
BMI = mass / (height_m^2)
%>%
) select(BMI, everything())
# A tibble: 87 × 16
BMI name height mass hair_color skin_color eye_color birth_year sex
<dbl> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 26.0 Luke Sky… 172 77 blond fair blue 19 male
2 26.9 C-3PO 167 75 <NA> gold yellow 112 none
3 34.7 R2-D2 96 32 <NA> white, bl… red 33 none
4 33.3 Darth Va… 202 136 none white yellow 41.9 male
5 21.8 Leia Org… 150 49 brown light brown 19 fema…
6 37.9 Owen Lars 178 120 brown, gr… light blue 52 male
7 27.5 Beru Whi… 165 75 brown light blue 47 fema…
8 34.0 R5-D4 97 32 <NA> white, red red NA none
9 25.1 Biggs Da… 183 84 black light brown 24 male
10 23.2 Obi-Wan … 182 77 auburn, w… fair blue-gray 57 male
# ℹ 77 more rows
# ℹ 7 more variables: gender <chr>, homeworld <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>, height_m <dbl>
In Ibis, for BMI
to reference height_m
, it needs to be in a separate mutate
call:
(starwars
.mutate(= _.height / 100
height_m
)
.mutate(= _.mass / (_.height_m**2)
BMI
)"BMI", ~s.matches("BMI"))
.select( )
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━┓
┃ BMI ┃ name ┃ height ┃ mass ┃ hair_color ┃ … ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━┩
│ float64 │ string │ int64 │ float64 │ string │ … │
├───────────┼────────────────────┼────────┼─────────┼───────────────┼───┤
│ 26.027582 │ Luke Skywalker │ 172 │ 77.0 │ blond │ … │
│ 26.892323 │ C-3PO │ 167 │ 75.0 │ NULL │ … │
│ 34.722222 │ R2-D2 │ 96 │ 32.0 │ NULL │ … │
│ 33.330066 │ Darth Vader │ 202 │ 136.0 │ none │ … │
│ 21.777778 │ Leia Organa │ 150 │ 49.0 │ brown │ … │
│ 37.874006 │ Owen Lars │ 178 │ 120.0 │ brown, grey │ … │
│ 27.548209 │ Beru Whitesun lars │ 165 │ 75.0 │ brown │ … │
│ 34.009990 │ R5-D4 │ 97 │ 32.0 │ NULL │ … │
│ 25.082863 │ Biggs Darklighter │ 183 │ 84.0 │ black │ … │
│ 23.245985 │ Obi-Wan Kenobi │ 182 │ 77.0 │ auburn, white │ … │
│ … │ … │ … │ … │ … │ … │
└───────────┴────────────────────┴────────┴─────────┴───────────────┴───┘
Summarize values with aggregate()
To summarize tables, dplyr has the verbs summarise
/summarize
:
In dplyr:
%>%
starwars summarise(height = mean(height, na.rm = TRUE))
# A tibble: 1 × 1
height
<dbl>
1 174.
In Ibis, the corresponding verb is aggregate
:
= _.height.mean()) starwars.aggregate(height
┏━━━━━━━━━━━━┓
┃ height ┃
┡━━━━━━━━━━━━┩
│ float64 │
├────────────┤
│ 174.358025 │
└────────────┘
Note: Throughout this guide, where dplyr uses R generics, Ibis uses Python methods. In the previous code cell, aggregate
is a method on a table and mean
is a method on a column. If you want to perform aggregations on multiple columns, you can call the method that you want on the column you want to apply it to.
Join tables with left_join()
To demonstrate how to do joins with Ibis, we’ll load two more example datasets that also come from the example datasets included in dplyr:
= ibis.examples.band_members.fetch()
band_members = ibis.examples.band_instruments.fetch() band_instruments
In dplyr, we can perform a left join of these two tables like:
|>
band_members left_join(band_instruments)
# A tibble: 3 × 3
name band plays
<chr> <chr> <chr>
1 Mick Stones <NA>
2 John Beatles guitar
3 Paul Beatles bass
In Ibis:
"name") band_members.left_join(band_instruments,
┏━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┓
┃ name ┃ band ┃ name_right ┃ plays ┃
┡━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━┩
│ string │ string │ string │ string │
├────────┼─────────┼────────────┼────────┤
│ John │ Beatles │ John │ guitar │
│ Paul │ Beatles │ Paul │ bass │
│ Mick │ Stones │ NULL │ NULL │
└────────┴─────────┴────────────┴────────┘
There are two main differences between Ibis and dplyr here:
- Ibis requires us to explicitly specify our join key (“name”, in this example) whereas in dplyr, if the join key is missing, we get the natural join of the two tables which joins across all shared column names
- Ibis keeps columns for join keys from each table whereas dplyr does not by default
To replicate the result we’d get by default in dplyr but using Ibis, we need to incorporate two other verbs we’ve already seen in this tutorial:
(
band_members"name")
.left_join(band_instruments, ~s.contains("_right"))
.select( )
┏━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
┃ name ┃ band ┃ plays ┃
┡━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
│ string │ string │ string │
├────────┼─────────┼────────┤
│ John │ Beatles │ guitar │
│ Paul │ Beatles │ bass │
│ Mick │ Stones │ NULL │
└────────┴─────────┴────────┘
Pivot data with pivot_wider()
and pivot_longer()
dplyr users are likely to be familiar with the pivot_wider
and pivot_longer
functions from the tidyr package which convert tables between wide and long formats, respectively.
pivot_longer
in dplyr + tidyr:
library(tidyr)
<-
starwars_colors |>
starwars select(name, matches("color")) |>
pivot_longer(matches("color"), names_to = "attribute", values_to = "color")
In Ibis:
= (
starwars_colors
starwars"name", s.matches("color"))
.select("color"), names_to="attribute", values_to="color")
.pivot_longer(s.matches(
)
starwars_colors
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ name ┃ attribute ┃ color ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ string │ string │ string │
├────────────────┼────────────┼─────────────┤
│ Luke Skywalker │ hair_color │ blond │
│ Luke Skywalker │ skin_color │ fair │
│ Luke Skywalker │ eye_color │ blue │
│ C-3PO │ hair_color │ NULL │
│ C-3PO │ skin_color │ gold │
│ C-3PO │ eye_color │ yellow │
│ R2-D2 │ hair_color │ NULL │
│ R2-D2 │ skin_color │ white, blue │
│ R2-D2 │ eye_color │ red │
│ Darth Vader │ hair_color │ none │
│ … │ … │ … │
└────────────────┴────────────┴─────────────┘
And pivot_wider
:
|>
starwars_colors pivot_wider(names_from = "attribute", values_from = "color")
# A tibble: 87 × 4
name hair_color skin_color eye_color
<chr> <chr> <chr> <chr>
1 Luke Skywalker blond fair blue
2 C-3PO <NA> gold yellow
3 R2-D2 <NA> white, blue red
4 Darth Vader none white yellow
5 Leia Organa brown light brown
6 Owen Lars brown, grey light blue
7 Beru Whitesun lars brown light blue
8 R5-D4 <NA> white, red red
9 Biggs Darklighter black light brown
10 Obi-Wan Kenobi auburn, white fair blue-gray
# ℹ 77 more rows
In Ibis:
(
starwars_colors.="attribute", values_from="color")
pivot_wider(names_from )
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ name ┃ hair_color ┃ skin_color ┃ eye_color ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ string │ string │ string │ string │
├────────────────────┼───────────────┼─────────────┼───────────┤
│ Luke Skywalker │ blond │ fair │ blue │
│ C-3PO │ NULL │ gold │ yellow │
│ R2-D2 │ NULL │ white, blue │ red │
│ Darth Vader │ none │ white │ yellow │
│ Leia Organa │ brown │ light │ brown │
│ Owen Lars │ brown, grey │ light │ blue │
│ Beru Whitesun lars │ brown │ light │ blue │
│ R5-D4 │ NULL │ white, red │ red │
│ Biggs Darklighter │ black │ light │ brown │
│ Obi-Wan Kenobi │ auburn, white │ fair │ blue-gray │
│ … │ … │ … │ … │
└────────────────────┴───────────────┴─────────────┴───────────┘
Next Steps
Now that you’ve gotten an introduction to the common differences between dplyr and Ibis, head over to Getting started with ibis for a full introduction. If you’re familiar with SQL, check out Ibis for SQL users. If you’re familiar with pandas, take a look at Ibis for pandas users