Tutorial: Ibis for pandas users

Prerequisites

We recommend starting with the default (DuckDB) backend for a performant, fully-featured local experience. You can install Ibis with pip, conda, mamba, or pixi.

pip install 'ibis-framework[duckdb,examples]'
Warning

Note that the ibis-framework package is not the same as the ibis package in PyPI. These two libraries cannot coexist in the same Python environment, as they are both imported with the ibis module name.

conda install -c conda-forge ibis-duckdb
mamba install -c conda-forge ibis-duckdb
pixi add ibis-duckdb

Overview

Much of the syntax and many of the operations in Ibis are inspired by the pandas DataFrame but the primary domain of Ibis is SQL so there are some differences in how they operate.

For one thing, SQL (and therefore Ibis) makes no guarantees about row order, which is a key assumption that numpy/pandas users are used to. So two columns can’t be deterministically “lined up” unless they are actually part of the same Table. An outcome of this is that you can’t index into Columns by position (column.head(5) or column[3:5]). You can only index into Tables (table.head(5) or table[3:5]). So if you want the first 5 rows of a column, you have to do table.head(5).my_column, table.my_column.head(5) will not work.

Another difference between Ibis tables and pandas DataFrames are that many of the pandas DataFrame operations do in-place operations (they are “mutable”), whereas Ibis table operations always return a new table expression (“immutable”).

Finally, Ibis table expressions are lazy, meaning that as you build up a table expression, no computation is actually performed until you call an action method such as to_pandas. Only then does Ibis compile the table expression into SQL and send it to the backend. (Note that we’ll be using Ibis’ interactive mode to automatically execute queries at the end of each cell in this notebook. If you are using similar code in a program, you will have to add .to_pandas() to each operation that you want to evaluate.)

import ibis
import pandas as pd

ibis.options.interactive = True

We’ll be using the DuckDB backend in Ibis in the examples below. First we’ll create a simple DataFrame.

df = pd.DataFrame(
    [["a", 1, 2], ["b", 3, 4]],
    columns=["one", "two", "three"],
    index=[5, 6],
)
df
one two three
5 a 1 2
6 b 3 4

Now we can create an Ibis table from the above DataFrame.

Note that the index from the pandas DataFrame is dropped. Ibis has no notion of an index: If you want to use the index, you will need to turn it into a column.

t = ibis.memtable(df, name="t")
t
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ one     two    three ┃
┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ stringint64int64 │
├────────┼───────┼───────┤
│ a     12 │
│ b     34 │
└────────┴───────┴───────┘

Data types

The data types of columns in pandas are accessed using the dtypes attribute. This returns a Series object.

df.dtypes
one      object
two       int64
three     int64
dtype: object

In Ibis, you use the schema method which returns an ibis.Schema object.

t.schema()
ibis.Schema {
  one    string
  two    int64
  three  int64
}

It is possible to convert the schema information to pandas data types using the to_pandas method, if needed.

t.schema().to_pandas()
[('one', dtype('O')), ('two', dtype('int64')), ('three', dtype('int64'))]

Table layout

In pandas, the layout of the table is contained in the shape attribute which contains the number of rows and number of columns in a tuple. The number of columns in an Ibis table can be gotten from the length of the schema.

len(t.schema())
3

To get the number of rows of a table, you use the count method.

t.count()

2

To mimic pandas’ behavior, you would use the following code. Note that you need to use the to_pandas method after count to evaluate the expression returned by count.

(t.count().to_pandas(), len(t.schema()))
(2, 3)
df.shape
(2, 3)

Subsetting columns

Selecting columns is very similar to in pandas. In fact, you can use the same syntax.

t[["one", "two"]]
┏━━━━━━━━┳━━━━━━━┓
┃ one     two   ┃
┡━━━━━━━━╇━━━━━━━┩
│ stringint64 │
├────────┼───────┤
│ a     1 │
│ b     3 │
└────────┴───────┘

However, since row-level indexing is not supported in Ibis, the inner list is not necessary.

t["one", "two"]
┏━━━━━━━━┳━━━━━━━┓
┃ one     two   ┃
┡━━━━━━━━╇━━━━━━━┩
│ stringint64 │
├────────┼───────┤
│ a     1 │
│ b     3 │
└────────┴───────┘

Selecting columns

Selecting columns is done using the same syntax as in pandas DataFrames. You can use either the indexing syntax or attribute syntax.

t["one"]
┏━━━━━━━━┓
┃ one    ┃
┡━━━━━━━━┩
│ string │
├────────┤
│ a      │
│ b      │
└────────┘

or:

t.one
┏━━━━━━━━┓
┃ one    ┃
┡━━━━━━━━┩
│ string │
├────────┤
│ a      │
│ b      │
└────────┘

Adding, removing, and modifying columns

Modifying the columns of an Ibis table is a bit different than doing the same operations in a pandas DataFrame. This is primarily due to the fact that in-place operations are not supported on Ibis tables. Each time you do a column modification to a table, a new table expression is returned.

Adding columns

Adding columns is done through the mutate method.

mutated = t.mutate(new_col=t.three * 2)
mutated
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━┓
┃ one     two    three  new_col ┃
┡━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━┩
│ stringint64int64int64   │
├────────┼───────┼───────┼─────────┤
│ a     124 │
│ b     348 │
└────────┴───────┴───────┴─────────┘

Notice that the original table object remains unchanged. Only the mutated object that was returned contains the new column.

t
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ one     two    three ┃
┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ stringint64int64 │
├────────┼───────┼───────┤
│ a     12 │
│ b     34 │
└────────┴───────┴───────┘

It is also possible to create a column in isolation. This is similar to a Series in pandas. Note that the name of the column by default is a representation of the expression:

unnamed = t.three * 2
unnamed
┏━━━━━━━━━━━━━━━━━━━━┓
┃ Multiply(three, 2) ┃
┡━━━━━━━━━━━━━━━━━━━━┩
│ int64              │
├────────────────────┤
│                  4 │
│                  8 │
└────────────────────┘

To get a version with a specific name, you can use the name method:

new_col = unnamed.name("new_col")
new_col
┏━━━━━━━━━┓
┃ new_col ┃
┡━━━━━━━━━┩
│ int64   │
├─────────┤
│       4 │
│       8 │
└─────────┘

You can then add this column to the table using a projection.

proj = t["one", "two", new_col]
proj
┏━━━━━━━━┳━━━━━━━┳━━━━━━━━━┓
┃ one     two    new_col ┃
┡━━━━━━━━╇━━━━━━━╇━━━━━━━━━┩
│ stringint64int64   │
├────────┼───────┼─────────┤
│ a     14 │
│ b     38 │
└────────┴───────┴─────────┘

Removing columns

Removing a column is done using the drop method.

t.columns
['one', 'two', 'three']
subset = t.drop("one", "two")
subset.columns
['three']

It is also possible to drop columns by selecting the columns you want to remain.

subset = t["two", "three"]
subset.columns
['two', 'three']

Modifying columns

Replacing existing columns is done using the mutate method just like adding columns. You add a column of the same name to replace it.

t
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ one     two    three ┃
┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ stringint64int64 │
├────────┼───────┼───────┤
│ a     12 │
│ b     34 │
└────────┴───────┴───────┘
mutated = t.mutate(two=t.two * 2)
mutated
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ one     two    three ┃
┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ stringint64int64 │
├────────┼───────┼───────┤
│ a     22 │
│ b     64 │
└────────┴───────┴───────┘

Renaming columns

In addition to replacing columns, you can rename them as well. This is done with the rename method which takes a dictionary containing the name mappings.

renamed = t.rename(
    dict(
        a="one",
        b="two",
    )
)
renamed
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ a       b      three ┃
┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ stringint64int64 │
├────────┼───────┼───────┤
│ a     12 │
│ b     34 │
└────────┴───────┴───────┘

Selecting rows

There are several methods that can be used to select rows of data in various ways. These are described in the sections below. We’ll use the Palmer Penguins\(^1\) dataset to investigate! Ibis has several built-in example datasets that you can access using the ibis.examples module.

\(^1\): Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.

penguins = ibis.examples.penguins.fetch()
penguins
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie Torgersen39.118.71813750male  2007 │
│ Adelie Torgersen39.517.41863800female2007 │
│ Adelie Torgersen40.318.01953250female2007 │
│ Adelie TorgersenNULLNULLNULLNULLNULL2007 │
│ Adelie Torgersen36.719.31933450female2007 │
│ Adelie Torgersen39.320.61903650male  2007 │
│ Adelie Torgersen38.917.81813625female2007 │
│ Adelie Torgersen39.219.61954675male  2007 │
│ Adelie Torgersen34.118.11933475NULL2007 │
│ Adelie Torgersen42.020.21904250NULL2007 │
│  │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

Head, tail and limit

The head method works the same ways as in pandas. Note that some Ibis backends may not have an inherent ordering of their rows and using head may not return deterministic results. In those cases, you can use sorting before calling head to ensure a stable result.

penguins.head(5)
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie Torgersen39.118.71813750male  2007 │
│ Adelie Torgersen39.517.41863800female2007 │
│ Adelie Torgersen40.318.01953250female2007 │
│ Adelie TorgersenNULLNULLNULLNULLNULL2007 │
│ Adelie Torgersen36.719.31933450female2007 │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

However, the tail method is not implemented since it is not supported in all databases. It is possible to emulate the tail method if you use sorting in your table to do a reverse sort then use the head method to retrieve the “top” rows.

Another way to limit the number of retrieved rows is using the limit method. The following will return the same result as head(5). This is often used in conjunction with other filtering techniques that we will cover later.

penguins.limit(5)
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie Torgersen39.118.71813750male  2007 │
│ Adelie Torgersen39.517.41863800female2007 │
│ Adelie Torgersen40.318.01953250female2007 │
│ Adelie TorgersenNULLNULLNULLNULLNULL2007 │
│ Adelie Torgersen36.719.31933450female2007 │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

Filtering rows

In addition to limiting the number of rows that are returned, it is possible to filter the rows using expressions. Expressions are constructed very similarly to the way they are in pandas. Ibis expressions are constructed from operations on columns in a table which return a boolean result. This result is then used to filter the table.

expr = penguins.bill_length_mm > 37.0
expr
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Greater(bill_length_mm, 37.0) ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ boolean                       │
├───────────────────────────────┤
│ True                          │
│ True                          │
│ True                          │
│ NULL                          │
│ False                         │
│ True                          │
│ True                          │
│ True                          │
│ False                         │
│ True                          │
│                              │
└───────────────────────────────┘

We can evaluate the value counts to see how many rows we will expect to get back after filtering.

expr.value_counts()
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Greater(bill_length_mm, 37.0)  Greater(bill_length_mm, 37.0)_count ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ booleanint64                               │
├───────────────────────────────┼─────────────────────────────────────┤
│ False                         │                                  42 │
│ NULL2 │
│ True                          │                                 300 │
└───────────────────────────────┴─────────────────────────────────────┘

Now we apply the filter to the table. Since there are 6 True values in the expression, we should get 6 rows back.

filtered = penguins[expr]
filtered
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie Torgersen39.118.71813750male  2007 │
│ Adelie Torgersen39.517.41863800female2007 │
│ Adelie Torgersen40.318.01953250female2007 │
│ Adelie Torgersen39.320.61903650male  2007 │
│ Adelie Torgersen38.917.81813625female2007 │
│ Adelie Torgersen39.219.61954675male  2007 │
│ Adelie Torgersen42.020.21904250NULL2007 │
│ Adelie Torgersen37.817.11863300NULL2007 │
│ Adelie Torgersen37.817.31803700NULL2007 │
│ Adelie Torgersen41.117.61823200female2007 │
│  │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

Of course, the filtering expression can be applied inline as well.

filtered = penguins[penguins.bill_length_mm > 37.0]
filtered
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie Torgersen39.118.71813750male  2007 │
│ Adelie Torgersen39.517.41863800female2007 │
│ Adelie Torgersen40.318.01953250female2007 │
│ Adelie Torgersen39.320.61903650male  2007 │
│ Adelie Torgersen38.917.81813625female2007 │
│ Adelie Torgersen39.219.61954675male  2007 │
│ Adelie Torgersen42.020.21904250NULL2007 │
│ Adelie Torgersen37.817.11863300NULL2007 │
│ Adelie Torgersen37.817.31803700NULL2007 │
│ Adelie Torgersen41.117.61823200female2007 │
│  │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

Multiple filtering expressions can be combined into a single expression or chained onto existing table expressions.

filtered = penguins[(penguins.bill_length_mm > 37.0) & (penguins.bill_depth_mm > 18.0)]
filtered
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie Torgersen39.118.71813750male  2007 │
│ Adelie Torgersen39.320.61903650male  2007 │
│ Adelie Torgersen39.219.61954675male  2007 │
│ Adelie Torgersen42.020.21904250NULL2007 │
│ Adelie Torgersen38.621.21913800male  2007 │
│ Adelie Torgersen38.719.01953450female2007 │
│ Adelie Torgersen42.520.71974500male  2007 │
│ Adelie Torgersen46.021.51944200male  2007 │
│ Adelie Biscoe   37.818.31743400female2007 │
│ Adelie Biscoe   37.718.71803600male  2007 │
│  │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

The code above will return the same rows as the code below.

filtered = penguins[penguins.bill_length_mm > 37.0][penguins.bill_depth_mm > 18.0]
filtered
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie Torgersen39.118.71813750male  2007 │
│ Adelie Torgersen39.320.61903650male  2007 │
│ Adelie Torgersen39.219.61954675male  2007 │
│ Adelie Torgersen42.020.21904250NULL2007 │
│ Adelie Torgersen38.621.21913800male  2007 │
│ Adelie Torgersen38.719.01953450female2007 │
│ Adelie Torgersen42.520.71974500male  2007 │
│ Adelie Torgersen46.021.51944200male  2007 │
│ Adelie Biscoe   37.818.31743400female2007 │
│ Adelie Biscoe   37.718.71803600male  2007 │
│  │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

Aggregation has not been discussed yet, but aggregate values can be used in expressions to return things such as all of the rows in a data set where the value in a column is greater than the mean.

filtered = penguins[penguins.bill_length_mm > penguins.bill_length_mm.mean()]
filtered
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie Torgersen46.021.51944200male  2007 │
│ Adelie Dream    44.119.71964400male  2007 │
│ Adelie Torgersen45.818.91974150male  2008 │
│ Adelie Biscoe   45.620.31914600male  2009 │
│ Adelie Torgersen44.118.02104000male  2009 │
│ Gentoo Biscoe   46.113.22114500female2007 │
│ Gentoo Biscoe   50.016.32305700male  2007 │
│ Gentoo Biscoe   48.714.12104450female2007 │
│ Gentoo Biscoe   50.015.22185700male  2007 │
│ Gentoo Biscoe   47.614.52155400male  2007 │
│  │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

Modifying rows

Sometimes you want to modify the values in a column based on some condition. In pandas, you would do something like df.loc[condition] = new_value. In Ibis though, remember that all expressions are immutable, so you need to create a new table expression with the modified values. You do this using the ifelse method on boolean columns:

long_billed_penguins = penguins.bill_length_mm > 37.0
species_modified = long_billed_penguins.ifelse("wide", penguins.species)
penguins.mutate(species_modified=species_modified)
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year   species_modified ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64string           │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┼──────────────────┤
│ Adelie Torgersen39.118.71813750male  2007wide             │
│ Adelie Torgersen39.517.41863800female2007wide             │
│ Adelie Torgersen40.318.01953250female2007wide             │
│ Adelie TorgersenNULLNULLNULLNULLNULL2007Adelie           │
│ Adelie Torgersen36.719.31933450female2007Adelie           │
│ Adelie Torgersen39.320.61903650male  2007wide             │
│ Adelie Torgersen38.917.81813625female2007wide             │
│ Adelie Torgersen39.219.61954675male  2007wide             │
│ Adelie Torgersen34.118.11933475NULL2007Adelie           │
│ Adelie Torgersen42.020.21904250NULL2007wide             │
│                 │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┴──────────────────┘

Sorting rows

Sorting rows in Ibis uses a somewhat different API than in pandas. In pandas, you would use the sort_values method to order rows by values in specified columns. Ibis uses a method called order_by. To specify ascending or descending orders, pandas uses an ascending= argument to sort_values that indicates the order for each sorting column. Ibis allows you to tag the column name in the order_by list as ascending or descending by wrapping it with ibis.asc or ibis.desc.

First, let’s ask Ibis for a pandas DataFrame version of the penguin data:

df = penguins.to_pandas()

Here is an example of sorting a DataFrame using two sort keys. One key is sorting in ascending order and the other is in descending order.

df.sort_values(
    ["bill_length_mm", "bill_depth_mm"], ascending=[True, False], na_position="first"
).head(5)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
3 Adelie Torgersen NaN NaN NaN NaN None 2007
271 Gentoo Biscoe NaN NaN NaN NaN None 2009
142 Adelie Dream 32.1 15.5 188.0 3050.0 female 2009
98 Adelie Dream 33.1 16.1 178.0 2900.0 female 2008
70 Adelie Torgersen 33.5 19.0 190.0 3600.0 female 2008

The same operation in Ibis would look like the following. Note that the index values of the resulting DataFrame start from zero and count up, whereas in the example above, they retain their original index value. This is because rows in tables don’t necessarily have a stable index in database backends, so the index is generated on the result.

sorted = penguins.order_by(["bill_length_mm", ibis.desc("bill_depth_mm")]).head(5)
sorted
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie Dream    32.115.51883050female2009 │
│ Adelie Dream    33.116.11782900female2008 │
│ Adelie Torgersen33.519.01903600female2008 │
│ Adelie Dream    34.017.11853400female2008 │
│ Adelie Torgersen34.118.11933475NULL2007 │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

Aggregation

Aggregation in pandas is typically done by computing columns based on an aggregate function.

stats = [df.bill_depth_mm.sum(), df.bill_length_mm.mean()]
pd.DataFrame([stats], columns=["total_bill_depth", "avg.bill_length"])
total_bill_depth avg.bill_length
0 5865.7 43.92193

In Ibis, you construct aggregate expressions then apply them to the table using the aggregate method.

stats = [
    penguins.bill_depth_mm.sum().name("total_bill_width"),
    penguins.bill_length_mm.mean().name("avg_bill_length"),
]
agged = penguins.aggregate(stats)
agged
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ total_bill_width  avg_bill_length ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ float64float64         │
├──────────────────┼─────────────────┤
│           5865.743.92193 │
└──────────────────┴─────────────────┘

You can also combine both operations into one and pass the aggregate expressions using keyword parameters.

agged = penguins.aggregate(
    total_bill_depth=penguins.bill_depth_mm.sum(),
    avg_bill_length=penguins.bill_length_mm.mean(),
)
agged
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ total_bill_depth  avg_bill_length ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ float64float64         │
├──────────────────┼─────────────────┤
│           5865.743.92193 │
└──────────────────┴─────────────────┘

Group by

Using a similar API as above, aggregations can also be done across groupings using the by= parameter.

agged = penguins.aggregate(
    by="species",
    total_bill_depth=penguins.bill_depth_mm.sum(),
    avg_bill_length=penguins.bill_length_mm.mean(),
)
agged
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ species    total_bill_depth  avg_bill_length ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ stringfloat64float64         │
├───────────┼──────────────────┼─────────────────┤
│ Adelie   2770.338.791391 │
│ Chinstrap1252.648.833824 │
│ Gentoo   1842.847.504878 │
└───────────┴──────────────────┴─────────────────┘

Alternatively, by groups can be computed using a grouped table.

agged = penguins.group_by("species").aggregate(
    total_bill_depth=penguins.bill_depth_mm.sum(),
    avg_bill_length=penguins.bill_length_mm.mean(),
)
agged
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ species    total_bill_depth  avg_bill_length ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ stringfloat64float64         │
├───────────┼──────────────────┼─────────────────┤
│ Adelie   2770.338.791391 │
│ Chinstrap1252.648.833824 │
│ Gentoo   1842.847.504878 │
└───────────┴──────────────────┴─────────────────┘

You can group over multiple columns too, and rename them if you want.

If you only need to aggregate over a single column, then you don’t need to use the .aggregate() method.

penguins.group_by(["species", "sex"], location="island").body_mass_g.approx_median()
┏━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ species    sex     location   approx_median(body_mass_g) ┃
┡━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringstringstringint64                      │
├───────────┼────────┼───────────┼────────────────────────────┤
│ Adelie   femaleDream    3400 │
│ Adelie   male  Dream    3988 │
│ Gentoo   femaleBiscoe   4700 │
│ Gentoo   NULLBiscoe   4688 │
│ Adelie   NULLTorgersen3588 │
│ Adelie   NULLDream    2975 │
│ Gentoo   male  Biscoe   5500 │
│ ChinstrapfemaleDream    3550 │
│ Chinstrapmale  Dream    3950 │
│ Adelie   male  Torgersen4000 │
│  │
└───────────┴────────┴───────────┴────────────────────────────┘

Instead of aggregating after a group by, you can also transform the table so that the output table has the same number of rows as the input table. This is analogous to the groupby().transform() pattern in pandas. You can pass complex expressions to compute per-group:

# Calculate how much the mass of each penguin deviates from the mean
penguins.group_by(["species", "sex"]).mutate(
    # This column isn't needed, but it makes it easier to see what's going on
    mass_mean=penguins.body_mass_g.mean(),
    mass_deviation=penguins.body_mass_g - penguins.body_mass_g.mean(),
)
┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year   mass_mean    mass_deviation ┃
┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64float64float64        │
├─────────┼────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┼─────────────┼────────────────┤
│ Gentoo Biscoe50.016.32305700male  20075484.836066215.163934 │
│ Gentoo Biscoe49.916.12135400male  20095484.836066-84.836066 │
│ Gentoo Biscoe47.614.52155400male  20075484.836066-84.836066 │
│ Gentoo Biscoe46.715.32195200male  20075484.836066-284.836066 │
│ Gentoo Biscoe46.815.42155150male  20075484.836066-334.836066 │
│ Gentoo Biscoe49.016.12165550male  20075484.83606665.163934 │
│ Gentoo Biscoe48.414.62135850male  20075484.836066365.163934 │
│ Gentoo Biscoe49.315.72175850male  20075484.836066365.163934 │
│ Gentoo Biscoe49.215.22216300male  20075484.836066815.163934 │
│ Gentoo Biscoe48.715.12225350male  20075484.836066-134.836066 │
│  │
└─────────┴────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┴─────────────┴────────────────┘

NULL values

Ibis has first-class support for NULL values. In pandas and numpy, NULL values are represented by NaN. This can be confusing when working with numeric data, since NaN is also a valid floating point value (along with +/-inf).

In Ibis, we try to be more precise: All data types are nullable, and we use ibis.NA to represent NULL values, and all datatypes have a .isnull() method. For floating point values, we use different values for NaN and +/-inf, and there are the additional methods .isnan() and .isinf().

Dropping rows with NULLs

Both pandas and Ibis allow you to drop rows from a table based on whether a set of columns contains a NULL value. This method is called dropna in both packages. The common set of parameters in the two are subset= and how=. The subset= parameter indicates which columns to inspect for NULL values. The how= parameter specifies whether ‘any’ or ‘all’ of the specified columns must be NULL in order for the row to be dropped.

no_null_peng = penguins.dropna(["bill_depth_mm", "bill_length_mm"], how="any")

Filling NULL values

Both pandas and Ibis allow you to fill NULL values in a table. In Ibis, the replacement value can only be a scalar value of a dictionary of values. If it is a dictionary, the keys of the dictionary specify the column name for the value to apply to.

no_null_peng = penguins.fillna(dict(bill_depth_mm=0, bill_length_mm=0))

Replacing NULLs

Both pandas and Ibis have fillna methods which allow you to specify a replacement value for NULL values.

bill_length_no_nulls = penguins.bill_length_mm.fillna(0)

Type casts

Type casting in pandas is done using the astype method on columns.

df.bill_depth_mm.astype(str)
0      18.7
1      17.4
2      18.0
3       nan
4      19.3
       ... 
339    19.8
340    18.1
341    18.2
342    19.0
343    18.7
Name: bill_depth_mm, Length: 344, dtype: object

In Ibis, you cast the column type using the cast method.

penguins.bill_depth_mm.cast("int")
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Cast(bill_depth_mm, int64) ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ int64                      │
├────────────────────────────┤
│                         19 │
│                         17 │
│                         18 │
│                       NULL │
│                         19 │
│                         21 │
│                         18 │
│                         20 │
│                         18 │
│                         20 │
│                           │
└────────────────────────────┘

Casted columns can be assigned back to the table using the mutate method described earlier.

casted = penguins.mutate(
    bill_depth_mm=penguins.bill_depth_mm.cast("int"),
    bill_length_mm=penguins.bill_length_mm.cast("int"),
)
casted.schema()
ibis.Schema {
  species            string
  island             string
  bill_length_mm     int64
  bill_depth_mm      int64
  flipper_length_mm  int64
  body_mass_g        int64
  sex                string
  year               int64
}

Set membership

pandas set membership uses the in and not in operators such as 'a' in df.species. Ibis uses isin and notin methods. In addition to testing membership in a set, these methods allow you to specify an else case to assign a value when the value isn’t in the set.

penguins.species.value_counts()
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ species    species_count ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ stringint64         │
├───────────┼───────────────┤
│ Adelie   152 │
│ Chinstrap68 │
│ Gentoo   124 │
└───────────┴───────────────┘
refined = penguins.species.isin(["Adelie", "Chinstrap"])
refined.value_counts()
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ InValues(species)  InValues(species)_count ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ booleanint64                   │
├───────────────────┼─────────────────────────┤
│ False             │                     124 │
│ True              │                     220 │
└───────────────────┴─────────────────────────┘

Merging tables

While pandas uses the merge method to combine data from multiple DataFrames, Ibis uses the join method. They both have similar capabilities.

The biggest difference between Ibis’ join method and pandas’ merge method is that pandas only accepts column names or index levels to join on, whereas Ibis can merge on expressions.

Here are some examples of merging using pandas.

df_left = pd.DataFrame(
    [
        ["a", 1, 2],
        ["b", 3, 4],
        ["c", 4, 6],
    ],
    columns=["name", "x", "y"],
)

df_right = pd.DataFrame(
    [
        ["a", 100, 200],
        ["m", 300, 400],
        ["n", 400, 600],
    ],
    columns=["name", "x_100", "y_100"],
)
df_left.merge(df_right, on="name")
name x y x_100 y_100
0 a 1 2 100 200
df_left.merge(df_right, on="name", how="outer")
name x y x_100 y_100
0 a 1.0 2.0 100.0 200.0
1 b 3.0 4.0 NaN NaN
2 c 4.0 6.0 NaN NaN
3 m NaN NaN 300.0 400.0
4 n NaN NaN 400.0 600.0

We can now convert DataFrames to Ibis tables to do joins.

t_left = ibis.memtable(df_left, name="t_left")
t_right = ibis.memtable(df_right, name="t_right")
t_left.join(t_right, t_left.name == t_right.name)
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ name    x      y      x_100  y_100 ┃
┡━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ stringint64int64int64int64 │
├────────┼───────┼───────┼───────┼───────┤
│ a     12100200 │
└────────┴───────┴───────┴───────┴───────┘

Below is an outer join where missing values are filled with NaN.

t_left.join(t_right, t_left.name == t_right.name, how="outer")
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ name    x      y      name_right  x_100  y_100 ┃
┡━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ stringint64int64stringint64int64 │
├────────┼───────┼───────┼────────────┼───────┼───────┤
│ a     12a         100200 │
│ b     34NULLNULLNULL │
│ c     46NULLNULLNULL │
│ NULLNULLNULLm         300400 │
│ NULLNULLNULLn         400600 │
└────────┴───────┴───────┴────────────┴───────┴───────┘

Concatenating tables

Concatenating DataFrames in pandas is done with the concat top-level function. It takes multiple DataFrames and concatenates the rows of one DataFrame to the next. If the columns are mis-matched, it extends the list of columns to include the full set of columns and inserts NaNs and Nones into the missing values.

Concatenating tables in Ibis can only be done on tables with matching schemas. The concatenation is done using the top-level union function or the union method on a table.

We’ll demonstrate a pandas concat first.

df_1 = pd.DataFrame(
    [
        ["a", 1, 2],
        ["b", 3, 4],
        ["c", 4, 6],
    ],
    columns=["name", "x", "y"],
)

df_2 = pd.DataFrame(
    [
        ["a", 100, 200],
        ["m", 300, 400],
        ["n", 400, 600],
    ],
    columns=["name", "x", "y"],
)
pd.concat([df_1, df_2])
name x y
0 a 1 2
1 b 3 4
2 c 4 6
0 a 100 200
1 m 300 400
2 n 400 600

Now we can convert the DataFrames to Ibis tables and combine the tables using a union.

t_1 = ibis.memtable(df_1, name="t_1")
t_2 = ibis.memtable(df_2, name="t_2")
unioned = ibis.union(t_1, t_2)
unioned
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ name    x      y     ┃
┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ stringint64int64 │
├────────┼───────┼───────┤
│ a     12 │
│ b     34 │
│ c     46 │
│ a     100200 │
│ m     300400 │
│ n     400600 │
└────────┴───────┴───────┘
Back to top