import ibis
import pandas as pd
= True ibis.options.interactive
Tutorial: Ibis for pandas users
Prerequisites
We recommend starting with the default (DuckDB) backend for a performant, fully-featured local experience. You can install Ibis with pip
, conda
, mamba
, or pixi
.
pip install 'ibis-framework[duckdb,examples]'
Note that the ibis-framework
package is not the same as the ibis
package in PyPI. These two libraries cannot coexist in the same Python environment, as they are both imported with the ibis
module name.
conda install -c conda-forge ibis-duckdb
mamba install -c conda-forge ibis-duckdb
pixi add ibis-duckdb
Overview
Much of the syntax and many of the operations in Ibis are inspired by the pandas DataFrame
but the primary domain of Ibis is SQL so there are some differences in how they operate.
For one thing, SQL (and therefore Ibis) makes no guarantees about row order, which is a key assumption that numpy/pandas users are used to. So two columns can’t be deterministically “lined up” unless they are actually part of the same Table. An outcome of this is that you can’t index into Columns by position (column.head(5)
or column[3:5]
). You can only index into Tables (table.head(5)
or table[3:5]
). So if you want the first 5 rows of a column, you have to do table.head(5).my_column
, table.my_column.head(5)
will not work.
Another difference between Ibis tables and pandas DataFrame
s are that many of the pandas DataFrame
operations do in-place operations (they are “mutable”), whereas Ibis table operations always return a new table expression (“immutable”).
Finally, Ibis table expressions are lazy, meaning that as you build up a table expression, no computation is actually performed until you call an action method such as to_pandas
. Only then does Ibis compile the table expression into SQL and send it to the backend. (Note that we’ll be using Ibis’ interactive mode to automatically execute queries at the end of each cell in this notebook. If you are using similar code in a program, you will have to add .to_pandas()
to each operation that you want to evaluate.)
We’ll be using the DuckDB backend in Ibis in the examples below. First we’ll create a simple DataFrame
.
= pd.DataFrame(
df "a", 1, 2], ["b", 3, 4]],
[[=["one", "two", "three"],
columns=[5, 6],
index
) df
one | two | three | |
---|---|---|---|
5 | a | 1 | 2 |
6 | b | 3 | 4 |
Now we can create an Ibis table from the above DataFrame
.
Note that the index from the pandas DataFrame
is dropped. Ibis has no notion of an index: If you want to use the index, you will need to turn it into a column.
= ibis.memtable(df, name="t")
t t
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓ ┃ one ┃ two ┃ three ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩ │ string │ int64 │ int64 │ ├────────┼───────┼───────┤ │ a │ 1 │ 2 │ │ b │ 3 │ 4 │ └────────┴───────┴───────┘
Data types
The data types of columns in pandas are accessed using the dtypes
attribute. This returns a Series
object.
df.dtypes
one object
two int64
three int64
dtype: object
In Ibis, you use the schema
method which returns an ibis.Schema
object.
t.schema()
ibis.Schema {
one string
two int64
three int64
}
It is possible to convert the schema information to pandas data types using the to_pandas
method, if needed.
t.schema().to_pandas()
[('one', dtype('O')), ('two', dtype('int64')), ('three', dtype('int64'))]
Table layout
In pandas, the layout of the table is contained in the shape
attribute which contains the number of rows and number of columns in a tuple. The number of columns in an Ibis table can be gotten from the length of the schema.
len(t.schema())
3
To get the number of rows of a table, you use the count
method.
t.count()
┌───┐
│ 2 │
└───┘
To mimic pandas’ behavior, you would use the following code. Note that you need to use the to_pandas
method after count
to evaluate the expression returned by count
.
len(t.schema())) (t.count().to_pandas(),
(2, 3)
df.shape
(2, 3)
Subsetting columns
Selecting columns is very similar to in pandas. In fact, you can use the same syntax.
"one", "two"]] t[[
┏━━━━━━━━┳━━━━━━━┓ ┃ one ┃ two ┃ ┡━━━━━━━━╇━━━━━━━┩ │ string │ int64 │ ├────────┼───────┤ │ a │ 1 │ │ b │ 3 │ └────────┴───────┘
Selecting columns
Selecting columns is done using the same syntax as in pandas DataFrames
. You can use either the indexing syntax or attribute syntax.
"one"] t[
┏━━━━━━━━┓ ┃ one ┃ ┡━━━━━━━━┩ │ string │ ├────────┤ │ a │ │ b │ └────────┘
or:
t.one
┏━━━━━━━━┓ ┃ one ┃ ┡━━━━━━━━┩ │ string │ ├────────┤ │ a │ │ b │ └────────┘
Adding, removing, and modifying columns
Modifying the columns of an Ibis table is a bit different than doing the same operations in a pandas DataFrame
. This is primarily due to the fact that in-place operations are not supported on Ibis tables. Each time you do a column modification to a table, a new table expression is returned.
Adding columns
Adding columns is done through the mutate
method.
= t.mutate(new_col=t.three * 2)
mutated mutated
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━┓ ┃ one ┃ two ┃ three ┃ new_col ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━┩ │ string │ int64 │ int64 │ int64 │ ├────────┼───────┼───────┼─────────┤ │ a │ 1 │ 2 │ 4 │ │ b │ 3 │ 4 │ 8 │ └────────┴───────┴───────┴─────────┘
Notice that the original table object remains unchanged. Only the mutated
object that was returned contains the new column.
t
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓ ┃ one ┃ two ┃ three ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩ │ string │ int64 │ int64 │ ├────────┼───────┼───────┤ │ a │ 1 │ 2 │ │ b │ 3 │ 4 │ └────────┴───────┴───────┘
It is also possible to create a column in isolation. This is similar to a Series
in pandas. Note that the name of the column by default is a representation of the expression:
= t.three * 2
unnamed unnamed
┏━━━━━━━━━━━━━━━━━━━━┓ ┃ Multiply(three, 2) ┃ ┡━━━━━━━━━━━━━━━━━━━━┩ │ int64 │ ├────────────────────┤ │ 4 │ │ 8 │ └────────────────────┘
To get a version with a specific name, you can use the name
method:
= unnamed.name("new_col")
new_col new_col
┏━━━━━━━━━┓ ┃ new_col ┃ ┡━━━━━━━━━┩ │ int64 │ ├─────────┤ │ 4 │ │ 8 │ └─────────┘
You can then add this column to the table using mutate
= t.mutate(new_col)
proj proj
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━┓ ┃ one ┃ two ┃ three ┃ new_col ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━┩ │ string │ int64 │ int64 │ int64 │ ├────────┼───────┼───────┼─────────┤ │ a │ 1 │ 2 │ 4 │ │ b │ 3 │ 4 │ 8 │ └────────┴───────┴───────┴─────────┘
Removing columns
Removing a column is done using the drop
method.
t.columns
('one', 'two', 'three')
= t.drop("one", "two")
subset subset.columns
('three',)
It is also possible to drop columns by selecting the columns you want to remain.
= t["two", "three"]
subset subset.columns
('two', 'three')
Modifying columns
Replacing existing columns is done using the mutate
method just like adding columns. You add a column of the same name to replace it.
t
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓ ┃ one ┃ two ┃ three ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩ │ string │ int64 │ int64 │ ├────────┼───────┼───────┤ │ a │ 1 │ 2 │ │ b │ 3 │ 4 │ └────────┴───────┴───────┘
= t.mutate(two=t.two * 2)
mutated mutated
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓ ┃ one ┃ two ┃ three ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩ │ string │ int64 │ int64 │ ├────────┼───────┼───────┤ │ a │ 2 │ 2 │ │ b │ 6 │ 4 │ └────────┴───────┴───────┘
Renaming columns
In addition to replacing columns, you can rename them as well. This is done with the rename
method which takes a dictionary containing the name mappings.
= t.rename(
renamed dict(
="one",
a="two",
b
)
) renamed
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓ ┃ a ┃ b ┃ three ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩ │ string │ int64 │ int64 │ ├────────┼───────┼───────┤ │ a │ 1 │ 2 │ │ b │ 3 │ 4 │ └────────┴───────┴───────┘
Selecting rows
There are several methods that can be used to select rows of data in various ways. These are described in the sections below. We’ll use the Palmer Penguins\(^1\) dataset to investigate! Ibis has several built-in example datasets that you can access using the ibis.examples
module.
\(^1\): Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.
= ibis.examples.penguins.fetch() penguins
penguins
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ female │ 2007 │ │ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ female │ 2007 │ │ Adelie │ Torgersen │ NULL │ NULL │ NULL │ NULL │ NULL │ 2007 │ │ Adelie │ Torgersen │ 36.7 │ 19.3 │ 193 │ 3450 │ female │ 2007 │ │ Adelie │ Torgersen │ 39.3 │ 20.6 │ 190 │ 3650 │ male │ 2007 │ │ Adelie │ Torgersen │ 38.9 │ 17.8 │ 181 │ 3625 │ female │ 2007 │ │ Adelie │ Torgersen │ 39.2 │ 19.6 │ 195 │ 4675 │ male │ 2007 │ │ Adelie │ Torgersen │ 34.1 │ 18.1 │ 193 │ 3475 │ NULL │ 2007 │ │ Adelie │ Torgersen │ 42.0 │ 20.2 │ 190 │ 4250 │ NULL │ 2007 │ │ … │ … │ … │ … │ … │ … │ … │ … │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
Head, tail and limit
The head
method works the same ways as in pandas. Note that some Ibis backends may not have an inherent ordering of their rows and using head
may not return deterministic results. In those cases, you can use sorting before calling head
to ensure a stable result.
5) penguins.head(
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ female │ 2007 │ │ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ female │ 2007 │ │ Adelie │ Torgersen │ NULL │ NULL │ NULL │ NULL │ NULL │ 2007 │ │ Adelie │ Torgersen │ 36.7 │ 19.3 │ 193 │ 3450 │ female │ 2007 │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
However, the tail method is not implemented since it is not supported in all databases. It is possible to emulate the tail
method if you use sorting in your table to do a reverse sort then use the head
method to retrieve the “top” rows.
Another way to limit the number of retrieved rows is using the limit
method. The following will return the same result as head(5)
. This is often used in conjunction with other filtering techniques that we will cover later.
5) penguins.limit(
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ female │ 2007 │ │ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ female │ 2007 │ │ Adelie │ Torgersen │ NULL │ NULL │ NULL │ NULL │ NULL │ 2007 │ │ Adelie │ Torgersen │ 36.7 │ 19.3 │ 193 │ 3450 │ female │ 2007 │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
Filtering rows
In addition to limiting the number of rows that are returned, it is possible to filter the rows using expressions. This is done using the filter
method in ibis. Ibis expressions are constructed from operations on columns in a table which return a boolean result. This result is then used to filter the table.
= penguins.bill_length_mm > 37.0
expr expr
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Greater(bill_length_mm, 37.0) ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ boolean │ ├───────────────────────────────┤ │ True │ │ True │ │ True │ │ NULL │ │ False │ │ True │ │ True │ │ True │ │ False │ │ True │ │ … │ └───────────────────────────────┘
We can evaluate the value counts to see how many rows we will expect to get back after filtering.
expr.value_counts()
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Greater(bill_length_mm, 37.0) ┃ Greater(bill_length_mm, 37.0)_count ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ boolean │ int64 │ ├───────────────────────────────┼─────────────────────────────────────┤ │ False │ 42 │ │ NULL │ 2 │ │ True │ 300 │ └───────────────────────────────┴─────────────────────────────────────┘
Now we apply the filter to the table. Since there are 6 True values in the expression, we should get 6 rows back.
= penguins.filter(expr)
filtered filtered
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ female │ 2007 │ │ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ female │ 2007 │ │ Adelie │ Torgersen │ 39.3 │ 20.6 │ 190 │ 3650 │ male │ 2007 │ │ Adelie │ Torgersen │ 38.9 │ 17.8 │ 181 │ 3625 │ female │ 2007 │ │ Adelie │ Torgersen │ 39.2 │ 19.6 │ 195 │ 4675 │ male │ 2007 │ │ Adelie │ Torgersen │ 42.0 │ 20.2 │ 190 │ 4250 │ NULL │ 2007 │ │ Adelie │ Torgersen │ 37.8 │ 17.1 │ 186 │ 3300 │ NULL │ 2007 │ │ Adelie │ Torgersen │ 37.8 │ 17.3 │ 180 │ 3700 │ NULL │ 2007 │ │ Adelie │ Torgersen │ 41.1 │ 17.6 │ 182 │ 3200 │ female │ 2007 │ │ … │ … │ … │ … │ … │ … │ … │ … │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
Of course, the filtering expression can be applied inline as well.
= penguins.filter(penguins.bill_length_mm > 37.0)
filtered filtered
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ female │ 2007 │ │ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ female │ 2007 │ │ Adelie │ Torgersen │ 39.3 │ 20.6 │ 190 │ 3650 │ male │ 2007 │ │ Adelie │ Torgersen │ 38.9 │ 17.8 │ 181 │ 3625 │ female │ 2007 │ │ Adelie │ Torgersen │ 39.2 │ 19.6 │ 195 │ 4675 │ male │ 2007 │ │ Adelie │ Torgersen │ 42.0 │ 20.2 │ 190 │ 4250 │ NULL │ 2007 │ │ Adelie │ Torgersen │ 37.8 │ 17.1 │ 186 │ 3300 │ NULL │ 2007 │ │ Adelie │ Torgersen │ 37.8 │ 17.3 │ 180 │ 3700 │ NULL │ 2007 │ │ Adelie │ Torgersen │ 41.1 │ 17.6 │ 182 │ 3200 │ female │ 2007 │ │ … │ … │ … │ … │ … │ … │ … │ … │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
Multiple filtering expressions may be passed in to a single call (filtering only rows where they’re all true), or combined together using common boolean operators like (&
, |
). The expressions below are equivalent:
= penguins.filter(penguins.bill_length_mm > 37.0, penguins.bill_depth_mm > 18.0)
filtered filtered
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.3 │ 20.6 │ 190 │ 3650 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.2 │ 19.6 │ 195 │ 4675 │ male │ 2007 │ │ Adelie │ Torgersen │ 42.0 │ 20.2 │ 190 │ 4250 │ NULL │ 2007 │ │ Adelie │ Torgersen │ 38.6 │ 21.2 │ 191 │ 3800 │ male │ 2007 │ │ Adelie │ Torgersen │ 38.7 │ 19.0 │ 195 │ 3450 │ female │ 2007 │ │ Adelie │ Torgersen │ 42.5 │ 20.7 │ 197 │ 4500 │ male │ 2007 │ │ Adelie │ Torgersen │ 46.0 │ 21.5 │ 194 │ 4200 │ male │ 2007 │ │ Adelie │ Biscoe │ 37.8 │ 18.3 │ 174 │ 3400 │ female │ 2007 │ │ Adelie │ Biscoe │ 37.7 │ 18.7 │ 180 │ 3600 │ male │ 2007 │ │ … │ … │ … │ … │ … │ … │ … │ … │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
= penguins.filter((penguins.bill_length_mm > 37.0) & (penguins.bill_depth_mm > 18.0))
filtered filtered
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.3 │ 20.6 │ 190 │ 3650 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.2 │ 19.6 │ 195 │ 4675 │ male │ 2007 │ │ Adelie │ Torgersen │ 42.0 │ 20.2 │ 190 │ 4250 │ NULL │ 2007 │ │ Adelie │ Torgersen │ 38.6 │ 21.2 │ 191 │ 3800 │ male │ 2007 │ │ Adelie │ Torgersen │ 38.7 │ 19.0 │ 195 │ 3450 │ female │ 2007 │ │ Adelie │ Torgersen │ 42.5 │ 20.7 │ 197 │ 4500 │ male │ 2007 │ │ Adelie │ Torgersen │ 46.0 │ 21.5 │ 194 │ 4200 │ male │ 2007 │ │ Adelie │ Biscoe │ 37.8 │ 18.3 │ 174 │ 3400 │ female │ 2007 │ │ Adelie │ Biscoe │ 37.7 │ 18.7 │ 180 │ 3600 │ male │ 2007 │ │ … │ … │ … │ … │ … │ … │ … │ … │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
Aggregation has not been discussed yet, but aggregate values can be used in expressions to return things such as all of the rows in a data set where the value in a column is greater than the mean.
= penguins.filter(penguins.bill_length_mm > penguins.bill_length_mm.mean())
filtered filtered
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Torgersen │ 46.0 │ 21.5 │ 194 │ 4200 │ male │ 2007 │ │ Adelie │ Dream │ 44.1 │ 19.7 │ 196 │ 4400 │ male │ 2007 │ │ Adelie │ Torgersen │ 45.8 │ 18.9 │ 197 │ 4150 │ male │ 2008 │ │ Adelie │ Biscoe │ 45.6 │ 20.3 │ 191 │ 4600 │ male │ 2009 │ │ Adelie │ Torgersen │ 44.1 │ 18.0 │ 210 │ 4000 │ male │ 2009 │ │ Gentoo │ Biscoe │ 46.1 │ 13.2 │ 211 │ 4500 │ female │ 2007 │ │ Gentoo │ Biscoe │ 50.0 │ 16.3 │ 230 │ 5700 │ male │ 2007 │ │ Gentoo │ Biscoe │ 48.7 │ 14.1 │ 210 │ 4450 │ female │ 2007 │ │ Gentoo │ Biscoe │ 50.0 │ 15.2 │ 218 │ 5700 │ male │ 2007 │ │ Gentoo │ Biscoe │ 47.6 │ 14.5 │ 215 │ 5400 │ male │ 2007 │ │ … │ … │ … │ … │ … │ … │ … │ … │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
Modifying rows
Sometimes you want to modify the values in a column based on some condition. In pandas, you would do something like df.loc[condition] = new_value
. In Ibis though, remember that all expressions are immutable, so you need to create a new table expression with the modified values. You do this using the ifelse
method on boolean columns:
= penguins.bill_length_mm > 37.0
long_billed_penguins = long_billed_penguins.ifelse("wide", penguins.species)
species_modified =species_modified) penguins.mutate(species_modified
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ species_modified ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ string │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┼──────────────────┤ │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │ wide │ │ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ female │ 2007 │ wide │ │ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ female │ 2007 │ wide │ │ Adelie │ Torgersen │ NULL │ NULL │ NULL │ NULL │ NULL │ 2007 │ Adelie │ │ Adelie │ Torgersen │ 36.7 │ 19.3 │ 193 │ 3450 │ female │ 2007 │ Adelie │ │ Adelie │ Torgersen │ 39.3 │ 20.6 │ 190 │ 3650 │ male │ 2007 │ wide │ │ Adelie │ Torgersen │ 38.9 │ 17.8 │ 181 │ 3625 │ female │ 2007 │ wide │ │ Adelie │ Torgersen │ 39.2 │ 19.6 │ 195 │ 4675 │ male │ 2007 │ wide │ │ Adelie │ Torgersen │ 34.1 │ 18.1 │ 193 │ 3475 │ NULL │ 2007 │ Adelie │ │ Adelie │ Torgersen │ 42.0 │ 20.2 │ 190 │ 4250 │ NULL │ 2007 │ wide │ │ … │ … │ … │ … │ … │ … │ … │ … │ … │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┴──────────────────┘
Sorting rows
Sorting rows in Ibis uses a somewhat different API than in pandas. In pandas, you would use the sort_values
method to order rows by values in specified columns. Ibis uses a method called order_by
. To specify ascending or descending orders, pandas uses an ascending=
argument to sort_values
that indicates the order for each sorting column. Ibis allows you to tag the column name in the order_by
list as ascending or descending by wrapping it with ibis.asc
or ibis.desc
.
First, let’s ask Ibis for a pandas DataFrame
version of the penguin data:
= penguins.to_pandas() df
Here is an example of sorting a DataFrame
using two sort keys. One key is sorting in ascending order and the other is in descending order.
df.sort_values("bill_length_mm", "bill_depth_mm"], ascending=[True, False], na_position="first"
[5) ).head(
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | |
---|---|---|---|---|---|---|---|---|
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | None | 2007 |
271 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | None | 2009 |
142 | Adelie | Dream | 32.1 | 15.5 | 188.0 | 3050.0 | female | 2009 |
98 | Adelie | Dream | 33.1 | 16.1 | 178.0 | 2900.0 | female | 2008 |
70 | Adelie | Torgersen | 33.5 | 19.0 | 190.0 | 3600.0 | female | 2008 |
The same operation in Ibis would look like the following. Note that the index values of the resulting DataFrame
start from zero and count up, whereas in the example above, they retain their original index value. This is because rows in tables don’t necessarily have a stable index in database backends, so the index is generated on the result.
sorted = penguins.order_by(["bill_length_mm", ibis.desc("bill_depth_mm")]).head(5)
sorted
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Dream │ 32.1 │ 15.5 │ 188 │ 3050 │ female │ 2009 │ │ Adelie │ Dream │ 33.1 │ 16.1 │ 178 │ 2900 │ female │ 2008 │ │ Adelie │ Torgersen │ 33.5 │ 19.0 │ 190 │ 3600 │ female │ 2008 │ │ Adelie │ Dream │ 34.0 │ 17.1 │ 185 │ 3400 │ female │ 2008 │ │ Adelie │ Torgersen │ 34.1 │ 18.1 │ 193 │ 3475 │ NULL │ 2007 │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
Aggregation
Aggregation in pandas is typically done by computing columns based on an aggregate function.
= [df.bill_depth_mm.sum(), df.bill_length_mm.mean()]
stats =["total_bill_depth", "avg.bill_length"]) pd.DataFrame([stats], columns
total_bill_depth | avg.bill_length | |
---|---|---|
0 | 5865.7 | 43.92193 |
In Ibis, you construct aggregate expressions then apply them to the table using the aggregate
method.
= [
stats sum().name("total_bill_width"),
penguins.bill_depth_mm."avg_bill_length"),
penguins.bill_length_mm.mean().name(
]= penguins.aggregate(stats)
agged agged
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ ┃ total_bill_width ┃ avg_bill_length ┃ ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ │ float64 │ float64 │ ├──────────────────┼─────────────────┤ │ 5865.7 │ 43.92193 │ └──────────────────┴─────────────────┘
You can also combine both operations into one and pass the aggregate expressions using keyword parameters.
= penguins.aggregate(
agged =penguins.bill_depth_mm.sum(),
total_bill_depth=penguins.bill_length_mm.mean(),
avg_bill_length
) agged
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ ┃ total_bill_depth ┃ avg_bill_length ┃ ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ │ float64 │ float64 │ ├──────────────────┼─────────────────┤ │ 5865.7 │ 43.92193 │ └──────────────────┴─────────────────┘
Group by
Using a similar API as above, aggregations can also be done across groupings using the by=
parameter.
= penguins.aggregate(
agged ="species",
by=penguins.bill_depth_mm.sum(),
total_bill_depth=penguins.bill_length_mm.mean(),
avg_bill_length
) agged
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ ┃ species ┃ total_bill_depth ┃ avg_bill_length ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ │ string │ float64 │ float64 │ ├───────────┼──────────────────┼─────────────────┤ │ Gentoo │ 1842.8 │ 47.504878 │ │ Adelie │ 2770.3 │ 38.791391 │ │ Chinstrap │ 1252.6 │ 48.833824 │ └───────────┴──────────────────┴─────────────────┘
Alternatively, by groups can be computed using a grouped table.
= penguins.group_by("species").aggregate(
agged =penguins.bill_depth_mm.sum(),
total_bill_depth=penguins.bill_length_mm.mean(),
avg_bill_length
) agged
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ ┃ species ┃ total_bill_depth ┃ avg_bill_length ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ │ string │ float64 │ float64 │ ├───────────┼──────────────────┼─────────────────┤ │ Gentoo │ 1842.8 │ 47.504878 │ │ Chinstrap │ 1252.6 │ 48.833824 │ │ Adelie │ 2770.3 │ 38.791391 │ └───────────┴──────────────────┴─────────────────┘
You can group over multiple columns too, and rename them if you want.
If you only need to aggregate over a single column, then you don’t need to use the .aggregate()
method.
"species", "sex"], location="island").body_mass_g.approx_median() penguins.group_by([
┏━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ species ┃ sex ┃ location ┃ approx_median(body_mass_g) ┃ ┡━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ string │ string │ string │ float64 │ ├───────────┼────────┼───────────┼────────────────────────────┤ │ Adelie │ NULL │ Torgersen │ 3588.0 │ │ Adelie │ male │ Torgersen │ 4000.0 │ │ Adelie │ female │ Torgersen │ 3400.0 │ │ Adelie │ female │ Biscoe │ 3375.0 │ │ Adelie │ male │ Biscoe │ 4000.0 │ │ Adelie │ NULL │ Dream │ 2975.0 │ │ Gentoo │ male │ Biscoe │ 5500.0 │ │ Chinstrap │ female │ Dream │ 3550.0 │ │ Chinstrap │ male │ Dream │ 3950.0 │ │ Adelie │ female │ Dream │ 3400.0 │ │ … │ … │ … │ … │ └───────────┴────────┴───────────┴────────────────────────────┘
Instead of aggregating after a group by, you can also transform the table so that the output table has the same number of rows as the input table. This is analogous to the groupby().transform()
pattern in pandas. You can pass complex expressions to compute per-group:
# Calculate how much the mass of each penguin deviates from the mean
"species", "sex"]).mutate(
penguins.group_by([# This column isn't needed, but it makes it easier to see what's going on
=penguins.body_mass_g.mean(),
mass_mean=penguins.body_mass_g - penguins.body_mass_g.mean(),
mass_deviation )
┏━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ mass_mean ┃ mass_deviation ┃ ┡━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ float64 │ float64 │ ├───────────┼────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┼─────────────┼────────────────┤ │ Chinstrap │ Dream │ 50.0 │ 19.5 │ 196 │ 3900 │ male │ 2007 │ 3938.970588 │ -38.970588 │ │ Chinstrap │ Dream │ 50.8 │ 19.0 │ 210 │ 4100 │ male │ 2009 │ 3938.970588 │ 161.029412 │ │ Chinstrap │ Dream │ 52.7 │ 19.8 │ 197 │ 3725 │ male │ 2007 │ 3938.970588 │ -213.970588 │ │ Chinstrap │ Dream │ 51.3 │ 18.2 │ 197 │ 3750 │ male │ 2007 │ 3938.970588 │ -188.970588 │ │ Chinstrap │ Dream │ 51.3 │ 19.9 │ 198 │ 3700 │ male │ 2007 │ 3938.970588 │ -238.970588 │ │ Chinstrap │ Dream │ 51.7 │ 20.3 │ 194 │ 3775 │ male │ 2007 │ 3938.970588 │ -163.970588 │ │ Chinstrap │ Dream │ 52.0 │ 18.1 │ 201 │ 4050 │ male │ 2007 │ 3938.970588 │ 111.029412 │ │ Chinstrap │ Dream │ 50.5 │ 19.6 │ 201 │ 4050 │ male │ 2007 │ 3938.970588 │ 111.029412 │ │ Chinstrap │ Dream │ 50.3 │ 20.0 │ 197 │ 3300 │ male │ 2007 │ 3938.970588 │ -638.970588 │ │ Chinstrap │ Dream │ 50.7 │ 19.7 │ 203 │ 4050 │ male │ 2009 │ 3938.970588 │ 111.029412 │ │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ └───────────┴────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┴─────────────┴────────────────┘
NULL values
Ibis has first-class support for NULL
values. In pandas and numpy, NULL
values are represented by NaN
. This can be confusing when working with numeric data, since NaN
is also a valid floating point value (along with +/-inf
).
In Ibis, we try to be more precise: All data types are nullable, and we use ibis.null()
to represent NULL
values, and all datatypes have a .isnull()
method. For floating point values, we use different values for NaN
and +/-inf
, and there are the additional methods .isnan()
and .isinf()
.
Dropping rows with NULL
s
Both pandas and Ibis allow you to drop rows from a table based on whether a set of columns contains a NULL
value. This method is called dropna
in both packages. The common set of parameters in the two are subset=
and how=
. The subset=
parameter indicates which columns to inspect for NULL
values. The how=
parameter specifies whether ‘any’ or ‘all’ of the specified columns must be NULL
in order for the row to be dropped.
= penguins.dropna(["bill_depth_mm", "bill_length_mm"], how="any") no_null_peng
Filling NULL
values
Both pandas and Ibis allow you to fill NULL
values in a table. In Ibis, the replacement value can only be a scalar value of a dictionary of values. If it is a dictionary, the keys of the dictionary specify the column name for the value to apply to.
= penguins.fill_null(dict(bill_depth_mm=0, bill_length_mm=0)) no_null_peng
Replacing NULL
s
The Ibis equivalent of pandas fillna
is fill_null
, this method allows you to specify a replacement value for NULL
values.
= penguins.bill_length_mm.fill_null(0) bill_length_no_nulls
Type casts
Type casting in pandas is done using the astype
method on columns.
str) df.bill_depth_mm.astype(
0 18.7
1 17.4
2 18.0
3 nan
4 19.3
...
339 19.8
340 18.1
341 18.2
342 19.0
343 18.7
Name: bill_depth_mm, Length: 344, dtype: object
In Ibis, you cast the column type using the cast
method.
"int") penguins.bill_depth_mm.cast(
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Cast(bill_depth_mm, int64) ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ int64 │ ├────────────────────────────┤ │ 19 │ │ 17 │ │ 18 │ │ NULL │ │ 19 │ │ 21 │ │ 18 │ │ 20 │ │ 18 │ │ 20 │ │ … │ └────────────────────────────┘
Casted columns can be assigned back to the table using the mutate
method described earlier.
= penguins.mutate(
casted =penguins.bill_depth_mm.cast("int"),
bill_depth_mm=penguins.bill_length_mm.cast("int"),
bill_length_mm
) casted.schema()
ibis.Schema {
species string
island string
bill_length_mm int64
bill_depth_mm int64
flipper_length_mm int64
body_mass_g int64
sex string
year int64
}
Set membership
pandas set membership uses the in
and not in
operators such as 'a' in df.species
. Ibis uses isin
and notin
methods. In addition to testing membership in a set, these methods allow you to specify an else case to assign a value when the value isn’t in the set.
penguins.species.value_counts()
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ species ┃ species_count ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ string │ int64 │ ├───────────┼───────────────┤ │ Chinstrap │ 68 │ │ Gentoo │ 124 │ │ Adelie │ 152 │ └───────────┴───────────────┘
= penguins.species.isin(["Adelie", "Chinstrap"])
refined refined.value_counts()
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ InValues(species, ('Adelie', 'Chinstrap')) ┃ InValues(species, ('Adelie', 'Chinstrap'))_count ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ boolean │ int64 │ ├────────────────────────────────────────────┼──────────────────────────────────────────────────┤ │ False │ 124 │ │ True │ 220 │ └────────────────────────────────────────────┴──────────────────────────────────────────────────┘
Merging tables
While pandas uses the merge
method to combine data from multiple DataFrames
, Ibis uses the join
method. They both have similar capabilities.
The biggest difference between Ibis’ join
method and pandas’ merge
method is that pandas only accepts column names or index levels to join on, whereas Ibis can merge on expressions.
Here are some examples of merging using pandas.
= pd.DataFrame(
df_left
["a", 1, 2],
["b", 3, 4],
["c", 4, 6],
[
],=["name", "x", "y"],
columns
)
= pd.DataFrame(
df_right
["a", 100, 200],
["m", 300, 400],
["n", 400, 600],
[
],=["name", "x_100", "y_100"],
columns )
="name") df_left.merge(df_right, on
name | x | y | x_100 | y_100 | |
---|---|---|---|---|---|
0 | a | 1 | 2 | 100 | 200 |
="name", how="outer") df_left.merge(df_right, on
name | x | y | x_100 | y_100 | |
---|---|---|---|---|---|
0 | a | 1.0 | 2.0 | 100.0 | 200.0 |
1 | b | 3.0 | 4.0 | NaN | NaN |
2 | c | 4.0 | 6.0 | NaN | NaN |
3 | m | NaN | NaN | 300.0 | 400.0 |
4 | n | NaN | NaN | 400.0 | 600.0 |
We can now convert DataFrames
to Ibis tables to do join
s.
= ibis.memtable(df_left, name="t_left")
t_left = ibis.memtable(df_right, name="t_right") t_right
== t_right.name) t_left.join(t_right, t_left.name
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┓ ┃ name ┃ x ┃ y ┃ x_100 ┃ y_100 ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━┩ │ string │ int64 │ int64 │ int64 │ int64 │ ├────────┼───────┼───────┼───────┼───────┤ │ a │ 1 │ 2 │ 100 │ 200 │ └────────┴───────┴───────┴───────┴───────┘
Below is an outer join where missing values are filled with NaN
.
== t_right.name, how="outer") t_left.join(t_right, t_left.name
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┓ ┃ name ┃ x ┃ y ┃ name_right ┃ x_100 ┃ y_100 ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━┩ │ string │ int64 │ int64 │ string │ int64 │ int64 │ ├────────┼───────┼───────┼────────────┼───────┼───────┤ │ a │ 1 │ 2 │ a │ 100 │ 200 │ │ b │ 3 │ 4 │ NULL │ NULL │ NULL │ │ c │ 4 │ 6 │ NULL │ NULL │ NULL │ │ NULL │ NULL │ NULL │ m │ 300 │ 400 │ │ NULL │ NULL │ NULL │ n │ 400 │ 600 │ └────────┴───────┴───────┴────────────┴───────┴───────┘
Concatenating tables
Concatenating DataFrame
s in pandas is done with the concat
top-level function. It takes multiple DataFrames
and concatenates the rows of one DataFrame
to the next. If the columns are mis-matched, it extends the list of columns to include the full set of columns and inserts NaN
s and None
s into the missing values.
Concatenating tables in Ibis can only be done on tables with matching schemas. The concatenation is done using the top-level union
function or the union
method on a table.
We’ll demonstrate a pandas concat
first.
= pd.DataFrame(
df_1
["a", 1, 2],
["b", 3, 4],
["c", 4, 6],
[
],=["name", "x", "y"],
columns
)
= pd.DataFrame(
df_2
["a", 100, 200],
["m", 300, 400],
["n", 400, 600],
[
],=["name", "x", "y"],
columns )
pd.concat([df_1, df_2])
name | x | y | |
---|---|---|---|
0 | a | 1 | 2 |
1 | b | 3 | 4 |
2 | c | 4 | 6 |
0 | a | 100 | 200 |
1 | m | 300 | 400 |
2 | n | 400 | 600 |
Now we can convert the DataFrame
s to Ibis tables and combine the tables using a union.
= ibis.memtable(df_1, name="t_1")
t_1 = ibis.memtable(df_2, name="t_2") t_2
= ibis.union(t_1, t_2)
unioned unioned
┏━━━━━━━━┳━━━━━━━┳━━━━━━━┓ ┃ name ┃ x ┃ y ┃ ┡━━━━━━━━╇━━━━━━━╇━━━━━━━┩ │ string │ int64 │ int64 │ ├────────┼───────┼───────┤ │ a │ 1 │ 2 │ │ b │ 3 │ 4 │ │ c │ 4 │ 6 │ │ a │ 100 │ 200 │ │ m │ 300 │ 400 │ │ n │ 400 │ 600 │ └────────┴───────┴───────┘