Ibis + Substrait + DuckDB

blog
substrait
ecosystem
duckdb
Author

Gil Forsyth

Published

February 1, 2023

Ibis strives to provide a consistent interface for interacting with a multitude of different analytical execution engines, most of which (but not all) speak some dialect of SQL.

Today, Ibis accomplishes this with a lot of help from sqlalchemy and sqlglot to handle differences in dialect, or we interact directly with available Python bindings (for instance with the pandas, datafusion, and polars backends).

Ibis goes to great lengths to generate sane and consistent SQL for those backends that use it. We are also interested in exploring other means of communicating consistently with those backends.

Substrait is a new cross-language serialization format for communicating (among other things) query plans. It’s still in its early days, but there is already nascent support for Substrait in Apache Arrow, DuckDB, and Velox.

Ibis supports producing Substrait plans from Ibis table expressions, with the help of the ibis-substrait library. Let’s take a quick peek at how we might use it for query execution.

Getting started

First, we can create a conda environment using the latest versions of duckdb, ibis, and ibis_substrait.

mamba create -n ibis_substrait_duckdb ibis-framework==4.1 ibis-substrait==2.19 ipython python-duckdb parsy==2

Next, we’ll need to choose a dataset. For this example, we’ll use data from IMDB, available through their dataset portal.

For convenience, I used Ready, Set, Data! to grab the data in parquet format and then insert it into a DuckDB database.

import duckdb
con = duckdb.connect("/home/gil/imdb.ddb")
con.execute(
    "CREATE TABLE ratings AS SELECT * FROM '/home/gil/data/imdb/imdb_ratings.parquet'"
)
con.execute(
    "CREATE TABLE basics AS SELECT * FROM '/home/gil/data/imdb/imdb_basics.parquet'"
)

Query Creation

For our example, we’ll build up a query using Ibis but without connecting to our execution engine (DuckDB). Once we have an Ibis table expression, we’ll create a Substrait plan, then execute that plan directly on DuckDB to get results.

To do this, all we need is some knowledge of the schema of the tables we want to interact with. We might get these schema from a metadata store, or possibly a coworker, or a friendly mouse.

However we arrive at it, if we know the column names and the datatypes, we can build up a query in Ibis, so let’s do that.

import ibis
from ibis import _

ratings = ibis.table(
    [
        ("tconst", "str"),
        ("averageRating", "str"),
        ("numVotes", "str"),
    ],
    name="ratings",
)

basics = ibis.table(
    [
        ("tconst", "str"),
        ("titleType", "str"),
        ("primaryTitle", "str"),
        ("originalTitle", "str"),
        ("isAdult", "str"),
        ("startYear", "str"),
        ("endYear", "str"),
        ("runtimeMinutes", "str"),
        ("genres", "str"),
    ],
    name="basics",
)

Now that those tables are represented in Ibis, we can start creating our query. We’ll try to recreate the top-ten movies on the IMDB leaderboard. For that, we’ll need movie titles and their respective ratings.

We know that the data we have for ratings looks something like the following:

┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ tconst    ┃ averageRating ┃ numVotes ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ string    │ string        │ string   │
├───────────┼───────────────┼──────────┤
│ tt0000001 │ 5.71919\n   │
│ tt0000002 │ 5.8260\n    │
│ tt0000003 │ 6.51726\n   │
│ tt0000004 │ 5.6173\n    │
│ tt0000005 │ 6.22541\n   │
└───────────┴───────────────┴──────────┘

Based on the column names alone, averageRating is almost certainly supposed to be a float, and numVotes should be an integer. We can cast those so we can make useful comparisons between ratings and vote numbers.

ratings = ratings.select(
    ratings.tconst,
    avg_rating=ratings.averageRating.cast("float"),
    num_votes=ratings.numVotes.cast("int"),
)

The first few rows of basics looks like this:

┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━┓
┃ tconst    ┃ titleType ┃ primaryTitle           ┃ originalTitle          ┃ isAdult ┃ startYear ┃ … ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━┩
│ string    │ string    │ string                 │ string                 │ string  │ string    │ … │
├───────────┼───────────┼────────────────────────┼────────────────────────┼─────────┼───────────┼───┤
│ tt0000001 │ short     │ Carmencita             │ Carmencita             │ 01894      │ … │
│ tt0000002 │ short     │ Le clown et ses chiens │ Le clown et ses chiens │ 01892      │ … │
│ tt0000003 │ short     │ Pauvre Pierrot         │ Pauvre Pierrot         │ 01892      │ … │
│ tt0000004 │ short     │ Un bon bock            │ Un bon bock            │ 01892      │ … │
│ tt0000005 │ short     │ Blacksmith Scene       │ Blacksmith Scene       │ 01893      │ … │
└───────────┴───────────┴────────────────────────┴────────────────────────┴─────────┴───────────┴───┘

In the interest of keeping things family-friendly, we can filter out any adult films. We can filter out any IMDB titles that aren’t movies, then select out the columns tconst and primaryTitle. And we’ll include startYear just in case it’s interesting.

basics = basics.filter([basics.titleType == "movie", basics.isAdult == "0"]).select(
    "tconst",
    "primaryTitle",
    "startYear",
)

With the data (lightly) cleaned up, we can construct our query for top films. We want to join the two tables ratings and basics. Then we’ll order them by avg_rating and num_votes, and include an additional filter that the movie has to have at least 200,000 votes.

topfilms = (
    ratings.join(basics, "tconst")
    .order_by([_.avg_rating.desc(), _.num_votes.desc()])
    .filter(_.num_votes > 200_000)
    .limit(10)
)

Now that we have an Ibis table expression, it’s time for Substrait to enter the scene.

Substrait Serialization

We’re going to import ibis_substrait and compile the topfilms table expression into a Substrait plan.

from ibis_substrait.compiler.core import SubstraitCompiler

compiler = SubstraitCompiler()

plan = compiler.compile(topfilms)

# type(plan) --> <class 'substrait.ibis.plan_pb2.Plan'>

Substrait is built using protobuf. If you look at the repr for plan, you’ll see a LOOOONG JSON-ish representation of the Substrait plan. This representation is not really meant for human eyes.

We’ll serialize the Substrait plan to disk and open it up in a separate session, or on another machine, entirely. That’s one of the notions of Substrait: plans can be serialized and shuttled around between various systems. It’s similar to Ibis in that it allows a separation of plan creation from plan execution.

with open("topfilms.proto", "wb") as f:
    f.write(plan.SerializeToString())

Substrait Plan Execution

Now we can open up the serialized Substrait plan in a new session where we execute it using DuckDB directly. One important point to note here is that our plan refers to two tables, named basics and ratings. If those tables don’t exist in our execution engine, then this isn’t going to work.

import duckdb

con = duckdb.connect("/home/gil/imdb.ddb")

con.execute("PRAGMA show_tables;").fetchall()
basics
ratings

Luckily, they do exist! Let’s install and load the DuckDB Substrait extension, then execute the Substrait plan, and finally grab our results.

con.install_extension("substrait")
con.load_extension("substrait")

with open("topfilms.proto", "rb") as f:
    plan_blob = f.read()

result = con.from_substrait(plan_blob)

result.fetchall()
tt0111161 9.3 2651547 The Shawshank Redemption 1994
tt0068646 9.2 1838044 The Godfather 1972
tt0468569 9.0 2623735 The Dark Knight 2008
tt0167260 9.0 1827464 The Lord of the Rings: The Return of the King 2003
tt0108052 9.0 1343647 Schindler’s List 1993
tt0071562 9.0 1259465 The Godfather Part II 1974
tt0050083 9.0 782903 12 Angry Men 1957
tt0110912 8.9 2029684 Pulp Fiction 1994
tt1375666 8.8 2325417 Inception 2010
tt0137523 8.8 2096752 Fight Club 1999

That looks about right to me. There may be some small differences with the current Top 10 list on IMDB if our data are a little stale.

It’s early days still for Substrait, but it’s exciting to see how far it’s come in the last 18 months!

Why wouldn’t I just use SQL for this?

It’s a fair question. SQL is everywhere, after all.

There are a few reasons we think you shouldn’t ignore Substrait.

Standards

SQL has a standard, but how closely do engines follow the standard? In our experience, queries don’t translate well between engines (this is one reason Ibis exists!)

Extensibility

Substrait is more extensible than SQL. Some DBMS have added in some very cool features, but it usually involves diverging (sometimes widely) from the SQL standard. Substrait has an extension system that allows plan producers and plan consumers to agree on a well-typed and well-defined interaction that exists outside of the core Substrait specification.

Serialization and parsing

Parsing SQL can be a big pain (trust us). If you send a big string over the wire, you need the engine on the other side to have a SQL parser to understand what the message is. Now, obviously, SQL engines have those. But here, again, standards (or lack of adherence to standards) can bite you. Extensibility is also difficult here, because now the SQL parser needs to understand some new custom syntax.

Protobuf is hardly a dream to work with, but it’s a lot easier to consistently define behavior AND to validate that behavior is correct. It’s also smaller than raw text.

Wrapping Up

That’s all for now! To quickly summarize:

Substrait is a new standard for representing relational algebra queries with support in Apache Arrow, DuckDB, Velox, and more (and more to come!).

Ibis can now generate substrait instead of string SQL, letting it take advantage of this new standard.

Interested in substrait or ibis? Docs are available at

and the relevant GitHub repos are

Please feel free to reach out on GitHub!

Back to top