Does Ibis understand SQL?

Last month, an insightful article on the dbt Developer Blog on what SQL comprehension really means came across my LinkedIn feed. The big deal about SDF is that it, unlike dbt, actually understands SQL. As an Ibis user and contributor, several of the concepts covered in the post were familiar—in fact, I first learned about Ibis because the product I was working on required an intermediate representation that could be compiled to Flink SQL code. In that case, as a dataframe library that interfaces with databases, does Ibis also understand SQL?

Tl;dr

Ibis doesn’t understand SQL per se, but it does understand what you’re trying to do. Ibis, much like SQL, defines a standardized interface for working with databases. Because Ibis understands queries expressed through this user interface, it also provides users with some of the unique capabilities SDF offers, including the ability to execute said logic on the backend of the user’s choice.

How does Ibis work?

To answer the question of whether Ibis understands SQL, we first need to understand the internals of Ibis. Specifically, how is the code that users write with Ibis eventually executed on a SQL backend?

Building an expression

Ibis provides a dataframe API for writing expressions. A follow-up article on the dbt Developer Blog on the key technologies behind SQL comprehension used the following SQL query in illustrating what the parser and compiler do:

select x as u from t where x > 0

In SQL, the binder adds type information to the syntax tree produced by the parser. This order of operations differs from the way Ibis works; in Ibis, Nodes— the core operations that can be applied to expressions, such as ibis.expr.operations.Add and ibis.expr.operations.WindowFunction—must be applied to Expr objects containing data type and shape information.

The Table is one of the core Ibis data structures, analogous to a SQL table. It’s also an Expr subclass. We begin by manually defining a Table with our desired schema here, but one can also construct a table from an existing database table, file, or in-memory data representation:

import ibis

t = ibis.table(dict(x="int32", y="float", z="string"), name="t")

Next, we apply a filter and rename the x column as in the SQL query above:

expr = t.filter(t.x > 0).select(t.x.name("u"))

In SQL, the parser translates the query into a syntax tree, but in Ibis, expressions are inherently represented as a tree of Expr and Node objects. Ibis enables users to visualize() this intermediate representation for any expression:

from ibis.expr.visualize import to_graph

to_graph(expr)

In non-interactive mode, pretty-printing the expression yields an equivalent textual representation:

expr

r0 := UnboundTable: t
  x int32
  y float64
  z string

r1 := Filter[r0]
  r0.x > 0

Project[r1]
  u: r1.x

Look at that! Unsurprisingly, the resulting repr() matches the generated logical plan from the SQL comprehension technologies article:

Note that the abstract syntax tree is an artifact of the parsing step and has no direct Ibis analog.

Compiling the expression

In the past, Ibis would compile expressions to SQL using its own SQL generation logic. However, with the completion of “the big refactor” in Ibis 9.0, Ibis fully transitioned to producing SQLGlot expressions under the hood.

We can see the intermediate SQLGlot representation of our expression using the to_sqlglot() method on an Ibis backend compiler implementation:

from ibis.backends.sql.compilers.duckdb import compiler

query = compiler.to_sqlglot(expr)
query

Select(
  expressions=[
    Alias(
      this=Column(
        this=Identifier(this='x', quoted=True),
        table=Identifier(this='t0', quoted=True)),
      alias=Identifier(this='u', quoted=True))],
  from=From(
    this=Table(
      this=Identifier(this='t', quoted=True),
      alias=Identifier(this='t0', quoted=True))),
  where=Where(
    this=GT(
      this=Column(
        this=Identifier(this='x', quoted=True),
        table=Identifier(this='t0', quoted=True)),
      expression=Literal(this=0, is_string=False))))

Ibis delegates SQL generation to SQLGlot, essentially calling the sql() method on the above query:

query.sql()

'SELECT "t0"."x" AS "u" FROM "t" AS "t0" WHERE "t0"."x" > 0'

Ibis also provides a top-level to_sql() method, so most users don’t need to be aware of SQLGlot or the inner workings of Ibis expression compilation—unless they want to:

ibis.to_sql(expr)

SELECT
  "t0"."x" AS "u"
FROM "t" AS "t0"
WHERE
  "t0"."x" > 0

Why SQLGlot?

SQLGlot is a no-dependency SQL parser, transpiler, optimizer, and engine. It’s a widely-used open-source project that powers the SQL comprehension and generation capabilities of tools like SQLMesh, Apache Superset, and Dagster. In fact, SQLGlot is also the engine behind the column-level lineage feature available in dbt Cloud!

The specifics of why Ibis chose SQLGlot are beyond the scope of this article, but you can learn more about the reasoning from a GitHub discussion on moving the SQL backends from SQLAlchemy to SQLGlot.

Executing the compiled expression

Execution is the most straightforward part of the process. For most Ibis-supported SQL backends, the execute() method uses the database connection associated with the backend instance—usually managed by the underlying Python client library—to submit and fetch results for the compiled query. Last but not least, Ibis massages the returned data into the desired format (e.g. a pandas DataFrame for easy consumption). Because this final processing step falls outside the expression-understanding process, we’ll end our journey through the Ibis execution flow here.

Note that the database still performs all of the activities covered by the SQL comprehension levels; the database is completely oblivious to whatever Ibis did prior to providing the raw SQL to execute.

So, does Ibis understand SQL?

Ibis’s expressive dataframe API lets users avoid writing handcrafted SQL queries in most situations, but there are still cases where you may need to use SQL strings in your Ibis code. Table.sql() and Backend.sql() work very similarly to each other. To explore both of these options, we’ll need to initialize a backend connection (DuckDB works perfectly for this purpose) and register our table:

con = ibis.duckdb.connect()
t = con.create_table("t", t)

Now we can use the Table.sql() method to handle our SQL query:

expr = t.sql("select x as u from t where x > 0")
expr

r0 := DatabaseTable: memory.main.t
  x int32
  y float64
  z string

SQLStringView[r0]
  query:
    select x as u from t where x > 0
  schema:
    u int32

Note that the expression tree doesn’t have the same level of detail as the one we built using the dataframe API in the first section. Instead, a SQLStringView node encapsulates the query. Ibis only understands the output schema for the operation, which it uses to validate any downstream operations.

The main difference between Table.sql() and Backend.sql() is that Backend.sql() can only refer to tables that already exist in the database. We see that reflected in the expression tree repr(); the DatabaseTable node is not present in the Backend.sql() case, and a SQLQueryResult node (that does not contain a reference to another Ibis relation) replaces the SQLStringView node:

expr = con.sql("select x as u from t where x > 0")
expr

SQLQueryResult
  query:
    select x as u from t where x > 0
  schema:
    u int32

That said, the intermediate SQLGlot representations are identical for both alternatives:

con.compiler.to_sqlglot(expr)

Select(
  expressions=[
    Alias(
      this=Column(
        this=Identifier(this=x, quoted=False)),
      alias=Identifier(this=u, quoted=False))],
  from=From(
    this=Table(
      this=Identifier(this=t, quoted=False))),
  where=Where(
    this=GT(
      this=Column(
        this=Identifier(this=x, quoted=False)),
      expression=Literal(this=0, is_string=False))))

In fact, the resulting SQLGlot expression is only slightly, nonfunctionally different from the one we got from the dataframe API. Similarly, the compiled SQL is identical except for the table alias:

ibis.to_sql(expr)

SELECT
  x AS u
FROM t
WHERE
  x > 0

In short, while Ibis doesn’t understand the inner structure of SQL queries passed to Table.sql() or Backend.sql(), it still compiles to SQLGlot expressions under the hood, which means SQLGlot’s capabilities (like producing column-level lineage) are still applicable.

The third option for writing SQL strings in Ibis, Backend.raw_sql(), is opaque and only exists for situations where the user needs to run arbitrary SQL code. Ibis simply executes the code and returns the associated cursor; it does not attempt to understand the SQL.