import ibis
= ibis.table(dict(one="string", two="float", three="int32"), name="my_data")
t t
UnboundTable: my_data one string two float64 three int32
WHERE
GROUP BY
ORDER BY
LIMIT
and OFFSET
DISTINCT
expressionsWe recommend starting with the default (DuckDB) backend for a performant, fully-featured local experience. You can install Ibis with pip
, conda
, mamba
, or pixi
.
Note that the ibis-framework
package is not the same as the ibis
package in PyPI. These two libraries cannot coexist in the same Python environment, as they are both imported with the ibis
module name.
Ibis provides a full-featured replacement for SQL SELECT
queries, but expressed with Python code that is:
We intend for all SELECT
queries to be fully portable to Ibis. Coverage of other DDL statements (e.g. CREATE TABLE
or INSERT
) may vary from engine to engine.
If you find any SQL idioms or use cases in your work that are not represented here, please reach out so we can add more to this guide!
All tables in Ibis are immutable. To select a subset of a table’s columns, or to add new columns, you must produce a new table by means of a projection.
UnboundTable: my_data one string two float64 three int32
In SQL, you might write something like:
In Ibis, this is
This generates the expected SQL:
What about adding new columns?
The last expression is written:
Now, we have:
mutate
: Add or modify columns easilySince adding new columns or modifying existing columns is so common, there is a convenience method mutate
:
Notice that using the name
was not necessary here because we’re using Python keyword arguments to provide the name:
If you modify an existing column with mutate
it will list out all the other columns:
SELECT *
equivalentEspecially in combination with relational joins, it’s convenient to be able to select all columns in a table using the SELECT *
construct. To do this, use the table expression itself in a projection:
This is how mutate
is implemented. The example above t.mutate(new_col=t.three * 2)
can be written as a normal projection:
Let’s consider a table we might wish to join with t
:
Now let’s take the SQL:
To write this with Ibis, it is:
And verify the generated SQL:
If you pass a function instead of a string or Ibis expression in any projection context, it will be invoked with the “parent” table as its argument. This can help significantly when [composing complex operations. Consider this SQL:
SELECT one, avg(abs(the_sum)) AS mad
FROM (
SELECT one, three, sum(two) AS the_sum
FROM my_data
GROUP BY 1, 2
) t0
GROUP BY 1
This can be written as one chained expression:
Here’s the SQL:
WHERE
You can add filter clauses to a table expression by using the filter
method:
filter
can take multiple expressions, which must all be satisfied for a row to appear in the result:
To compose boolean expressions with AND
or OR
, use the respective &
and |
operators:
Note the parentheses around the second expression. These are necessary due to operator precedence.
GROUP BY
To aggregate a table, you need:
Let’s look at the aggregate
method on tables:
If you don’t use any group expressions, the result will have a single row with your statistics of interest:
To add groupings, use either the by
argument of aggregate
or use the group_by
construct:
You can use any expression (or function, like in projections) deriving from the table you are aggregating. The only constraint is that the expressions must be named. Let’s look at an example:
Suppose we wanted to total up event types by year and month:
Now we have:
In analytics is it common to compare statistics from different subsets of a table. Let’s consider a dataset containing people’s name, age, gender, and nationality:
Now, suppose you wanted to know for each country:
In SQL, you may write:
SELECT
country,
count(*) AS num_persons,
AVG(age) AS avg_age
AVG(
CASE WHEN gender = 'M' THEN age
ELSE NULL
END
) AS avg_male,
AVG(
CASE WHEN gender = 'F' THEN age
ELSE NULL
END
) AS avg_female,
FROM population
GROUP BY 1
Ibis makes this much simpler by giving you where
option in aggregation functions:
This indeed generates the correct SQL. Note that SQL engines handle NULL
values differently in aggregation functions, but Ibis will write the SQL expression that is correct for your query engine.
Table.count()
:Computing group frequencies can be done by calling count()
after calling group_by()
:
value_counts
Consider the SQL idiom:
This is so common that, like pandas, there is a generic column method value_counts
which does this:
HAVING
clauseThe SQL HAVING
clause enables you to filter the results of an aggregation based on some group-wise condition holding true. For example, suppose we wanted to limit our analysis to groups containing at least 1000 observations:
With Ibis, you can do:
ORDER BY
To sort a table, use the order_by
method along with either column names or expressions that indicate the sorting keys:
The default for sorting is in ascending order. To reverse the sort direction of any key, wrap it in ibis.desc
:
LIMIT
and OFFSET
The table limit
function truncates a table to the indicates number of rows. So if you only want the first 1000 rows (which may not be deterministic depending on the SQL engine), you can do:
The offset
option in limit
skips rows. So if you wanted rows 11 through 20, you could do:
See the full API documentation for all of the available value methods and tools for creating value expressions.
We mention a few common ones here as they relate to common SQL queries.
Ibis’s type system is independent of any SQL system. You cast Ibis expressions from one Ibis type to another. For example:
CASE
statementsSQL dialects typically support one or more kind of CASE
statements. The first is the simple case that compares against exact values of an expression.
Value expressions in Ibis have a case
method that allows us to emulate these semantics:
The more general case is that of an arbitrary list of boolean expressions and result values:
CASE
WHEN boolean_expr1 THEN result_1
WHEN boolean_expr2 THEN result_2
WHEN boolean_expr3 THEN result_3
ELSE default
END
To do this, use ibis.cases
:
There are several places where Ibis builds cases for you in a simplified way. One example is the ifelse
function:
NULL
in expressionsTo use NULL
in an expression, use ibis.null()
value:
IN
/ NOT IN
Let’s look again at the population dataset. Suppose you wanted to combine the United States and Canada data into a “North America” category. Here would be some SQL to do it:
CASE
WHEN UPPER(country) IN ('UNITED STATES', 'CANADA') THEN 'North America'
ELSE country
END AS refined_group
The Ibis equivalent of IN
is the isin
method. So we can write:
The opposite of isin
is notin
.
Consider a SQL expression like:
which is equivalent to
To build expressions off constant values, you must first convert the value (whether a Python string or number) to an Ibis expression using ibis.literal
:
Once you’ve done this, you can use the literal expression like any other array or scalar expression:
In many other situations, you can use constants without having to use ibis.literal
. For example, we could add a column containing nothing but the number 5 like so:
IS NULL
and IS NOT NULL
These are simple: use the isnull
and notnull
functions respectively, which yield boolean arrays:
BETWEEN
The between
method on arrays and scalars compiles to the SQL BETWEEN
keyword. The result of between
is boolean and can be used with any other boolean expression:
Ibis supports several kinds of joins between table expressions:
inner_join
: maps to SQL INNER JOIN
cross_join
: a cartesian product join with no keys. Equivalent to inner_join
with no join predicatesleft_join
: maps to SQL LEFT OUTER JOIN
outer_join
: maps to SQL FULL OUTER JOIN
semi_join
: maps to SQL LEFT SEMI JOIN
. May or may not be an explicit join type in your query engine.anti_join
: maps to SQL LEFT ANTI JOIN
. May or may not be an explicit join type in your query engine.The join
table method is by default the same as inner_join
.
Let’s look at a couple example tables to see how joins work in Ibis:
Let’s join on one key:
The immediate result of a join does not yet have a set schema. That is determined by the next action that you take. There’s several ways forward from here that address a variety of SQL use cases.
Consider the SQL:
After one or more joins, you can reference any of the joined tables in a projection immediately after:
If you need to compute an expression that involves both tables, you can do that also:
You can directly aggregate a join without need for projection, which also allows you to form statistics that reference any of the joined tables.
Consider this SQL:
SELECT t0.key1, AVG(t0.value1 - t1.value2) AS avg_diff
FROM table1 t0
LEFT OUTER JOIN table2 t1
ON t0.key1 = t1.key3
GROUP BY 1
As you would hope, the code is as follows:
SELECT *
If you try to compile or execute a join that has not been projected or aggregated, it will be fully materialized:
You can join multiple tables together in succession without needing to address any of the above concerns.
t3 = ibis.table(dict(value3="float", key5="string"), name="table3")
total = (t1.value1 + t2.value2 + t3.value3).sum()
expr = (
t1.join(t2, [t1.key1 == t2.key3, t1.key2 == t2.key4])
.join(t3, t1.key1 == t3.key5)
.group_by([t2.key4, t3.key5])
.aggregate(total=total)
)
ibis.to_sql(expr)
SELECT
"t6"."key4",
"t6"."key5",
SUM("t6"."value1" + "t6"."value2" + "t6"."value3") AS "total"
FROM (
SELECT
"t3"."value1",
"t3"."key1",
"t3"."key2",
"t4"."value2",
"t4"."key3",
"t4"."key4",
"t5"."value3",
"t5"."key5"
FROM "table1" AS "t3"
INNER JOIN "table2" AS "t4"
ON "t3"."key1" = "t4"."key3" AND "t3"."key2" = "t4"."key4"
INNER JOIN "table3" AS "t5"
ON "t3"."key1" = "t5"."key5"
) AS "t6"
GROUP BY
1,
2
What about when you need to join a table on itself? For example:
SELECT t0.one, avg(t0.two - t1.three) AS metric
FROM my_data t0
INNER JOIN my_data t1
ON t0.one = t1.one
GROUP BY 1
The table view
method enables you to form a self-reference that is referentially distinct in expressions. Now you can proceed normally:
t_view = t.view()
stat = (t.two - t_view.three).mean()
expr = (
t.join(t_view, t.three.cast("string") == t_view.one)
.group_by(t.one)
.aggregate(metric=stat)
)
ibis.to_sql(expr)
SELECT
"t3"."one",
AVG("t3"."two" - "t3"."three_right") AS "metric"
FROM (
SELECT
"t1"."one",
"t1"."two",
"t1"."three",
"t2"."one" AS "one_right",
"t2"."two" AS "two_right",
"t2"."three" AS "three_right"
FROM "my_data" AS "t1"
INNER JOIN "my_data" AS "t2"
ON CAST("t1"."three" AS TEXT) = "t2"."one"
) AS "t3"
GROUP BY
1
In many cases the columns being joined between two tables or table expressions have the same name. Consider this example:
In these case, we can specify a list of common join keys:
You can mix the overlapping key names with other expressions:
joined = t4.join(t5, ["key1", "key2", t4.key3.left(4) == t5.key3.left(4)])
expr = joined.select(t4, t5.value2)
ibis.to_sql(expr)
SELECT
"t2"."key1",
"t2"."key2",
"t2"."key3",
"t2"."value1",
"t3"."value2"
FROM "table4" AS "t2"
INNER JOIN "table5" AS "t3"
ON "t2"."key1" = "t3"."key1"
AND "t2"."key2" = "t3"."key2"
AND SUBSTRING(
"t2"."key3",
CASE WHEN (
0 + 1
) >= 1 THEN 0 + 1 ELSE 0 + 1 + LENGTH("t2"."key3") END,
4
) = SUBSTRING(
"t3"."key3",
CASE WHEN (
0 + 1
) >= 1 THEN 0 + 1 ELSE 0 + 1 + LENGTH("t3"."key3") END,
4
)
You can join tables with boolean clauses that are not equality. Some query engines support these efficiently, some inefficiently, or some not at all. In the latter case, these conditions get moved by Ibis into the WHERE
part of the SELECT
query.
You can also pass a list of column names instead of forming boolean expressions:
Ibis creates inline views and nested subqueries automatically. This section concerns more complex subqueries involving foreign references and other advanced relational algebra.
EXISTS
/ NOT EXISTS
filtersThe SQL EXISTS
and NOT EXISTS
constructs are typically used for efficient filtering in large many-to-many relationships.
Let’s consider a web dataset involving website session / usage data and purchases:
Now, the key user_id
appears with high frequency in both tables. But let’s say you want to limit your analysis of the events
table to only sessions by users who have made a purchase.
In SQL, you can do this using the somewhat esoteric EXISTS
construct:
SELECT t0.*
FROM events t0
WHERE EXISTS (
SELECT 1
FROM purchases t1
WHERE t0.user_id = t1.user_id
)
To describe this operation in Ibis, you compare the user_id
columns and use the any
reduction:
This can now be used to filter events
:
If you negate the condition, it will instead give you only event data from user that have not made a purchase:
IN
/ NOT IN
Subquery filters with IN
(and NOT IN
) are functionally similar to EXISTS
subqueries. Let’s look at some SQL:
This is almost semantically the same as the EXISTS
example. Indeed, you can write with Ibis:
Depending on the query engine, the query planner/optimizer will often rewrite IN
or EXISTS
subqueries into the same set of relational algebra operations.
Sometime you want to compare a value with an unconditional aggregate value from a different table. Take the SQL:
With Ibis, the code is simpler and more pandas-like:
Suppose you want to compare a value with the aggregate value for some common group values between two tables. Here’s some SQL:
SELECT *
FROM table1 t0
WHERE value1 > (
SELECT avg(value2)
FROM table2 t1
WHERE t0.key1 = t1.key3
)
This query computes the average for each distinct value of key3
and uses the corresponding average for the comparison, rather than the whole-table average as above.
With Ibis, the code is similar, but you add the correlated filter to the average statistic:
DISTINCT
expressionsIn SQL, the DISTINCT
keyword is used in a couple of ways:
SELECT
statementIbis supports both use cases. So let’s have a look. The first case is the simplest: call distinct
on a table expression. First, here’s the SQL:
And the Ibis Python code:
For distinct aggregates, the most common case is COUNT(DISTINCT ...)
, which computes the number of unique values in an expression. So if we’re looking at the events
table, let’s compute the number of distinct event_type
values for each user_id
. First, the SQL:
In Ibis this is:
Window functions in SQL allow you to write expressions that involve possibly-ordered groups of a table. Each window function involves one of the following:
LEAD
, LAG
, NTILE
, and others.PARTITION BY
clause. This may be omitted.ORDER BY
clause. This may be omitted for many functions.So you may see SQL like:
Or simply
Ibis will automatically write window clauses when you use aggregate functions in a non-aggregate context. Suppose you wanted to subtract the mean of a column from itself:
If you use mutate
in conjunction with group_by
, it will add a PARTITION BY
to the OVER
specification:
For functions like LAG
that require an ordering, we can add an order_by
call:
For more precision, you can create a Window
object that also includes a window frame clause:
A common SQL idiom is the “top-K” or “top-N” operation: subsetting a dimension by aggregate statistics:
Ibis has a special analytic expression topk
:
This can be evaluated directly, yielding the above query:
topk
has a by
parameter that allows you to pass in an expression to compute the topk results by. This allows you to define more complex topk like the following:
This gives you the top 10 users with the highest total purchase amounts.
Furthermore, you can compute topk over tables by using the special SQL function, ROW_NUMBER
. Ibis provides a row_number()
function that allows you to do this:
The output of this is a table with the three most expensive items that each user has purchased (or, in plain SQL translation: the top three items per user_id that have the maximum price).
See Timestamp methods <api.timestamp>
for a table of available date/time methods.
For example, we can do:
In many cases, you can convert string values to datetime / timestamp with strings.cast('timestamp')
, but you may have to do some more reconnaissance into the data if this does not work.
Ibis has a set of interval APIs that allow you to do date/time arithmetic. For example:
The implementation of each timedelta offset will depend on the query engine.
To appear.
Set operations are a common SQL idiom for combining or comparing rows from different relations. Ibis supports the following set operations:
SQL dialects often support two kinds of UNION
operations:
UNION
: the combination of distinct rows from each table.UNION ALL
: the combination of all rows from each table, whether or not they are distinct.The Ibis union
function by default is a UNION ALL
, and you can set distinct=True
to get the normal UNION
behavior:
SQL dialects support INTERSECT
and EXCEPT
operators that returns rows that are in both tables or only in one table, respectively.
In Ibis, we can perform these operations using intersection
and difference
.
This area will be the spillover for miscellaneous SQL concepts and how queries featuring them can be ported to Ibis.
The simplest SQL CTE is a SQL statement that is used multiple times in a SELECT
query, which can be “factored” out using the WITH
keyword:
WITH t0 AS (
SELECT region, kind, sum(amount) AS total
FROM purchases
GROUP BY 1, 2
)
SELECT t0.region, t0.total - t1.total
FROM t0
INNER JOIN t0 t1
ON t0.region = t1.region
WHERE t0.kind = 'foo' AND t1.kind = 'bar'
Explicit CTEs are not necessary with Ibis. Let’s look at an example involving joining an aggregated table on itself after filtering:
purchases = ibis.table(
dict(region="string", kind="string", user="int64", amount="float"),
name="purchases",
)
metric = purchases.amount.sum().name("total")
agged = purchases.group_by(["region", "kind"]).aggregate(metric)
left = agged.filter(agged.kind == "foo")
right = agged.filter(agged.kind == "bar")
result = (
left.join(right, left.region == right.region)
.select(left.region, (left.total - right.total).name("diff"))
)
Ibis automatically creates a CTE for agged
:
WITH "t1" AS (
SELECT
"t0"."region",
"t0"."kind",
SUM("t0"."amount") AS "total"
FROM "purchases" AS "t0"
GROUP BY
1,
2
)
SELECT
"t5"."region",
"t5"."total" - "t6"."total" AS "diff"
FROM (
SELECT
*
FROM "t1" AS "t2"
WHERE
"t2"."kind" = 'foo'
) AS "t5"
INNER JOIN (
SELECT
*
FROM "t1" AS "t2"
WHERE
"t2"."kind" = 'bar'
) AS "t6"
ON "t5"."region" = "t6"."region"