The best benchmark is your own workload on your own data.
Key considerations
The purpose of this post is to explore some benchmarking data with Ibis. We’ll compare three modern single-node query engines, explore the Ibis API as a great choice for each of them, and discuss the results.
The TPC-H benchmark is a standard benchmark for databases and, increasingly, dataframes! It consists of 22 queries over 8 tables, designed to test a query engine's performance on a variety of tasks including filtering, aggregation, and joins. The benchmark defines the queries as SQL; we run that SQL and equivalent dataframe code via the Ibis and Polars APIs.
The data for the benchmark can be generated at any scale factor, which roughly corresponds to the size of the data in memory in gigabytes. For instance, a scale factor of 10 would be about 10GB of data in memory.
The engines, the API, the code
We’ll use three modern single-node OLAP engines (DuckDB, DataFusion, Polars) with the Ibis API via two coding paradigms (dataframe and SQL). Ibis provides a consistent API across 20+ backends, including these three. We run SQL code through Ibis in addition to dataframe code to get a sense of any overhead in Ibis dataframe code.
Note
Ibis dataframe code generates SQL for the DuckDB and DataFusion backends and generates Polars API dataframe code for the Polars backend (a quick illustration follows this note).
If you’re interested in contributing to Ibis, a new backend like chDB could be a great project for you!
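To illustrate the note above, you can inspect the SQL Ibis generates for a backend with ibis.to_sql. This is a quick sketch on a toy in-memory table, not part of the benchmark:

import ibis

t = ibis.memtable({"a": [1, 1, 2]})
expr = t.group_by("a").agg(n=t.count())

# the same expression compiles to different SQL dialects
print(ibis.to_sql(expr, dialect="duckdb"))
print(ibis.to_sql(expr, dialect="datafusion"))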
9 of the 22 queries fail for Ibis with the Polars backend due to lack of scalar subquery support. Because of this, and because SQL support in Polars is still relatively experimental, we've opted to also run the Polars API directly in this iteration of the benchmark. This uses the LazyFrame API without the streaming engine (per the Polars team's recommendation). It also lets us compare the Polars backend through Ibis with the Polars API directly for the queries that do succeed.
Failing queries
Queries fail for one of two reasons:
1. The query doesn't work in the given system
2. The query otherwise failed on a given run
We’ll note the cases of the first below. The second is usually due to memory pressure and will be seen at higher scale factors throughout the data.
Queries 11, 13-17, and 20-22 fail for the Polars backend via Ibis dataframe code. These all fail due to lack of scalar subquery support in the backend. I’ve opened an issue for tracking and discussion.
Interested in contributing?
Increasing coverage of operations for a backend is a great place to start!
How queries are written
See the source code for the exact queries used in this iteration of the benchmark. Polars recently updated their TPC-H queries, so the next iteration of this benchmark will use those.
Queries were adapted from Ibis TPC-H queries and Polars TPC-H queries. The first 10 Ibis dataframe queries were translated from the Polars dataframe queries, while the rest were directly adapted from the Ibis repository. The SQL strings were adapted from the Ibis repository.
Some notes on methodology:
- data is always downloaded onto the compute instance (no cloud storage reads)
- decimal types are converted to floats after reading
  - this works around several issues
  - in the next iteration of this benchmark, we'll use the decimal type
- each query is run three times per configuration (system, scale factor, instance type)
- we measure the time to write the results of the query to a Parquet file
  - this includes reading the Parquet file(s) and executing the query
Biases
My name is Cody and I’m a Senior Technical Product Manager at Voltron Data. I am a contributor to the Ibis project and employed to work on it – I’m biased in favor of Ibis and the composable data ecosystem.
We’ll use Ibis to analyze some of the benchmarking data.
Tip
We’ll only look at a small subset of the data in this post.
All the data is public, so you can follow along with the code and explore the data yourself. You can also see the Ibis benchmarking Streamlit app for further analysis.
Reading the data
To follow along, install the required Python packages:
pip install gcsfs 'ibis-framework[duckdb]' plotly
The data is stored in a public Google Cloud Storage (GCS) bucket:
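A sketch of how the data can be read with Ibis and gcsfs. The bucket path below is a placeholder, not the actual bucket name (see the benchmark repository for the real location), and the instance_details table (with the instance_types and cpu_types lists used later) comes from a small table of instance metadata not shown here:

import ibis
import gcsfs
import plotly.express as px

ibis.options.interactive = True

con = ibis.duckdb.connect()
con.register_filesystem(gcsfs.GCSFileSystem())

# placeholder path -- substitute the public bucket from the benchmark repository
t = con.read_parquet("gs://<bucket>/<path>/*.parquet", table_name="cache")

# helper lists used in the plots below, derived the same way as `systems` later on
sfs = sorted(t.distinct(on="sf")["sf"].to_pyarrow().to_pylist())
query_numbers = sorted(t.distinct(on="query_number")["query_number"].to_pyarrow().to_pylist())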
A scale factor is roughly the size of the data in memory in gigabytes. For example, a scale factor of 1 means the data is roughly 1GB in memory.
Stored on disk in (compressed) Parquet format, the data is smaller – about 0.38GB for scale factor 1 with the compression settings used in this benchmark.
We can look at the total execution time by scale factor:
c = px.bar( t.group_by("sf").agg(total_seconds=t["execution_seconds"].sum()), x="sf", y="total_seconds", category_orders={"sf": sfs}, title="total execution time by scale factor",)c
We can look at the total execution time by query number:
c = px.bar(
    t.group_by("query_number").agg(total_seconds=t["execution_seconds"].sum()),
    x="query_number",
    y="total_seconds",
    category_orders={"query_number": query_numbers},
    title="total execution time by query number",
)
c
This gives us a sense of the relative complexity of the queries.
An instance type identifies the hardware the benchmark was run on: two MacBook Pro laptops (one work and one personal) and a number of Google Cloud Compute Engine instances.
For cloud VMs, the instance type is of the form <family>-<type>-<cores>, where:
- <family> specifies the CPU architecture (an Intel or AMD processor family)
- <type> modifies the CPU-to-memory ratio (only standard is used, with a 1:4 ratio of vCPUs to GB of memory)
- <cores> is the number of vCPUs
For example, n2d-standard-2 is a Google Cloud Compute Engine instance with an AMD EPYC processor, 2 vCPUs, and 8GB of memory.
We can look at the total execution time by instance type:
c = px.bar(
    t.group_by("instance_type")
    .agg(total_seconds=t["execution_seconds"].sum())
    .join(instance_details, "instance_type"),
    x="instance_type",
    y="total_seconds",
    color="cpu_type",
    hover_data=["cpu_cores", "memory_gbs"],
    category_orders={
        "instance_type": instance_types,
        "cpu_type": cpu_types,
    },
    title="total execution time by instance type",
)
c
Unsurprisingly, this is inversely correlated with the number of CPU cores and (crucially) memory:
c = px.bar(
    instance_details,
    x="instance_type",
    y="memory_gbs",
    color="cpu_type",
    hover_data=["cpu_cores", "memory_gbs"],
    category_orders={
        "instance_type": instance_types,
        "cpu_type": cpu_types,
    },
    title="memory by instance type",
)
c
We ran on the following systems:
systems = sorted(t.distinct(on="system")["system"].to_pyarrow().to_pylist())
systems
For convenience in this benchmark, a 'system' is named using a hyphen-separated convention:
- ibis-*: the Ibis API was used
  - ibis-<backend>: Ibis dataframe code was used with the given backend
  - ibis-<backend>-sql: SQL code was run via Ibis on the given backend
- polars-*: the Polars API was used
  - polars-lazy: Polars was used with the LazyFrame API
We can look at the total execution time by system:
c = px.bar(
    t.group_by("system").agg(
        total_seconds=t["execution_seconds"].sum(),
        seconds_per_query=t["execution_seconds"].mean(),
    ),
    x="system",
    y="total_seconds",
    color="system",
    category_orders={"system": systems},
    title="total execution time by system",
)
c
This can be misleading!
At this point, we have to dig deeper into the data to understand the takeaways. You might look at the above and think ibis-polars is the fastest all-around, but it’s not! Recall 9/22 queries for the Polars backend are failing, and at larger scale factors we start to see several systems fail queries due to memory pressure.
Execution time by system, scale factor, instance type, and query
We’ll aggregate the data over the dimensions we care about:
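A minimal sketch of the aggregation, assuming the column names used above (the exact code is in the benchmark repository):

agg = (
    t.group_by("system", "instance_type", "sf", "query_number")
    .agg(mean_execution_seconds=t["execution_seconds"].mean())
    .join(instance_details, "instance_type")
)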
There’s a lot of data and it’s difficult to visualize all at once. We’ll build up our understanding with a few plots.
We define a helper function, timings_plot, to filter the aggregated data and plot execution times:
def timings_plot(
    agg,
    sf_filter=128,
    systems_filter=systems,
    instances_filter=[instance for instance in instance_types if "laptop" in instance],
    queries_filter=query_numbers,
    log_y=True,
):
    data = (
        agg.filter(agg["sf"] == sf_filter)
        .filter(agg["system"].isin(systems_filter))
        .filter(agg["instance_type"].isin(instances_filter))
        .filter(agg["query_number"].isin(queries_filter))
    )

    c = px.bar(
        data,
        x="query_number",
        y="mean_execution_seconds",
        log_y=log_y,
        color="system",
        barmode="group",
        pattern_shape="instance_type",
        category_orders={
            "system": systems,
            "instance_type": instance_types,
        },
        hover_data=["cpu_type", "cpu_cores", "memory_gbs"],
        title=f"sf: {sf_filter}",
    )

    return c
First, let’s visualize execution time for a given scale factor, system, query, and family of instance types:
sf_filter = 128

systems_filter = ["ibis-duckdb", "ibis-duckdb-sql"]
instances_filter = [
    instance for instance in instance_types if instance.startswith("n2d")
]
queries_filter = [1]
log_y = False

timings_plot(agg, sf_filter, systems_filter, instances_filter, queries_filter, log_y)
From left to right, we have increasing instance resources (CPU cores and memory – you can hover over the data to see the details). You can also zoom and select specific labels to focus on. We notice that, as expected, queries execute faster when given more resources.
ibis-duckdb is running the TPC-H queries written as Ibis dataframe code. The ibis-duckdb-sql system is running the same queries but written as SQL code passed into .sql() in Ibis as strings. The intent is to see if Ibis dataframe code is introducing any significant overhead. While ideally we’d run on the backend’s Python client without Ibis in the mix, this keeps the benchmarking process simple and should serve as a decent proxy.
In this case, we do see that Ibis dataframe code is adding some overhead. But, this is a single data point – let’s expand to the first 7 queries:
Logging the y-axis
From here, we’ll set log_y=True due to the wide range of execution times.
We also look at the first 7 queries due to limited horizontal space on this website. Analyze and visualize the data yourself for all 22 queries! Or see the Ibis benchmarking Streamlit app.
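A sketch of the corresponding plot call, reusing the filters defined above (the exact filter values are an assumption):

queries_filter = list(range(1, 8))
log_y = True

timings_plot(agg, sf_filter, systems_filter, instances_filter, queries_filter, log_y)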
Ibis dataframe code is a little faster overall, but this is on a subset of queries and scale factors and instance types. More analysis and profiling would be needed to make a definitive statement, but in general we can be happy that DuckDB does a great job optimizing the SQL Ibis generates and that Ibis dataframe code isn’t adding significant overhead.
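The same comparison for the DataFusion systems might look like this (the systems list is inferred from the discussion below):

systems_filter = ["ibis-datafusion", "ibis-datafusion-sql"]

timings_plot(agg, sf_filter, systems_filter, instances_filter, queries_filter, log_y)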
This time Ibis dataframe code is a bit slower overall. However, also notice two queries are missing from ibis-datafusion-sql. These are query 7 on n2d-standard-2 and n2d-standard-4 (the two instances with the least memory). We’ll investigate failing queries more thoroughly in the next section.
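And for the Polars systems (again, the systems list is an assumption based on the surrounding text):

systems_filter = ["ibis-polars", "polars-lazy"]

timings_plot(agg, sf_filter, systems_filter, instances_filter, queries_filter, log_y)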
A lot of queries are missing from ibis-polars and polars-lazy. These are failing due to the high scale factor and limited memory on the instances.
Let’s look at a lower scale factor and my MacBooks (Polars tended to perform better on these):
sf_filter = 64

instances_filter = [
    instance for instance in instance_types if "laptop" in instance
]

timings_plot(agg, sf_filter, systems_filter, instances_filter, queries_filter, log_y)
We see a similar pattern as above – some queries are a little faster on ibis-polars, though some are much slower. In particular queries 1 and 2 tend to have a lot of overhead.
There is a lot of data and it’s easy to summarize and visualize it in a way that favors a given system. There’s a lot of missing data that needs to be accounted for, as it often indicates a query that failed due to memory pressure.
Within each system, from left to right, we have decreasing resources (vCPUs and memory). We can see how each system performs on the same queries with different resources.
Data is aggregated
Keep in mind the data is aggregated over three runs of each query. For DuckDB, there was actually a single failure for query 9 on the smallest instance (one of six runs across the two DuckDB systems), but it doesn't appear above because a query counts as succeeding if any of its three runs per configuration succeeded.
If you have any questions or concerns, feel free to open an issue or comment on this post below.
Which system is best?
Trick question! It depends on your use case. DuckDB is a simple, performant in-process database with an on-disk file format (SQLite for OLAP). DataFusion is an extensible query engine and is often used for building databases or query engines. Polars is an OLAP query engine with a Python dataframe API that can be used as a more performant alternative to pandas.
All three make great Ibis backends and you can switch between them in a single line of code. This lets you write your code once and run it on the engine that’s best for your use case. If a better engine comes along you’ll likely be able to use that too. And you can scale up and out across the 20+ backends Ibis supports as needed.
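For instance, switching engines is a one-line change. A quick sketch (the Parquet path is hypothetical):

import ibis

con = ibis.duckdb.connect()  # or ibis.datafusion.connect(), or ibis.polars.connect()

lineitem = con.read_parquet("lineitem.parquet")  # hypothetical local file
lineitem.group_by("l_returnflag").agg(n=lineitem.count()).to_pandas()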
TPC-H is a decent benchmark for what it benchmarks, which is limited. We’re not running window functions, doing timeseries analysis, or feature engineering for machine learning. We’re not using nested data types. We’re not performing regexes or using LLMs in UDFs…
It’s easy to summarize and visualize benchmarking data in a way that favors a given system. You should favor the system that works best for your use case.
Performance converges over time
Let’s look at some quotes from “Perf is not enough” by Jordan Tigani of MotherDuck:
If you take a bunch of databases, all actively maintained, and iterate them out a few years, performance is going to converge. If Clickhouse is applying a technique that gives it an advantage for scan speed today, Snowflake will likely have that within a year or two. If Snowflake adds incrementally materialized views, BigQuery will soon follow. It is unlikely that important performance differences will persist over time.
As clever as the engineers working for any of these companies are, none of them possess any magic incantations or things that cannot be replicated elsewhere. Each database uses a different bag of tricks in order to get good performance. One might compile queries to machine code, another might cache data on local SSDs, and a third might use specialized network hardware to do shuffles. Given time, all of these techniques can be implemented by anyone. If they work well, they likely will show up everywhere.
This is extra true for open source databases (or query engines). If DuckDB adds a feature that improves performance, it’s likely that DataFusion and Polars will follow suit – they can go read the source code and specific commits to see how it was done.
You can also reproduce the benchmark yourself. Assuming you've installed ibis-bench from the benchmark repository, run the queries at scale factor 1 across all systems with:

bench run -s 1 ibis-duckdb ibis-duckdb-sql ibis-datafusion ibis-datafusion-sql ibis-polars polars-lazy
Congratulations! You’ve run a TPC-H benchmark on DuckDB (Ibis dataframe code and SQL), DataFusion (Ibis dataframe code and SQL), and Polars (dataframe code via Ibis and native Polars).
What just happened?
The command first generates TPC-H data at scale factor 1 as Parquet files in the tpch_data directory:
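For example, you can list the generated files (the exact directory layout is determined by ibis-bench; this just globs for Parquet files):

import glob

sorted(glob.glob("tpch_data/**/*.parquet", recursive=True))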
The scale factor is roughly the size of data in memory in gigabytes (GBs). The size of data on disk, however, is smaller because Parquet is compressed. We can take a look at the size of the data:
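A small sketch for checking the on-disk size (standard library only; nothing here is specific to ibis-bench):

import os

def total_size_gb(path: str) -> float:
    """Sum the size of all files under `path`, in gigabytes."""
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / (1024**3)

total_size_gb("tpch_data")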
The queries are also defined in ibis_bench.queries. Let’s look at query 4 as an example for Ibis dataframe code, Polars dataframe code, and SQL code via Ibis:
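The exact implementations live in ibis_bench.queries; each paradigm defines its own q4. Below are sketches of what the Ibis dataframe and Polars dataframe versions roughly look like. The reading code and table paths are assumptions, and the query bodies are adapted from the TPC-H definition rather than copied from the benchmark source:

import ibis

con = ibis.duckdb.connect()

# hypothetical paths into the generated data
lineitem = con.read_parquet("tpch_data/**/lineitem/*.parquet")
orders = con.read_parquet("tpch_data/**/orders/*.parquet")


# Ibis dataframe sketch of TPC-H query 4
def q4(lineitem, orders):
    # orders placed in the third quarter of 1993
    filtered = orders.filter(
        orders["o_orderdate"] >= ibis.date(1993, 7, 1),
        orders["o_orderdate"] < ibis.date(1993, 10, 1),
    )
    # keep orders with at least one lineitem received after its commit date (EXISTS)
    late = lineitem.filter(lineitem["l_commitdate"] < lineitem["l_receiptdate"])
    return (
        filtered.semi_join(late, filtered["o_orderkey"] == late["l_orderkey"])
        .group_by("o_orderpriority")
        .agg(order_count=ibis._.count())
        .order_by("o_orderpriority")
    )

A Polars LazyFrame sketch of the same query (this is the version exercised just below):

import polars as pl


# Polars dataframe sketch of TPC-H query 4, assuming LazyFrame inputs
def q4(lineitem, orders):
    filtered = orders.filter(
        pl.col("o_orderdate").is_between(
            pl.date(1993, 7, 1), pl.date(1993, 10, 1), closed="left"
        )
    )
    late = lineitem.filter(pl.col("l_commitdate") < pl.col("l_receiptdate"))
    return (
        filtered.join(late, left_on="o_orderkey", right_on="l_orderkey", how="semi")
        .group_by("o_orderpriority")
        .agg(order_count=pl.len())
        .sort("o_orderpriority")
    )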
res = q4(lineitem.to_polars().lazy(), orders.to_polars().lazy()).collect()
res
shape: (5, 2)
┌───────────────────┬─────────────┐
│ o_orderpriority   ┆ order_count │
│ ---               ┆ ---         │
│ str               ┆ u32         │
╞═══════════════════╪═════════════╡
│ "1-URGENT"        ┆ 10594       │
│ "2-HIGH"          ┆ 10476       │
│ "3-MEDIUM"        ┆ 10410       │
│ "4-NOT SPECIFIED" ┆ 10556       │
│ "5-LOW"           ┆ 10487       │
└───────────────────┴─────────────┘
Define query 4 as a SQL string executed through Ibis:
q4_sql = """
SELECT
    o_orderpriority,
    count(*) AS order_count
FROM orders
WHERE
    o_orderdate >= CAST('1993-07-01' AS date)
    AND o_orderdate < CAST('1993-10-01' AS date)
    AND EXISTS (
        SELECT *
        FROM lineitem
        WHERE l_orderkey = o_orderkey
        AND l_commitdate < l_receiptdate
    )
GROUP BY o_orderpriority
ORDER BY o_orderpriority;
"""
q4_sql = q4_sql.strip().strip(";")


def q4(lineitem, orders, dialect="duckdb", **kwargs):
    return orders.sql(q4_sql, dialect=dialect)
Finally, we write the result to a Parquet file. We are measuring the execution time in seconds of calling the query and writing the results to disk.
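A rough sketch of what that measurement looks like (the names and output path here are illustrative, not the exact ibis_bench implementation):

import time

start = time.time()
res = q4(lineitem, orders)            # build the (lazy) query expression
res.to_parquet("results/q4.parquet")  # execution happens when writing the result
execution_seconds = time.time() - start
execution_seconds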
Next steps
We'll publish the next iteration of this benchmark soon, with updated Polars TPC-H queries and newer versions of all libraries. Polars v1.0.0 should be released soon, and a new DataFusion version that fixes the remaining failing queries is also expected.
If you spot anything wrong, have any questions, or want to share your own analysis, feel free to share below!