TL; DR : Ibis supports both Polars and DataFusion. Both backends are have about the same runtime performance, and lag far behind DuckDB on this workload. There’s negligible performance difference between Ibis and the backend native APIs.
Motivation
This is part 2 of a series of posts showing performance across various backends that Ibis supports.
Check out part 1 if you haven’t already!
In this post, I’ll continue with the Polars and DataFusion backends.
I show each tool using both the Ibis API and the tool’s native API. We’ll see that the performance difference between these approaches is negligible.
Setup
I ran all of the code in this blog post on a machine with these specs.
CPU
AMD EPYC 7B12 (64 threads)
RAM
94 GiB
Disk
1.5 TiB SSD
OS
NixOS (Linux 6.1.66)
Library versions
Here are the versions I used to run this experiment at the time of writing.
Python
3.10.13 (main, Aug 24 2023, 12:59:26) [GCC 12.3.0]
datafusion
33.0.0
ibis
2b54b9800
pandas
2.1.4
polars
0.19.19
pyarrow
14.0.1
Running the query across backends
Here are the different Ibis expressions for each backend as well as the same query with native APIs, along with timed executions of the query.
DuckDB
First, let’s run the Ibis + DuckDB version of the query from the original post:
from __future__ import annotations
import ibis
from ibis import _
expr = (
ibis.read_parquet("/data/pypi-parquet/*.parquet" )
.filter (
[
_.path.re_search(
r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$"
),
~ _.path.re_search(r"(^|/)test(|s|ing)" ),
~ _.path.contains("/site-packages/" ),
]
)
.group_by(
month= _.uploaded_on.truncate("M" ),
ext= _.path.re_extract(r"\.([a-z0-9]+)$" , 1 )
.re_replace(r"cxx|cpp|cc|c|hpp|h" , "C/C++" )
.re_replace("^f.*$" , "Fortran" )
.replace("rs" , "Rust" )
.replace("go" , "Go" )
.replace("asm" , "Assembly" )
.nullif("" ),
)
.aggregate(project_count= _.project_name.nunique())
.dropna("ext" )
.order_by([_.month.desc(), _.project_count.desc()])
)
df = expr.to_pandas()
duckdb_ibis_results = % timeit - n1 - r1 - o % run duckdb_ibis.py
df.head()
33.9 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
0
2023-11-01
C/C++
836
1
2023-11-01
Rust
190
2
2023-11-01
Fortran
48
3
2023-11-01
Go
33
4
2023-11-01
Assembly
10
DataFusion and Polars
from __future__ import annotations
import ibis
from ibis import _
ibis.set_backend("datafusion" )
expr = (
ibis.read_parquet("/data/pypi-parquet/*.parquet" )
.filter (
[
_.path.re_search(
r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$"
),
~ _.path.re_search(r"(^|/)test(|s|ing)" ),
~ _.path.contains("/site-packages/" ),
]
)
.group_by(
month= _.uploaded_on.truncate("M" ),
ext= _.path.re_extract(r"\.([a-z0-9]+)$" , 1 )
.re_replace(r"cxx|cpp|cc|c|hpp|h" , "C/C++" )
.re_replace("^f.*$" , "Fortran" )
.replace("rs" , "Rust" )
.replace("go" , "Go" )
.replace("asm" , "Assembly" )
.nullif("" ),
)
.aggregate(project_count= _.project_name.nunique())
.dropna("ext" )
.order_by([_.month.desc(), _.project_count.desc()])
)
df = expr.to_pandas()
datafusion_ibis_results = % timeit - n1 - r1 - o % run datafusion_ibis.py
df.head()
8min 54s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
0
2023-11-01
C/C++
836
1
2023-11-01
Rust
190
2
2023-11-01
Fortran
48
3
2023-11-01
Go
33
4
2023-11-01
Assembly
10
DataFusion SQL
SELECT
month ,
ext,
COUNT (DISTINCT project_name) AS project_count
FROM (
SELECT
project_name,
DATE_TRUNC('month' , uploaded_on) AS month ,
NULLIF (
REPLACE (
REPLACE (
REPLACE (
REGEXP_REPLACE (
REGEXP_REPLACE (
REGEXP_MATCH(path, CONCAT ('(' , '\.([a-z0-9]+)$' , ')' ))[2 ],
'cxx|cpp|cc|c|hpp|h' ,
'C/C++' ,
'g'
),
'^f.*$' ,
'Fortran' ,
'g'
),
'rs' ,
'Rust'
),
'go' ,
'Go'
),
'asm' ,
'Assembly'
),
''
) AS ext
FROM pypi
WHERE COALESCE (
ARRAY_LENGTH(
REGEXP_MATCH(path, '\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$' )
) > 0 ,
FALSE
)
AND NOT COALESCE (ARRAY_LENGTH(REGEXP_MATCH(path, '(^|/)test(|s|ing)' )) > 0 , FALSE )
AND NOT STRPOS(path, '/site-packages/' ) > 0
)
WHERE ext IS NOT NULL
GROUP BY month , ext
ORDER BY month DESC , project_count DESC
from __future__ import annotations
import datafusion
with open ("./datafusion_native.sql" ) as f:
query = f.read()
ctx = datafusion.SessionContext()
ctx.register_parquet(name= "pypi" , path= "/data/pypi-parquet/*.parquet" )
expr = ctx.sql(query)
df = expr.to_pandas()
datafusion_native_results = % timeit - n1 - r1 - o % run datafusion_native.py
df.head()
9min 4s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
0
2023-11-01
C/C++
836
1
2023-11-01
Rust
190
2
2023-11-01
Fortran
48
3
2023-11-01
Go
33
4
2023-11-01
Assembly
10
from __future__ import annotations
import ibis
from ibis import _
ibis.set_backend("polars" )
expr = (
ibis.read_parquet("/data/pypi-parquet/*.parquet" )
.filter (
[
_.path.re_search(
r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$"
),
~ _.path.re_search(r"(^|/)test(|s|ing)" ),
~ _.path.contains("/site-packages/" ),
]
)
.group_by(
month= _.uploaded_on.truncate("M" ),
ext= _.path.re_extract(r"\.([a-z0-9]+)$" , 1 )
.re_replace(r"cxx|cpp|cc|c|hpp|h" , "C/C++" )
.re_replace("^f.*$" , "Fortran" )
.replace("rs" , "Rust" )
.replace("go" , "Go" )
.replace("asm" , "Assembly" )
.nullif("" ),
)
.aggregate(project_count= _.project_name.nunique())
.dropna("ext" )
.order_by([_.month.desc(), _.project_count.desc()])
)
df = expr.to_pandas(streaming= True )
polars_ibis_results = % timeit - n1 - r1 - o % run polars_ibis.py
df.head()
6min 32s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
0
2023-11-01
C/C++
836
1
2023-11-01
Rust
190
2
2023-11-01
Fortran
48
3
2023-11-01
Go
33
4
2023-11-01
Assembly
10
from __future__ import annotations
import polars as pl
expr = (
pl.scan_parquet("/data/pypi-parquet/*.parquet" )
.filter (
[
pl.col("path" ).str .contains(
r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$"
),
~ pl.col("path" ).str .contains(r"(^|/)test(|s|ing)" ),
~ pl.col("path" ).str .contains("/site-packages/" , literal= True ),
]
)
.with_columns(
month= pl.col("uploaded_on" ).dt.truncate("1mo" ),
ext= pl.col("path" )
.str .extract(pattern= r"\.([a-z0-9]+)$" , group_index= 1 )
.str .replace_all(pattern= r"cxx|cpp|cc|c|hpp|h" , value= "C/C++" )
.str .replace_all(pattern= "^f.*$" , value= "Fortran" )
.str .replace("rs" , "Rust" , literal= True )
.str .replace("go" , "Go" , literal= True )
.str .replace("asm" , "Assembly" , literal= True )
.replace({"" : None }),
)
.group_by(["month" , "ext" ])
.agg(project_count= pl.col("project_name" ).n_unique())
.drop_nulls(["ext" ])
.sort(["month" , "project_count" ], descending= True )
)
df = expr.collect(streaming= True ).to_pandas()
polars_native_results = % timeit - n1 - r1 - o % run polars_native.py
df.head()
6min 54s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
0
2023-11-01
C/C++
836
1
2023-11-01
Rust
190
2
2023-11-01
Fortran
48
3
2023-11-01
Go
33
4
2023-11-01
Assembly
10
Takeaways
Ibis + DuckDB is the only system tested that handles this workload well out of the box
Both Polars and DataFusion are much slower than DuckDB and Dask on this workload.
Polars memory use fluctuates quite bit, while DataFusion’s memory profile is similar to DuckDB.
Let’s recap the results with some numbers:
Numbers
Ibis + DuckDB
25,825 MiB
34 seconds
763 MiB/s
Ibis + Polars
25,825 MiB
393 seconds
66 MiB/s
Polars native API
25,825 MiB
415 seconds
62 MiB/s
Ibis + DataFusion
25,825 MiB
535 seconds
48 MiB/s
DataFusion native API
25,825 MiB
545 seconds
47 MiB/s
I couldn’t figure out how to get consistent run times.
Conclusion
If you’re considering Polars for new code, give Ibis a try with the DuckDB backend .
You’ll get better performance than Polars on some workloads, and with a broader cross-backend API that helps you scale from development to production.
If you find that Polars has better performance than DuckDB on a particular workload you can always switch to the Polars backend for that workload.
Everyone wins!
In the next post in this series we’ll cover the cloud backends: Snowflake, BigQuery, Trino and ClickHouse.
Back to top