Ibis versus X: Performance across the ecosystem part 2

Tags: blog, case study, ecosystem, performance

Author: Phillip Cloud
Published: December 11, 2023

TL;DR: Ibis supports both Polars and DataFusion. Both backends have about the same runtime performance, and both lag far behind DuckDB on this workload. The performance difference between Ibis and each backend's native API is negligible.

Motivation

This is part 2 of a series of posts showing performance across various backends that Ibis supports.

Check out part 1 if you haven’t already!

In this post, I’ll continue with the Polars and DataFusion backends.

I show each tool using both the Ibis API and the tool’s native API. We’ll see that the performance difference between these approaches is negligible.
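To make that comparison concrete, here's a minimal sketch of the approach, using a hypothetical path and column names rather than the real dataset: the expression code is identical for every backend, and only the ibis.set_backend call changes.

from __future__ import annotations

import ibis
from ibis import _


def build_expr():
    # Placeholder path and column names; the real query appears below.
    t = ibis.read_parquet("/path/to/data/*.parquet")
    return t.group_by(_.some_key).aggregate(n=_.some_key.count())


for backend in ("duckdb", "datafusion", "polars"):
    ibis.set_backend(backend)  # the only line that differs per backend
    df = build_expr().to_pandas()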

Setup

I ran all of the code in this blog post on a machine with these specs.

Component   Specification
CPU         AMD EPYC 7B12 (64 threads)
RAM         94 GiB
Disk        1.5 TiB SSD
OS          NixOS (Linux 6.1.66)

Library versions

Here are the versions I used to run this experiment at the time of writing.

Dependency   Version
Python       3.10.13 (main, Aug 24 2023, 12:59:26) [GCC 12.3.0]
datafusion   33.0.0
ibis         2b54b9800
pandas       2.1.4
polars       0.19.19
pyarrow      14.0.1
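If you want to record the same information for your own runs, here's a minimal sketch using only the standard library. It assumes the packages are installed as PyPI distributions, where Ibis ships as ibis-framework; the Ibis version above is a git commit because it was installed from source.

import sys
from importlib.metadata import version

# Print interpreter and library versions for reproducibility.
print("Python", sys.version)
for dist in ("datafusion", "ibis-framework", "pandas", "polars", "pyarrow"):
    print(dist, version(dist))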

Running the query across backends

Here are the Ibis expressions for each backend, the same query written against each tool's native API, and timed executions of each.

DuckDB

First, let’s run the Ibis + DuckDB version of the query from the original post:

from __future__ import annotations

import ibis
from ibis import _

expr = (
    ibis.read_parquet("/data/pypi-parquet/*.parquet")
    .filter(
        [
            _.path.re_search(
                r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$"
            ),
            ~_.path.re_search(r"(^|/)test(|s|ing)"),
            ~_.path.contains("/site-packages/"),
        ]
    )
    .group_by(
        month=_.uploaded_on.truncate("M"),
        ext=_.path.re_extract(r"\.([a-z0-9]+)$", 1)
        .re_replace(r"cxx|cpp|cc|c|hpp|h", "C/C++")
        .re_replace("^f.*$", "Fortran")
        .replace("rs", "Rust")
        .replace("go", "Go")
        .replace("asm", "Assembly")
        .nullif(""),
    )
    .aggregate(project_count=_.project_name.nunique())
    .dropna("ext")
    .order_by([_.month.desc(), _.project_count.desc()])
)
df = expr.to_pandas()
duckdb_ibis_results = %timeit -n1 -r1 -o %run duckdb_ibis.py
33.9 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
df.head()
        month       ext  project_count
0  2023-11-01     C/C++             836
1  2023-11-01      Rust             190
2  2023-11-01   Fortran              48
3  2023-11-01        Go              33
4  2023-11-01  Assembly              10
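The timings use IPython's %timeit magic with a single run of each script. Outside a notebook, a rough equivalent is plain wall-clock timing; a minimal sketch, assuming the duckdb_ibis.py script shown above is on disk:

import runpy
import time

# Run the script once and report wall-clock time, roughly what
# `%timeit -n1 -r1 -o %run duckdb_ibis.py` measures for a single run.
start = time.perf_counter()
runpy.run_path("duckdb_ibis.py", run_name="__main__")
print(f"duckdb_ibis.py took {time.perf_counter() - start:.1f} s")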

DataFusion and Polars

Next up is the Ibis + DataFusion version of the same query:

from __future__ import annotations

import ibis
from ibis import _

ibis.set_backend("datafusion")

expr = (
    ibis.read_parquet("/data/pypi-parquet/*.parquet")
    .filter(
        [
            _.path.re_search(
                r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$"
            ),
            ~_.path.re_search(r"(^|/)test(|s|ing)"),
            ~_.path.contains("/site-packages/"),
        ]
    )
    .group_by(
        month=_.uploaded_on.truncate("M"),
        ext=_.path.re_extract(r"\.([a-z0-9]+)$", 1)
        .re_replace(r"cxx|cpp|cc|c|hpp|h", "C/C++")
        .re_replace("^f.*$", "Fortran")
        .replace("rs", "Rust")
        .replace("go", "Go")
        .replace("asm", "Assembly")
        .nullif(""),
    )
    .aggregate(project_count=_.project_name.nunique())
    .dropna("ext")
    .order_by([_.month.desc(), _.project_count.desc()])
)
df = expr.to_pandas()
datafusion_ibis_results = %timeit -n1 -r1 -o %run datafusion_ibis.py
8min 54s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
df.head()
        month       ext  project_count
0  2023-11-01     C/C++             836
1  2023-11-01      Rust             190
2  2023-11-01   Fortran              48
3  2023-11-01        Go              33
4  2023-11-01  Assembly              10

DataFusion SQL

Here is the same query written as native DataFusion SQL (the contents of datafusion_native.sql):

SELECT
  month,
  ext,
  COUNT(DISTINCT project_name) AS project_count
FROM (
  SELECT
    project_name,
    DATE_TRUNC('month', uploaded_on) AS month,
    NULLIF(
      REPLACE(
        REPLACE(
          REPLACE(
            REGEXP_REPLACE(
              REGEXP_REPLACE(
                REGEXP_MATCH(path, CONCAT('(', '\.([a-z0-9]+)$', ')'))[2],
                'cxx|cpp|cc|c|hpp|h',
                'C/C++',
                'g'
              ),
              '^f.*$',
              'Fortran',
              'g'
            ),
            'rs',
            'Rust'
          ),
          'go',
          'Go'
        ),
        'asm',
        'Assembly'
      ),
      ''
    ) AS ext
  FROM pypi
  WHERE COALESCE(
      ARRAY_LENGTH(
        REGEXP_MATCH(path, '\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$')
      ) > 0,
      FALSE
    )
    AND NOT COALESCE(ARRAY_LENGTH(REGEXP_MATCH(path, '(^|/)test(|s|ing)')) > 0, FALSE)
    AND NOT STRPOS(path, '/site-packages/') > 0
)
WHERE ext IS NOT NULL
GROUP BY month, ext
ORDER BY month DESC, project_count DESC

And here is the native DataFusion Python code that registers the Parquet files and runs the SQL above:

from __future__ import annotations

import datafusion

with open("./datafusion_native.sql") as f:
    query = f.read()

ctx = datafusion.SessionContext()
ctx.register_parquet(name="pypi", path="/data/pypi-parquet/*.parquet")
expr = ctx.sql(query)

df = expr.to_pandas()
datafusion_native_results = %timeit -n1 -r1 -o %run datafusion_native.py
9min 4s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
df.head()
        month       ext  project_count
0  2023-11-01     C/C++             836
1  2023-11-01      Rust             190
2  2023-11-01   Fortran              48
3  2023-11-01        Go              33
4  2023-11-01  Assembly              10

Now the Ibis + Polars version of the query:

from __future__ import annotations

import ibis
from ibis import _

ibis.set_backend("polars")

expr = (
    ibis.read_parquet("/data/pypi-parquet/*.parquet")
    .filter(
        [
            _.path.re_search(
                r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$"
            ),
            ~_.path.re_search(r"(^|/)test(|s|ing)"),
            ~_.path.contains("/site-packages/"),
        ]
    )
    .group_by(
        month=_.uploaded_on.truncate("M"),
        ext=_.path.re_extract(r"\.([a-z0-9]+)$", 1)
        .re_replace(r"cxx|cpp|cc|c|hpp|h", "C/C++")
        .re_replace("^f.*$", "Fortran")
        .replace("rs", "Rust")
        .replace("go", "Go")
        .replace("asm", "Assembly")
        .nullif(""),
    )
    .aggregate(project_count=_.project_name.nunique())
    .dropna("ext")
    .order_by([_.month.desc(), _.project_count.desc()])
)
df = expr.to_pandas(streaming=True)
polars_ibis_results = %timeit -n1 -r1 -o %run polars_ibis.py
6min 32s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
df.head()
        month       ext  project_count
0  2023-11-01     C/C++             836
1  2023-11-01      Rust             190
2  2023-11-01   Fortran              48
3  2023-11-01        Go              33
4  2023-11-01  Assembly              10

And the same query using the native Polars lazy API:

from __future__ import annotations

import polars as pl

expr = (
    pl.scan_parquet("/data/pypi-parquet/*.parquet")
    .filter(
        [
            pl.col("path").str.contains(
                r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$"
            ),
            ~pl.col("path").str.contains(r"(^|/)test(|s|ing)"),
            ~pl.col("path").str.contains("/site-packages/", literal=True),
        ]
    )
    .with_columns(
        month=pl.col("uploaded_on").dt.truncate("1mo"),
        ext=pl.col("path")
        .str.extract(pattern=r"\.([a-z0-9]+)$", group_index=1)
        .str.replace_all(pattern=r"cxx|cpp|cc|c|hpp|h", value="C/C++")
        .str.replace_all(pattern="^f.*$", value="Fortran")
        .str.replace("rs", "Rust", literal=True)
        .str.replace("go", "Go", literal=True)
        .str.replace("asm", "Assembly", literal=True)
        .replace({"": None}),
    )
    .group_by(["month", "ext"])
    .agg(project_count=pl.col("project_name").n_unique())
    .drop_nulls(["ext"])
    .sort(["month", "project_count"], descending=True)
)

df = expr.collect(streaming=True).to_pandas()
polars_native_results = %timeit -n1 -r1 -o %run polars_native.py
6min 54s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
df.head()
        month       ext  project_count
0  2023-11-01     C/C++             836
1  2023-11-01      Rust             190
2  2023-11-01   Fortran              48
3  2023-11-01        Go              33
4  2023-11-01  Assembly              10

Takeaways

Ibis + DuckDB is the only system tested that handles this workload well out of the box

  • Both Polars and DataFusion are much slower than DuckDB and Dask on this workload.
  • Polars memory use fluctuates quite a bit, while DataFusion’s memory profile is similar to DuckDB’s; one simple way to observe this is sketched below.
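A minimal sketch of how you might watch memory while a query runs, assuming the third-party psutil package is available (this is only an illustration, not how the observation above was made): run one of the scripts from this post in a thread and poll this process's resident set size.

import runpy
import threading
import time

import psutil  # third-party; assumed installed

# Run one of the scripts above (here polars_ibis.py) in a thread and
# sample this process's resident set size while it executes.
worker = threading.Thread(
    target=runpy.run_path, args=("polars_ibis.py",), kwargs={"run_name": "__main__"}
)
proc = psutil.Process()
peak = 0
worker.start()
while worker.is_alive():
    peak = max(peak, proc.memory_info().rss)
    time.sleep(0.5)
worker.join()
print(f"peak RSS: {peak / 2**30:.1f} GiB")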

Let’s recap the results with some numbers:

Numbers

Toolset                  Data size    Duration      Throughput
Ibis + DuckDB            25,825 MiB    34 seconds   763 MiB/s
Ibis + Polars            25,825 MiB   393 seconds    66 MiB/s
Polars native API        25,825 MiB   415 seconds    62 MiB/s
Ibis + DataFusion        25,825 MiB   535 seconds    48 MiB/s
DataFusion native API    25,825 MiB   545 seconds    47 MiB/s
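Throughput is just data size divided by wall-clock duration, e.g. roughly 25,825 MiB / 34 s ≈ 760 MiB/s for Ibis + DuckDB (the table uses the unrounded durations, so its figures differ slightly). A minimal sketch of that arithmetic:

# Throughput = data size / duration, using the rounded durations above.
data_size_mib = 25_825
durations_s = {
    "Ibis + DuckDB": 34,
    "Ibis + Polars": 393,
    "Polars native API": 415,
    "Ibis + DataFusion": 535,
    "DataFusion native API": 545,
}
for toolset, seconds in durations_s.items():
    print(f"{toolset}: {data_size_mib / seconds:.0f} MiB/s")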
The Polars run durations were highly variable

I couldn’t figure out how to get consistent run times.

Conclusion

If you’re considering Polars for new code, give Ibis a try with the DuckDB backend.

You’ll get better performance than Polars on some workloads, along with a broader cross-backend API that helps you scale from development to production.

If you find that Polars outperforms DuckDB on a particular workload, you can always switch to the Polars backend for it.

Everyone wins!

In the next post in this series we’ll cover the cloud backends: Snowflake, BigQuery, Trino and ClickHouse.
