Ibis versus X: Performance across the ecosystem part 1
blog · case study · ecosystem · performance
Author: Phillip Cloud
Published: December 6, 2023
TL;DR: Ibis has a lot of great backends. They’re all good at different things. For working with local data, it’s hard to beat DuckDB on feature set and performance.
I thought it would be interesting to see how other tools compare to this setup, so I decided I’d try to do the same workflow on the same machine using a few tools from across the ecosystem.
I chose two incumbents, pandas and Dask, to see how they compare to Ibis + DuckDB on this workload. In part 2 of this series I will compare two newer engines, Polars and DataFusion, to Ibis + DuckDB.
I’ve worked on both pandas and Dask in the past but it’s been such a long time since I’ve used these tools for data analysis that I consider myself rather naive about how to best use them today.
Initially I was interested in API comparisons since usability is really where Ibis shines, but as I started to explore things, I was unable to complete my analysis in some cases due to running out of memory.
This is not a forum to trash the work of others, and I’m not interested in tearing down other tools.
Ibis has backends for each of these tools and it’s in everyone’s best interest that all of the tools discussed here work to their full potential.
I show each tool using its native API, in an attempt to compare ease-of-use out of the box and maximize each library’s ability to complete the workload.
Let’s dig in.
Setup
I ran all of the code in this blog post on a machine with these specs.
All OS caches were cleared before running this document with
$ sudo sysctl -w vm.drop_caches=3
Clearing operating system caches does not represent a realistic usage scenario; it is a method for putting the tools here on more equal footing. When you’re in the thick of an analysis you’re not going to artificially limit any OS optimizations.
| Component | Specification              |
|-----------|----------------------------|
| CPU       | AMD EPYC 7B12 (64 threads) |
| RAM       | 94 GiB                     |
| Disk      | 1.5 TiB SSD                |
| OS        | NixOS (Linux 6.1.69)       |
Soft constraints
I’ll introduce some soft UX constraints on the problem that I think help convey the perspective of someone who wants to get started quickly with a data set:

- I don’t want to get another computer to run this workload.
- I want to use the data as is, that is, without altering the files I already have.
- I’d like to run this computation with the default configuration. Ideally, configuration isn’t required to complete this workload out of the box.
Library versions
Here are the versions I used to run this experiment at the time of writing.
| Dependency | Version                                              |
|------------|------------------------------------------------------|
| Python     | 3.10.13 (main, Aug 24 2023, 12:59:26) [GCC 12.3.0]   |
We’ve since implemented a flatten method on array expressions so it’s no longer necessary to define a UDF here. I’ll leave this code unchanged for this post. This has no effect on the performance of the query. In both cases the generated code contains a DuckDB-native call to its flatten function.
This is a small change from the original query that adds a final sort key to make the results deterministic.
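The flatten mentioned above removes exactly one level of nesting from an array of arrays. As a plain-Python illustration of the semantics (not the code DuckDB actually runs):

```python
from itertools import chain

def flatten_one_level(nested):
    # concatenate the inner lists, removing exactly one level of nesting
    return list(chain.from_iterable(nested))

flatten_one_level([[1, 2], [3]])  # [1, 2, 3]
```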
%time df = expr.to_pandas()
df
CPU times: user 19min 17s, sys: 1min 9s, total: 20min 26s
Wall time: 27 s
          month       ext  project_count
0    2023-11-01     C/C++            836
1    2023-11-01      Rust            190
2    2023-11-01   Fortran             48
3    2023-11-01        Go             33
4    2023-11-01  Assembly             10
..          ...       ...            ...
794  2005-08-01     C/C++              7
795  2005-07-01     C/C++              4
796  2005-05-01     C/C++              1
797  2005-04-01     C/C++              1
798  2005-03-01     C/C++              1

[799 rows × 3 columns]
Let’s show peak memory usage in GB as reported by the resource module:
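As a sketch of how that measurement can be taken with the standard library, assuming Linux, where ru_maxrss is reported in kilobytes:

```python
import resource

def peak_memory_gb() -> float:
    # ru_maxrss is in kilobytes on Linux (it is bytes on macOS)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6
```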
This eventually triggers the Linux OOM killer after some minutes, so I can’t run the code.
Let’s try again with just a single file. I’ll pick the smallest file, to avoid any potential issues with memory and give pandas the best possible shot.
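One way to pick that file, assuming the dataset is a directory of Parquet files (the directory and pattern here are hypothetical, not taken from the original post):

```python
from pathlib import Path

def smallest_file(directory, pattern="*.parquet"):
    # choose the smallest matching file by on-disk size
    return min(Path(directory).glob(pattern), key=lambda p: p.stat().st_size)
```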
CPU times: user 5.36 ms, sys: 25 µs, total: 5.38 ms
Wall time: 5.28 ms
                     projects
month      ext
2023-10-01 Assembly        14
           C/C++          484
           Fortran         23
           Go              25
           Rust            99
2023-11-01 Assembly        10
           C/C++          836
           Fortran         48
           Go              33
           Rust           190
Remember, all of the previous code is executing on a single file and still takes minutes to run.
Conclusion
If I only have pandas at my disposal, I’m unsure of how I can avoid getting a bigger computer to run this query over the entire data set.
Rewriting the query to be fair
At this point I wondered whether this was a fair query to run with pandas.
After all, the downsides of pandas’ use of object arrays to hold nested data structures like lists are well-known.
The original query uses a lot of nested array types, which are very performant in DuckDB, but in this case we’re throwing away all of our arrays and we don’t need to use them.
Additionally, I’m using lambda functions instead of taking advantage of pandas’ fast built-in methods like count, nunique and others.
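To make that contrast concrete, here is a toy example (not the PyPI dataset) of the lambda version versus the built-in; the built-in dispatches to an optimized grouped reduction, while the lambda runs Python code once per group:

```python
import pandas as pd

df = pd.DataFrame({
    "ext": ["C/C++", "C/C++", "Rust"],
    "project": ["a", "b", "b"],
})

# slow: a Python-level lambda is called once per group
slow = df.groupby("ext")["project"].agg(lambda s: s.nunique())

# fast: the built-in grouped reduction avoids per-group Python calls
fast = df.groupby("ext")["project"].nunique()
```

Both produce the same counts; only the execution path differs.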
Let’s see if we can alter the original query to give pandas a leg up.
Notice this order_by call just before a group_by call. Ordering before grouping is somewhat useless here; we should probably sort after we’ve reduced our data. Let’s stick the ordering at the end of the query.
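In pandas terms, the rearrangement amounts to sorting the small aggregated result rather than the full input (again toy data, not the real query):

```python
import pandas as pd

df = pd.DataFrame({
    "ext": ["Rust", "C/C++", "C/C++", "Go"],
    "project": ["a", "b", "c", "d"],
})

# reduce first, then sort the (much smaller) aggregated result
counts = (
    df.groupby("ext")["project"]
    .nunique()
    .sort_values(ascending=False)
)
```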
29.9 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
It looks like the new query might be a bit slower even though we’re ostensibly doing less computation. Since we’re still pretty close to the original duration, let’s keep going.
- Ensure that PyArrow string arrays are used instead of NumPy object arrays. This required no changes to my Dask code because PyArrow strings have been the default since version 2023.7.1, hooray!
- Explore some of the options to read_parquet. It turned out that without setting split_row_groups=True I ran out of memory.
With Ibis + DuckDB, I was able to write the query the way I wanted to without running out of memory, using the default configuration provided by Ibis.
I was able to run this computation around 193x faster than you can expect with pandas using this hardware setup.
In contrast, pandas ran out of memory on a single file without some hand-holding, and while Dask didn’t cause my program to run out of memory, it still used quite a bit more of it than DuckDB.
Pandas is untenable for this workload
Pandas requires me to load everything into memory, and my machine doesn’t have enough memory to do that.
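Scaling pandas past memory here would mean hand-rolling a chunked aggregation. A minimal standard-library sketch of that pattern (illustrative only; the real query computes distinct project counts per month and extension):

```python
from collections import Counter
from itertools import islice

def count_in_chunks(rows, chunk_size=1000):
    """Aggregate incrementally so the full dataset never sits in memory."""
    totals = Counter()
    rows = iter(rows)
    # fold each fixed-size chunk into a running total
    while chunk := list(islice(rows, chunk_size)):
        totals.update(chunk)
    return totals
```

This is exactly the bookkeeping an out-of-core engine does for you.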
Given that Ibis + DuckDB runs this workload on my machine it doesn’t seem worth the effort to write any additional code to make pandas scale to the whole dataset.
Dask finishes in a similar amount of time as Ibis + DuckDB (within 2x)
Out of the box I had quite a bit of difficulty figuring out how to maximize performance and not run out of memory.
Please get in touch if you think my Dask code can be improved!
I know the Dask community is hard at work building dask-expr which might improve the performance of this workload when it lands.
Next steps
Please get in touch!
If you have ideas about how to speed up my use of the tools I’ve discussed here please get in touch by opening a GitHub discussion!
We would love it if more backends handled this workload!
Look out for part 2
In part 2 of this series we’ll explore how Polars and DataFusion perform on this query. Stay tuned!