Ibis 8.0: streaming and more!

release
blog
Author

Ibis team

Published

February 12, 2024

Overview

Ibis 8.0 marks the first release of stream processing backends in Ibis! This enhances the composable data ecosystem vision by allowing users to implement data transformation logic in a standard Python dataframe API and execute it against either batch or streaming systems.

This release includes Apache Flink, a streaming backend, and RisingWave, a streaming database backend. We’ve also added a new batch backend with Exasol, bringing the total number of backends Ibis supports to 20.

Most geospatial operations are now supported in the DuckDB backend, making Ibis a great local option for geospatial analytics.

What is stream processing?

Stream processing systems are designed to handle high-throughput, low-latency data processing with time semantics. They are used to process data in real-time with minimum latency and are often used in applications such as fraud detection, real-time analytics, and IoT. Systems using stream processing are increasingly common in modern data applications.

Apache Flink is the most popular open-source stream processing framework, with numerous cloud options. RisingWave is an open-source Postgres-compatible streaming database with a cloud offering that is gaining popularity and simplifies the streaming experience.

Ibis now supports both and going forward can add more streaming backends to unify the Python user experience across batch and streaming systems.

Unifying batch and streaming UX in Python

Whether you’re using a batch or streaming data platform – and the lines are continually blurring between them – you’ll need a frontend to interact with as a data engineer, analyst, or scientist. If you’re using Python, that frontend is likely a dataframe API.

Standards benefit individual users by reducing the cognitive load of learning and understanding new data systems. Organizations benefit from this in the form of lower onboarding costs, easier collaboration between teams, and better interfaces for data systems.

We saw in the recent one billion row challenge post how even CSV reader keyword arguments can differ greatly between APIs. This is compounded by tightly coupling a dataframe API to every query engine, whether batch or streaming.

Ibis aims to solve this dilemma by providing a standard dataframe API that can work across data systems, whether batch or streaming. This is a long-term vision and we’re excited to take the first steps toward it in Ibis 8.0 with the launch of two streaming backends (and one more batch backend).

This allows a user to leverage DuckDB or Polars or DataFusion locally, then scale out batch processing to Snowflake or BigQuery or ClickHouse in the cloud, then switch from batch to stream processing with Apache Flink or RisingWave, all without changing their dataframe code. As Ibis adds new features and implements them across backends, users can take advantage of these features without needing to learn new APIs.

Backends

Three new backends were added in this release.

RisingWave

RisingWave has contributed second streaming backend with RisingWave. This backend is earlier in development, but we’re excited to have it in Ibis and it will continue to improve it.

Exasol

Exasol has contributed the Exasol backend. This is a traditional batch backend and brings another great option for fast batch analytics to Ibis.

Breaking changes

You can view the full changelog for additional breaking changes. There have been few that we expect to affect most users.

Note

The PM for the team was distracted playing with LLMs and didn’t write a v7 blog post, so we’re covering breaking changes and features from both below.

If you’re new to Ibis, see how to install and the getting started tutorial.

To follow along with this blog, ensure you’re on 'ibis-framework>=8,<9'. First, we’ll setup Ibis and fetch some sample data to use.

import ibis
import ibis.selectors as s

ibis.options.interactive = True
ibis.options.repr.interactive.max_rows = 3

Now, fetch the penguins dataset.

t = ibis.examples.penguins.fetch()
t
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie Torgersen39.118.71813750male  2007 │
│ Adelie Torgersen39.517.41863800female2007 │
│ Adelie Torgersen40.318.01953250female2007 │
│  │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

rename

The largest breaking change in Ibis 7/8 is the deprecation of relabel in favor of rename, swapping the order of the arguments. This change was made to be consistent with the rest of the Ibis API. We apologize for any inconvenience this may cause, but we believe this change will make Ibis a better and more consistent dataframe standard going forward.

In the past, you would use relabel like this:

t.relabel({"species": "SPECIES"})
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ SPECIES  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie Torgersen39.118.71813750male  2007 │
│ Adelie Torgersen39.517.41863800female2007 │
│ Adelie Torgersen40.318.01953250female2007 │
│  │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

Now, you would use rename like this:

t.rename({"SPECIES": "species"})
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ SPECIES  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie Torgersen39.118.71813750male  2007 │
│ Adelie Torgersen39.517.41863800female2007 │
│ Adelie Torgersen40.318.01953250female2007 │
│  │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

or this:

t.rename(SPECIES="species")
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ SPECIES  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie Torgersen39.118.71813750male  2007 │
│ Adelie Torgersen39.517.41863800female2007 │
│ Adelie Torgersen40.318.01953250female2007 │
│  │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

Functionality

A lot of new functionality has been added in Ibis 7/8.

pandas batches

The .to_pandas_batches() method can be used to output batches of pandas dataframes:

batches = t.to_pandas_batches(chunk_size=200)
for df in batches:
    print(df.shape)
(200, 8)
(144, 8)

range

The range() function can be used to create a monotonic sequence of integers:

s = ibis.range(10)
s

[0, 1, ... +8]

You can turn it into a table:

s.unnest().name("index").as_table()
┏━━━━━━━┓
┃ index ┃
┡━━━━━━━┩
│ int8  │
├───────┤
│     0 │
│     1 │
│     2 │
│      │
└───────┘

This can be useful for creating synthetic data and other use cases.

relocate

The .relocate() method can be used to move columns to the beginning of a table, which is very useful for interactive data exploration with wide tables:

t
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie Torgersen39.118.71813750male  2007 │
│ Adelie Torgersen39.517.41863800female2007 │
│ Adelie Torgersen40.318.01953250female2007 │
│  │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

Then:

t.relocate("sex", "year")
┏━━━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ sex     year   species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g ┃
┡━━━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ stringint64stringstringfloat64float64int64int64       │
├────────┼───────┼─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┤
│ male  2007Adelie Torgersen39.118.71813750 │
│ female2007Adelie Torgersen39.517.41863800 │
│ female2007Adelie Torgersen40.318.01953250 │
│  │
└────────┴───────┴─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┘

sample

The .sample() method can be used to sample rows from a table:

Number of rows returned may vary by invocation.

t.count()

344
t.sample(fraction=0.1).count()

36

negative slicing

More Pythonic slicing is now supported:

t[:3]
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie Torgersen39.118.71813750male  2007 │
│ Adelie Torgersen39.517.41863800female2007 │
│ Adelie Torgersen40.318.01953250female2007 │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
t[-3:]
┏━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species    island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├───────────┼────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ ChinstrapDream 49.618.21933775male  2009 │
│ ChinstrapDream 50.819.02104100male  2009 │
│ ChinstrapDream 50.218.71983775female2009 │
└───────────┴────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
t[-6:-3]
┏━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species    island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├───────────┼────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ ChinstrapDream 45.717.01953650female2009 │
│ ChinstrapDream 55.819.82074000male  2009 │
│ ChinstrapDream 43.518.12023400female2009 │
└───────────┴────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

geospatial operations in DuckDB

Ibis supports over 50 geospatial operations, with many being recently added to DuckDB backend. While backend-specific, this is worth calling out because it brings a great local option for geospatial analytics to Ibis. Read the first geospatial blog or the second geospatial blog to learn more.

A new zones example dataset with a geometric datatype has been added for a quick demonstration:

z = ibis.examples.zones.fetch()
z = z.relocate("geom")
z
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ geom                                                                              OBJECTID  Shape_Leng  Shape_Area  zone                     LocationID  borough  x_cent        y_cent        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ geospatial:geometryint32float64float64stringint32stringfloat64float64       │
├──────────────────────────────────────────────────────────────────────────────────┼──────────┼────────────┼────────────┼─────────────────────────┼────────────┼─────────┼──────────────┼───────────────┤
│ <POLYGON ((933100.918 192536.086, 933091.011 192572.175, 933088.585 192604.9...>10.1163570.000782Newark Airport         1EWR    9.359968e+05191376.749531 │
│ <MULTIPOLYGON (((1033269.244 172126.008, 1033439.643 170883.946, 1033473.265...>20.4334700.004866Jamaica Bay            2Queens 1.031086e+06164018.754403 │
│ <POLYGON ((1026308.77 256767.698, 1026495.593 256638.616, 1026567.23 256589....>30.0843410.000314Allerton/Pelham Gardens3Bronx  1.026453e+06254265.478659 │
│  │
└──────────────────────────────────────────────────────────────────────────────────┴──────────┴────────────┴────────────┴─────────────────────────┴────────────┴─────────┴──────────────┴───────────────┘

We can use geospatial operations on that column:

z = z.mutate(
    area=z.geom.area(),
    centroid=z.geom.centroid(),
).relocate("area", "centroid")
z
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ area          centroid                          geom                                                                              OBJECTID  Shape_Leng  Shape_Area  zone                     LocationID  borough  x_cent        y_cent        ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ float64pointgeospatial:geometryint32float64float64stringint32stringfloat64float64       │
├──────────────┼──────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────┼──────────┼────────────┼────────────┼─────────────────────────┼────────────┼─────────┼──────────────┼───────────────┤
│ 7.903953e+07<POINT (935996.821 191376.75)><POLYGON ((933100.918 192536.086, 933091.011 192572.175, 933088.585 192604.9...>10.1163570.000782Newark Airport         1EWR    9.359968e+05191376.749531 │
│ 1.439095e+08<POINT (1031085.719 164018.754)><MULTIPOLYGON (((1033269.244 172126.008, 1033439.643 170883.946, 1033473.265...>20.4334700.004866Jamaica Bay            2Queens 1.031086e+06164018.754403 │
│ 3.168508e+07<POINT (1026452.617 254265.479)><POLYGON ((1026308.77 256767.698, 1026495.593 256638.616, 1026567.23 256589....>30.0843410.000314Allerton/Pelham Gardens3Bronx  1.026453e+06254265.478659 │
│             │
└──────────────┴──────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────┴──────────┴────────────┴────────────┴─────────────────────────┴────────────┴─────────┴──────────────┴───────────────┘

Wrapping up

Ibis 8.0 brings exciting new features and the first streaming backends into Ibis! We hope you’re excited as we are about breaking down barriers between batch and streaming systems with a standard Python dataframe API.

As always, try Ibis by installing and getting started.

If you run into any issues or find support is lacking for your backend, open an issue or discussion and let us know!

Back to top