Ibis 9.0: SQLGlot-ification

release
blog
Author

Ibis team

Published

May 1, 2024

Overview

Ibis 9.0 wraps up “the big refactor”, completing the transition from SQLAlchemy to SQLGlot and drastically simplifying the codebase. This is a big step toward stabilized internals and allows us to more easily add new features and backends going forward. Look out for a dedicated blog post on the refactor soon.

Tip

One long-standing issue with Ibis has been chained joins, which now work as a user would expect. This was a major motivation for the refactor.

Ibis 9.0 also adds new functionality, fixes many bugs, and welcomes two new committers to the project!

Meet the Ibis team!

Members of the core Ibis team will be at PyCon 2024 in Pittsburg, PA, USA in May and SciPy 2024 in Tacoma, WA, USA in July! Stop by at one of the talks or tutorials to say hi in English and/or Español:

Note

The schedule for SciPy 2024 is not available yet, keep an eye out for Ibis sessions.

You can also listen to a recent podcast episode with Real Python or another with :probabl. where Phillip Cloud, lead maintainer of Ibis, talks about the project.

More podcasts and talks are in the works, so stay tuned!

New committers

We’re excited to welcome Nick Crews and Naty Clementi as the newest committers to the Ibis project!

Nick has been a long-time contributor to Ibis, one of the most active people on the issue tracker, and increasingly involved in the community. We’re excited to have him on board!

Naty has been contributing to Ibis for a while now, leading the effort to add geospatial support to the DuckDB backend in addition to various other contributions. We’re excited to have her on board!

9.0 contributors

Ibis 9.0 had contributions from many people, including commits from:

  • Phillip Cloud
  • Krisztián Szűcs
  • Gil Forsyth
  • Jim Crist-Harif
  • Cody Peterson
  • Nick Crews
  • Naty Clementi
  • Chloe He
  • Nicola Coretti
  • Deepyaman Datta
  • Péter Gyarmati
  • Jiting Xu
  • Mehmet Fatih Aktas
  • Tyler White
  • Bryce Mecum
  • Riezebos
  • Chelsea Lin
  • Chip Huyen
  • Dan Lovell
  • Daniel Mesejo
  • Kexiang Wang
  • Mark Needham
  • Ray Bell
  • Thierry Jean
  • pieter-factful
  • saschahofmann

Thank you to everyone who contributed to this release! And a special thanks to Krisztián Szűcs for his work on the internal representation and SQLGlot refactor work, it has drastically improved the Ibis codebase.

With this refactor, it’s never been easier to contribute to Ibis! Get on touch via GitHub or Zulip if you’re interested.

Backends

No new backends were added in this release, unless you count the April Fools’ Day Unix backend! As always, there have been backend-specific improvements and bug fixes. Some highlights are below.

Check the full changelog for more details.

If you’re new to Ibis, see how to install and the getting started tutorial.

To follow along with this blog, ensure you’re on 'ibis-framework>=9,<10'. First, we’ll setup Ibis for interactive use:

import ibis
import ibis.selectors as s

ibis.options.interactive = True
ibis.options.repr.interactive.max_rows = 3

And fetch the penguins dataset:

t = ibis.examples.penguins.fetch()
t
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie Torgersen39.118.71813750male  2007 │
│ Adelie Torgersen39.517.41863800female2007 │
│ Adelie Torgersen40.318.01953250female2007 │
│  │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

Snowflake

You can now run Ibis inside of a Snowflake stored procedure! Look for a blog on this coming soon.

DuckDB

In this release, asof joins are also now supported with a tolerance parameter.

With DuckDB’s addition of MySQL as a database it can attach to, we’ve added a read_mysql function to Ibis inline with read_postgres and read_sqlite.

You can now cast binary data to geometry in the DuckDB backend, enabling use of regular Parquet files:

zones = ibis.examples.zones.fetch().relocate("geom")
zones.to_parquet("zones.parquet")
1
Fetch the zones example data with the geometry column.
2
Write to a Parquet file, converting the geometry column to binary.
zones = ibis.read_parquet("zones.parquet")
zones.schema()
1
Load in the zones Paruqet file.
2
Notice the geometry column is binary.
ibis.Schema {
  geom        binary
  OBJECTID    int32
  Shape_Leng  float64
  Shape_Area  float64
  zone        string
  LocationID  int32
  borough     string
  x_cent      float64
  y_cent      float64
}
zones = zones.mutate(geom=zones["geom"].cast("geometry"))
zones.schema()
1
Cast the binary geometry column to a geometry column.
2
Notice the geometry column is now a geometry type.
ibis.Schema {
  geom        geospatial:geometry
  OBJECTID    int32
  Shape_Leng  float64
  Shape_Area  float64
  zone        string
  LocationID  int32
  borough     string
  x_cent      float64
  y_cent      float64
}

PySpark

PySpark 3.5 is now supported.

Exasol

Numerous improvements have been made to the Exasol backend by Nicola including bitwise operations, time operations, and hexdigest.

RisingWave

Streaming data definition language (DDL) support has been added to the RisingWave backend.

Breaking changes

You can view the full changelog for additional breaking changes. There is one we expect to affect many users.

What does “schema” mean?

In building a standard Python dataframe interface, we must standardize naming conventions and terminology across data systems. Many systems use the words “schema”, “database”, “catalog”, and “namespace” to mean different things.

In Ibis, a “schema” is a mapping of column names to their types:

t.schema()
ibis.Schema {
  species            string
  island             string
  bill_length_mm     float64
  bill_depth_mm      float64
  flipper_length_mm  int64
  body_mass_g        int64
  sex                string
  year               int64
}

And as of Ibis 9.0, that is all “schema” means! We have standardizes on table as the container of data which has a schema. A database is a collection of tables, and a catalog is a collection of databases. Not all backends will support all levels of hierarchy.

Existing methods using the hierarchical meaning of “schema” will now raise a deprecation warning, and the arguments throughout the API have been updated to reflect the standardization.

For example, list_schemas() will give a deprecation warning:

con = ibis.connect("duckdb://")
con.list_schemas()
/var/folders/hm/f6m13d5d4xg6v606d3mfndnh0000gn/T/ipykernel_6573/2769547408.py:2: FutureWarning:

`CanListSchema.list_schemas` is deprecated as of v9.0, removed in v10.0; Use `list_databases` instead`
['information_schema', 'main', 'pg_catalog']

Use list_databases() instead:

con.list_databases()
['information_schema', 'main', 'pg_catalog']

You can also list_catalogs():

con.list_catalogs()
['memory', 'system', 'temp']

And get the current database:

con.current_database
'main'

And the current catalog:

con.current_catalog
'memory'

Other breaking changes

There were a few more minor breaking changes, see the full changelog for more details.

Functionality

A lot of new functionality has been added in Ibis 9.0.

Python 3.12

Ibis now supports Python 3.12!

Describe

Ibis now has a describe method to get a summary of a table:

t.describe()
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ name               type     count  nulls  unique  mode    mean         std         min      p25       p50      p75      max     ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ stringstringint64int64int64stringfloat64float64float64float64float64float64float64 │
├───────────────────┼─────────┼───────┼───────┼────────┼────────┼─────────────┼────────────┼─────────┼──────────┼─────────┼─────────┼─────────┤
│ species          string 34403AdelieNULLNULLNULLNULLNULLNULLNULL │
│ island           string 34403BiscoeNULLNULLNULLNULLNULLNULLNULL │
│ bill_length_mm   float643442164NULL43.9219305.45958432.139.22544.4548.559.6 │
│ bill_depth_mm    float64344280NULL17.1511701.97479313.115.60017.3018.721.5 │
│ flipper_length_mmint64  344255NULL200.91520514.061714172.0190.000197.00213.0231.0 │
│ body_mass_g      int64  344294NULL4201.754386801.9545362700.03550.0004050.004750.06300.0 │
│ sex              string 344112male  NULLNULLNULLNULLNULLNULLNULL │
│ year             int64  34403NULL2008.0290700.8183562007.02007.0002008.002009.02009.0 │
└───────────────────┴─────────┴───────┴───────┴────────┴────────┴─────────────┴────────────┴─────────┴──────────┴─────────┴─────────┴─────────┘

GraphViz custom attributes

Thanks to a couple community contributions, you can now pass custom attributes to the GraphViz visualization of an expression.

Tip

You can also call expr.visualize() and pass the same arguments.

from ibis.expr.visualize import to_graph

expr = (
    t.group_by("species", "island")
    .agg(count=t.count())
    .order_by(ibis._["count"].desc())
)

to_graph(
    expr,
    label_edges=True,
    node_attr={"shape": "hexagon", "color": "green", "fontname": "Roboto Mono"},
    edge_attr={"fontsize": "12", "fontname": "Comic Sans MS"},
)

Polars input and output

You can now directly output a Polars dataframe from an Ibis table (instead of going through Apache Arrow as previously possible):

t.to_polars()
shape: (344, 8)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
str str f64 f64 i64 i64 str i64
"Adelie" "Torgersen" 39.1 18.7 181 3750 "male" 2007
"Adelie" "Torgersen" 39.5 17.4 186 3800 "female" 2007
"Adelie" "Torgersen" 40.3 18.0 195 3250 "female" 2007
"Adelie" "Torgersen" null null null null null 2007
"Adelie" "Torgersen" 36.7 19.3 193 3450 "female" 2007
"Chinstrap" "Dream" 55.8 19.8 207 4000 "male" 2009
"Chinstrap" "Dream" 43.5 18.1 202 3400 "female" 2009
"Chinstrap" "Dream" 49.6 18.2 193 3775 "male" 2009
"Chinstrap" "Dream" 50.8 19.0 210 4100 "male" 2009
"Chinstrap" "Dream" 50.2 18.7 198 3775 "female" 2009

And you can directly construct an Ibis memtable from a Polars dataframe:

ibis.memtable(t.to_polars())
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie Torgersen39.118.71813750male  2007 │
│ Adelie Torgersen39.517.41863800female2007 │
│ Adelie Torgersen40.318.01953250female2007 │
│  │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

These conversions are efficient and improve usability of Ibis and Polars together.

Scalar subqueries from expressions

You can now create scalar subqueries from expressions:

t.select("species").limit(1).as_scalar()

'Adelie'

Disconnect

You can now explicitly disconnect from a backend. First, connect to a backend:

con = ibis.connect("duckdb://")
con.list_tables()
[]

Then you can disconnect. Trying to use the connection after will raise an error:

con.disconnect()
con.list_tables()
ConnectionException: Connection Error: Connection has already been closed

Today

You can now get the current date more easily:

ibis.today()

datetime.date(2024, 5, 2)

And use it in expressions:

t.mutate(ingested_at=ibis.today()).relocate("ingested_at")
┏━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ ingested_at  species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ datestringstringfloat64float64int64int64stringint64 │
├─────────────┼─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ 2024-05-02Adelie Torgersen39.118.71813750male  2007 │
│ 2024-05-02Adelie Torgersen39.517.41863800female2007 │
│ 2024-05-02Adelie Torgersen40.318.01953250female2007 │
│  │
└─────────────┴─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

UUIDs

You can now generate universally unique identifiers (UUIDs) in Ibis more easily:

ibis.uuid()

UUID('541ab072-9f28-46dd-b402-8ebf3b697fc8')

And use them in expressions:

t.mutate(id=ibis.uuid()).relocate("id")
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ id                                    species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ uuidstringstringfloat64float64int64int64stringint64 │
├──────────────────────────────────────┼─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ 17b11546-2573-4ed6-84b8-0c3ee6f89060 │ Adelie Torgersen39.118.71813750male  2007 │
│ 0acdb8bf-dc1b-4cfe-bdba-8c5be4a5f84a │ Adelie Torgersen39.517.41863800female2007 │
│ ac8830ac-74d1-44fc-a4dc-b5b7a0343111 │ Adelie Torgersen40.318.01953250female2007 │
│  │
└──────────────────────────────────────┴─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

TopK includes NULLs

This could technically be considered a breaking change or bug fix. Regardless, topk now includes NULLs by default:

t["sex"].topk(3)
┏━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ sex     CountStar(penguins) ┃
┡━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ stringint64               │
├────────┼─────────────────────┤
│ male  168 │
│ female165 │
│ NULL11 │
└────────┴─────────────────────┘

Order and group by variadic arguments

For consistency with other methods, order_by and group_by now accept variadic arguments:

t.order_by(t["body_mass_g"].desc(), "island", t["species"])
┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├─────────┼────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Gentoo Biscoe49.215.22216300male  2007 │
│ Gentoo Biscoe59.617.02306050male  2007 │
│ Gentoo Biscoe51.116.32206000male  2008 │
│  │
└─────────┴────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
grouped = (
    t.group_by("species", "island")
    .agg(count=t.count())
    .order_by(ibis._["count"].desc())
)
grouped
┏━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species    island  count ┃
┡━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringint64 │
├───────────┼────────┼───────┤
│ Gentoo   Biscoe124 │
│ ChinstrapDream 68 │
│ Adelie   Dream 56 │
│  │
└───────────┴────────┴───────┘

You can still pass in a list or tuple if you prefer.

Unwrap JSON values

You can now unwrap JSON values into backend-native values:

data = {
    "jstring": ['"a"', '""', None, "null"],
    "jbool": ["true", "false", "null", None],
    "jint": ["1", "null", None, "2"],
    "jfloat": ["42.42", None, "null", "37.37"],
    "jmap": ['{"a": 1}', "null", None, "{}"],
    "jarray": ["[]", "null", None, '[{},"1",2]'],
}

t = ibis.memtable(data, schema=dict.fromkeys(data.keys(), "json"))
t
┏━━━━━━━━━┳━━━━━━━┳━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┓
┃ jstring  jbool  jint  jfloat  jmap      jarray ┃
┡━━━━━━━━━╇━━━━━━━╇━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩
│ jsonjsonjsonjsonjsonjson   │
├─────────┼───────┼──────┼────────┼──────────┼────────┤
│ 'a'True142.42{'a': 1}[]     │
│ ''FalseNoneNULLNoneNone   │
│ NULLNoneNULLNoneNULLNULL   │
│       │
└─────────┴───────┴──────┴────────┴──────────┴────────┘
t.select(unwrapped=t.jstring.unwrap_as(str), original=t.jstring)
┏━━━━━━━━━━━┳━━━━━━━━━━┓
┃ unwrapped  original ┃
┡━━━━━━━━━━━╇━━━━━━━━━━┩
│ stringjson     │
├───────────┼──────────┤
│ a        'a'      │
│ ~''       │
│ NULLNULL     │
│         │
└───────────┴──────────┘
t.select(unwrapped=t.jbool.unwrap_as("bool"), original=t.jbool)
┏━━━━━━━━━━━┳━━━━━━━━━━┓
┃ unwrapped  original ┃
┡━━━━━━━━━━━╇━━━━━━━━━━┩
│ booleanjson     │
├───────────┼──────────┤
│ True      │ True     │
│ False     │ False    │
│ NULLNone     │
│         │
└───────────┴──────────┘

Create more backends with empty URLs

For convenience, you can now create backends with an empty URL for Dask, pandas, Polars, and DataFusion:

con = ibis.connect("dask://")
con = ibis.connect("pandas://")
con = ibis.connect("polars://")
con = ibis.connect("datafusion://")

This is in addition to the existing backend-specific connection methods:

con = ibis.dask.connect()
con = ibis.pandas.connect()
con = ibis.polars.connect()
con = ibis.datafusion.connect()

Note that URL parameters are not meaningful with these backends. The main use case is more convenient connection swapping programmatically:

for backend in [
    "duckdb",
    "sqlite",
    "pyspark",
    "dask",
    "pandas",
    "polars",
    "datafusion",
]:
    con = ibis.connect(f"{backend}://")
    print(con)
<ibis.backends.duckdb.Backend object at 0x31cd15750>
<ibis.backends.sqlite.Backend object at 0x31f5b8d50>
<ibis.backends.pyspark.Backend object at 0x31f7c46d0>
<ibis.backends.dask.Backend object at 0x31ce68c10>
<ibis.backends.pandas.Backend object at 0x340c84250>
<ibis.backends.polars.Backend object at 0x341418490>
<ibis.backends.datafusion.Backend object at 0x3414188d0>

Wrapping up

Ibis 9.0 is an exciting release that completes the transition to SQLGlot, adds a lot of new functionality, and welcomes a new committer to the project!

As always, try Ibis by installing and getting started.

If you run into any issues or find support is lacking for your backend, open an issue or discussion and let us know!

Back to top