Introduction
Ibis 3.1 has officially been released as the latest version of the package. With this release comes new convenience features, increased backend operation coverage and a plethora of bug fixes. As usual, a full list of the changes can be found in the project release notes here Let’s talk about some of the new changes 3.1 brings for Ibis users.
ibis.connect
The first significant change to note is that, Ibis now provides a more convenient way to connect to a backend using the ibis.connect
method. You can now use this function to connect to an appropriate backend using a connection string.
Here are some examples:
Initialize a DuckDB instance using 'duckdb://:memory:'
= ibis.connect('duckdb://:memory:') conn
And begin registering your tables:
'csv://farm_data/dates.csv', 'dates')
conn.register('csv://farm_data/farmer_groups.csv', 'farmer_groups')
conn.register('csv://farm_data/crops.csv', 'crops')
conn.register('csv://farm_data/farms.csv', 'farms')
conn.register('csv://farm_data/harvest.csv', 'harvest')
conn.register('csv://farm_data/farmers.csv', 'farmers')
conn.register('csv://farm_data/tracts.csv', 'tracts')
conn.register('csv://farm_data/fields.csv', 'fields') conn.register(
You can also do this programmatically:
= glob.glob('farm_data/*.csv')
files
for file in files:
= 'csv://' + file
fname = file.replace('farm_data/', '').replace('.csv', '')
tname conn.register(fname, tname)
This method isn’t limited to csv://
. It works with parquet://
and csv.gz://
as well. Give it a try!
= ibis.connect('postgres://<username>:<password>@<host>:<port>/<database>') conn
Or, using a .pgpass
file:
= ibis.connect('postgres://<username>@<host>:<port>/<database>') conn
Unnest Support
One of the trickier parts about working with data is that it doesn’t usually come organized in neat, predictable rows and columns. Instead data often consists of rows that could contain a single bit of data or arrays of it. When data is organized in layers, as with arrays, it can sometimes be difficult to work with. Ibis 3.1 introduces the unnest
function as a way to flatten arrays of data.
Unnest takes a column containing an array of values and separates the individual values into rows as shown:
Before Unnest:
col |
---|
[1, 2] |
After Unnest:
col |
---|
1 |
2 |
Here is a self-contained example of creating a dataset with an array and then unnesting it:
import ibis
import pandas as pd
# Parquet save path
= 'array_data.parquet'
fname
# Mock Data
= [
data 'array_id', 'array_value']
[1, [1, 3, 4]]
,[2, [2, 4, 5]]
,[3, [6, 8]]
,[4, [1, 6]]
,[
]
# Save as parquet
1:], columns=data[0]).to_parquet(fname)
pd.DataFrame(data[
# Connect to the file using a DuckDB backend
= ibis.connect(f"duckdb://{fname}")
conn
# Create a table expression for your loaded data
= conn.table("array_data")
array_data
# Optionally execute the array data to preview
array_data.execute()
# select the unnested values with their corresponding IDs
'array_id', array_data['array_value'].unnest()]).execute() array_data.select([
import ibis
import pandas as pd
# Postgres connection string for user 'ibistutorials' with a valid .pgpass file in `/
# See https://www.postgresql.org/docs/9.3/libpq-pgpass.html for details on `/.pgpass
= 'postgres://ibistutorials@localhost:5432/pg-ibis'
cstring
# Mock Data
= [
data 'array_id', 'array_value']
[1, [1, 3, 4]]
,[2, [2, 4, 5]]
,[3, [6, 8]]
,[4, [1, 6]]
,[
]
# Create a dataframe for easy loading
= pd.DataFrame(data[1:], columns=data[0])
df
# Postgres backend connection
= ibis.connect(cstring)
conn
# SQLAlchemy Types
# Integer type
= ibis.backends.postgres.sa.types.INT()
int_type # Array type function
= ibis.backends.postgres.sa.types.ARRAY
arr_f
# Load data to table using pd.DataFrame.to_sql
df.to_sql(='array_data'
name=conn.con.connect()
,con='replace'
,if_exists=False
,index={
,dtype'array_id': int_type
'array_value': arr_f(int_type)
,
}
)
# Array Data Table Expression
= conn.table("array_data")
array_data
# Optionally execute to preview entire table
# array_data.execute()
# Unnest
'array_id', array_data['array_value'].unnest()]).execute() array_data.select([
_
API
There is now a shorthand for lambda functions using underscore (_
). This is useful for chaining expressions to one another and helps reduce total line characters and appearances of lambdas.
For example, let’s use array_data
from above. We will unnest array_value
, find the weighted average, and then sum in one expression:
from ibis import _
(
array_data
.select(['array_id'
# array_data returns a TableExpr, `_` here is shorthand
# for that returned expression
'array_value'].unnest().name('arval')
,_[# we can use it instead of saying `array_data`
'array_value'].length().cast('float')
,(_[/ _['array_value'].length().sum().cast('float')).name('wgt')
])# Since the above `select` statement returns a TableExpr, we can use
# `_` to reference that one as well:
=_.arval * _.wgt)
.mutate(wgt_prod# And again:
=_.wgt_prod.sum(), vcount=_.wgt_prod.count())
.aggregate(vsum# And again:
=_.vsum / _.vcount)
.mutate(wgt_mean ).execute()
Note that if you import _
directly from ibis
(from ibis import _
), the default _
object will lose its functionality, so be mindful if you have a habit of using it outside of Ibis.
Additional Changes
Along with these changes, the operation matrix has had a few more holes filled. Contributors should note that backend test data is now loaded dynamically. Most users won’t be exposed to this update, but it should make contribution a bit more streamlined.
To see the full patch notes, go to the patch notes page
As always, Ibis is free and open source. Contributions are welcome and encouraged–drop into the discussions, raise an issue, or put in a pull request.
Download ibis 3.1 today!