import ibis
= True
ibis.options.interactive
@ibis.udf.scalar.builtin
def mismatches(left: str, right: str) -> int:
...
Reference built-in functions
Scalar functions
Functions that aren’t exposed in ibis directly can be accessed using the @ibis.udf.scalar.builtin
decorator.
Builtin scalar UDFs are designed to be an escape hatch when Ibis doesn’t have a defined API for a built-in database function.
See the reference documentation for existing APIs.
DuckDB
Ibis doesn’t directly expose many of the DuckDB text similarity functions. Let’s expose the mismatches
API.
The ...
is a visual indicator that the function definition is unknown to Ibis.
Ibis will not execute the function body or otherwise inspect it. Any code you write in the function body will be ignored.
We can now call this function on any ibis expression:
= ibis.duckdb.connect() con
- 1
- Connect to an in-memory DuckDB database
= mismatches("duck", "luck")
expr con.execute(expr)
1
Like any other ibis expression you can inspect the SQL:
="duckdb") ibis.to_sql(expr, dialect
- 1
-
The
dialect
keyword argument must be passed, because we constructed a literal expression which has no backend attached.
SELECT
'duck', 'luck') AS "mismatches_0('duck', 'luck')" MISMATCHES(
Similarly we can expose Duckdb’s jaro_winkler_similarity
function. Let’s alias it to jw_sim
to illustrate some more of the Ibis udf
API:
@ibis.udf.scalar.builtin(name="jaro_winkler_similarity")
def jw_sim(a: str, b: str) -> float:
...
Because built-in UDFs are ultimately Ibis expressions, they compose with the rest of the library:
= ibis.read_parquet(
pkgs "https://storage.googleapis.com/ibis-tutorial-data/pypi/2024-04-24/packages.parquet"
)= pkgs.filter(jw_sim(pkgs.name, "pandas") >= 0.9)
pandas_ish pandas_ish
┏━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ name ┃ version ┃ requires_python ┃ yanked ┃ has_binary_wheel ┃ has_vulnerabilities ┃ first_uploaded_at ┃ last_uploaded_at ┃ recorded_at ┃ downloads ┃ scorecard_overall ┃ in_google_assured_oss ┃ ┡━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━┩ │ string │ string │ string │ int64 │ int64 │ int64 │ timestamp(6) │ timestamp(6) │ timestamp(6) │ int64 │ float64 │ int64 │ ├──────────┼─────────┼─────────────────┼────────┼──────────────────┼─────────────────────┼─────────────────────┼─────────────────────┼─────────────────────┼───────────┼───────────────────┼───────────────────────┤ │ bcpandas │ 2.6.1 │ >=3.8.1 │ 1 │ 0 │ 0 │ 2024-04-04 02:41:13 │ 2024-04-04 02:41:15 │ 2024-04-24 17:31:57 │ 381247 │ NULL │ 0 │ │ espandas │ 1.0.4 │ ~ │ 0 │ 0 │ 0 │ 2018-12-22 20:52:30 │ 2018-12-22 20:52:30 │ 2024-04-24 17:57:12 │ 0 │ 3.0 │ 0 │ │ fpandas │ 0.5 │ ~ │ 0 │ 0 │ 0 │ 2020-03-09 02:35:31 │ 2020-03-09 02:35:31 │ 2024-04-24 18:02:43 │ 0 │ NULL │ 0 │ │ h3pandas │ 0.2.6 │ >=3.8 │ 0 │ 0 │ 0 │ 2023-11-21 19:49:47 │ 2023-11-21 19:49:47 │ 2024-04-24 18:08:12 │ 0 │ NULL │ 0 │ │ h5pandas │ 0.9 │ ~ │ 0 │ 0 │ 0 │ 2024-02-25 18:10:23 │ 2024-02-25 18:10:25 │ 2024-04-24 18:08:13 │ 0 │ NULL │ 0 │ │ ipandas │ 0.0.1 │ ~ │ 0 │ 0 │ 0 │ 2019-05-29 18:46:12 │ 2019-05-29 18:46:12 │ 2024-04-24 18:13:40 │ 0 │ 3.0 │ 0 │ │ kpandas │ 0.0.1 │ >=3.6,<4.0 │ 0 │ 0 │ 0 │ 2019-05-02 18:00:29 │ 2019-05-02 18:00:31 │ 2024-04-24 18:18:18 │ 0 │ NULL │ 0 │ │ lipandas │ 1.0.0 │ ~ │ 0 │ 0 │ 0 │ 2023-12-03 23:32:38 │ 2023-12-03 23:32:38 │ 2024-04-24 18:21:07 │ 0 │ NULL │ 0 │ │ mpandas │ 0.0.2.1 │ ~ │ 0 │ 0 │ 0 │ 2022-07-03 16:21:21 │ 2022-07-03 16:21:23 │ 2024-04-24 18:29:01 │ 0 │ NULL │ 0 │ │ mtpandas │ 1.16.5 │ >=3.6 │ 0 │ 0 │ 0 │ 2024-04-10 14:20:52 │ 2024-04-10 14:20:52 │ 2024-04-24 18:29:31 │ 0 │ 3.8 │ 0 │ │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ └──────────┴─────────┴─────────────────┴────────┴──────────────────┴─────────────────────┴─────────────────────┴─────────────────────┴─────────────────────┴───────────┴───────────────────┴───────────────────────┘
Defining Signatures
Sometimes the signatures of builtin functions are difficult to spell.
Consider a function that computes the length of any array: the elements in the array can be floats, integers, strings and even other arrays. Spelling that type is difficult.
Fortunately, the udf.scalar.builtin
decorator only requires you to specify the type of the return value. The type of the function parameters are not required. Thus, this is adequate:
@ibis.udf.scalar.builtin(name="array_length")
def cardinality(arr) -> int:
...
We can pass arrays with different element types to our cardinality
function:
1, 2, 3])) con.execute(cardinality([
3
"a", "b"])) con.execute(cardinality([
2
When you do not specify input types, Ibis isn’t able to catch typing errors early, and they are only caught during execution. The errors you get back are backend dependent:
"foo")) con.execute(cardinality(
--------------------------------------------------------------------------- BinderException Traceback (most recent call last) Cell In[10], line 1 ----> 1 con.execute(cardinality("foo")) File ~/work/ibis/ibis/ibis/backends/duckdb/__init__.py:1414, in Backend.execute(self, expr, params, limit, **_) 1411 import pandas as pd 1412 import pyarrow.types as pat -> 1414 table = self._to_duckdb_relation(expr, params=params, limit=limit).arrow() 1416 df = pd.DataFrame( 1417 { 1418 name: ( (...) 1430 } 1431 ) 1432 df = DuckDBPandasData.convert_table(df, expr.as_table().schema()) File ~/work/ibis/ibis/ibis/backends/duckdb/__init__.py:1351, in Backend._to_duckdb_relation(self, expr, params, limit) 1349 if table_expr.schema().geospatial: 1350 self._load_extensions(["spatial"]) -> 1351 return self.con.sql(sql) BinderException: Binder Error: No function matches the given name and argument types 'array_length(STRING_LITERAL)'. You might need to add explicit type casts. Candidate functions: array_length(ANY[]) -> BIGINT array_length(ANY[], BIGINT) -> BIGINT
Here, DuckDB is informing us that the ARRAY_LENGTH
function does not accept strings as input.
Aggregate functions
Aggregate functions that aren’t exposed in ibis directly can be accessed using the @ibis.udf.agg.builtin
decorator.
Builtin aggregate UDFs are designed to be an escape hatch when Ibis doesn’t have a defined API for a built-in database function.
See the reference documentation for existing APIs.
Let’s use the DuckDB backend to demonstrate how to access an aggregate function that isn’t exposed in ibis: kurtosis
.
DuckDB
First, define the builtin aggregate function:
@ibis.udf.agg.builtin
def kurtosis(x: float) -> float:
...
- 1
- Both the input and return type annotations indicate the element type of the input, not the shape (column or scalar). Aggregations can only be called on column expressions.
One of the powerful features of this API is that you can define your UD(A)Fs at any point during your analysis. You don’t need to connect to the database to define your functions.
Let’s compute the kurtosis of the number of votes across all movies:
from ibis import _
= (
expr
ibis.examples.imdb_title_ratings.fetch()"snake_case")
.rename(=lambda t: kurtosis(t.num_votes))
.agg(kurt
) expr
┏━━━━━━━━━━━━┓ ┃ kurt ┃ ┡━━━━━━━━━━━━┩ │ float64 │ ├────────────┤ │ 4764.64335 │ └────────────┘
Since this is an aggregate function, it has the same capabilities as other, builtin aggregates like sum
: it can be used in a group by as well as in a window function expression.
Let’s compute kurtosis for all the different types of productions (shorts, movies, TV, etc):
= (
basics
ibis.examples.imdb_title_basics.fetch()"snake_case")
.rename(filter(_.is_adult == 0)
.
)= ibis.examples.imdb_title_ratings.fetch().rename("snake_case")
ratings
= ratings.join(basics, "tconst")
basics_ratings
= (
expr "title_type")
basics_ratings.group_by(=lambda t: kurtosis(t.num_votes))
.agg(kurt
.order_by(_.kurt.desc())
.head()
) expr
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓ ┃ title_type ┃ kurt ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩ │ string │ float64 │ ├──────────────┼─────────────┤ │ tvEpisode │ 7452.859695 │ │ tvSeries │ 4133.328876 │ │ short │ 3992.123752 │ │ tvMiniSeries │ 1968.324239 │ │ tvSpecial │ 1461.166556 │ └──────────────┴─────────────┘
Similarly for window functions:
= (
expr
basics_ratings.mutate(=lambda t: kurtosis(t.num_votes).over(group_by="title_type")
kurt
)"kurt", after="tconst")
.relocate(filter(
.
["godfather"),
_.original_title.lower().contains(== "movie",
_.title_type "Crime") & _.genres.contains("Drama"),
_.genres.contains(
]
)
) expr
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓ ┃ tconst ┃ kurt ┃ average_rating ┃ num_votes ┃ title_type ┃ primary_title ┃ original_title ┃ is_adult ┃ start_year ┃ end_year ┃ runtime_minutes ┃ genres ┃ ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩ │ string │ float64 │ float64 │ int64 │ string │ string │ string │ int64 │ int64 │ string │ int64 │ string │ ├────────────┼─────────────┼────────────────┼───────────┼────────────┼────────────────────────┼────────────────────────┼──────────┼────────────┼──────────┼─────────────────┼────────────────────┤ │ tt0250404 │ 1127.262502 │ 6.5 │ 252 │ movie │ Godfather │ Godfather │ 0 │ 1992 │ NULL │ NULL │ Crime,Drama │ │ tt13130308 │ 1127.262502 │ 5.2 │ 7543 │ movie │ Godfather │ Godfather │ 0 │ 2022 │ NULL │ 157 │ Action,Crime,Drama │ │ tt0458027 │ 1127.262502 │ 3.5 │ 26 │ movie │ Mumbai Godfather │ Mumbai Godfather │ 0 │ 2005 │ NULL │ 110 │ Action,Crime,Drama │ │ tt0068646 │ 1127.262502 │ 9.2 │ 2001286 │ movie │ The Godfather │ The Godfather │ 0 │ 1972 │ NULL │ 175 │ Crime,Drama │ │ tt0071562 │ 1127.262502 │ 9.0 │ 1357703 │ movie │ The Godfather Part II │ The Godfather Part II │ 0 │ 1974 │ NULL │ 202 │ Crime,Drama │ │ tt0074412 │ 1127.262502 │ 5.2 │ 1785 │ movie │ Disco Godfather │ Disco Godfather │ 0 │ 1979 │ NULL │ 98 │ Action,Crime,Drama │ │ tt0099674 │ 1127.262502 │ 7.6 │ 422466 │ movie │ The Godfather Part III │ The Godfather Part III │ 0 │ 1990 │ NULL │ 162 │ Crime,Drama │ └────────────┴─────────────┴────────────────┴───────────┴────────────┴────────────────────────┴────────────────────────┴──────────┴────────────┴──────────┴─────────────────┴────────────────────┘