Reference built-in functions

Scalar functions

Functions that aren’t exposed in ibis directly can be accessed using the @ibis.udf.scalar.builtin decorator.

Ibis APIs may already exist for your function.

Builtin scalar UDFs are designed to be an escape hatch when Ibis doesn’t have a defined API for a built-in database function.

See the reference documentation for existing APIs.

DuckDB

Ibis doesn’t directly expose many of the DuckDB text similarity functions. Let’s expose the mismatches API.

import ibis
ibis.options.interactive = True

@ibis.udf.scalar.builtin
def mismatches(left: str, right: str) -> int:
    ...

The ... is a visual indicator that the function definition is unknown to Ibis.

Ibis will not execute the function body or otherwise inspect it. Any code you write in the function body will be ignored.

We can now call this function on any ibis expression:

con = ibis.duckdb.connect()
1
Connect to an in-memory DuckDB database
expr = mismatches("duck", "luck")
con.execute(expr)
1

Like any other ibis expression you can inspect the SQL:

ibis.to_sql(expr, dialect="duckdb")
1
The dialect keyword argument must be passed, because we constructed a literal expression which has no backend attached.
SELECT
  MISMATCHES('duck', 'luck') AS "mismatches_0('duck', 'luck')"

Similarly we can expose Duckdb’s jaro_winkler_similarity function. Let’s alias it to jw_sim to illustrate some more of the Ibis udf API:

@ibis.udf.scalar.builtin(name="jaro_winkler_similarity")
def jw_sim(a: str, b: str) -> float:
   ...

Because built-in UDFs are ultimately Ibis expressions, they compose with the rest of the library:

pkgs = ibis.read_parquet(
   "https://storage.googleapis.com/ibis-tutorial-data/pypi/packages.parquet"
)
pandas_ish = pkgs[jw_sim(pkgs.name, "pandas") >= 0.9]
pandas_ish
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┓
┃ name      version            requires_python  yanked   has_binary_wheel  has_vulnerabilities  first_uploaded_at    last_uploaded_at     recorded_at          downloads  scorecard_overall  in_google_assured_oss ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringstringstringbooleanbooleanbooleantimestamp(6)timestamp(6)timestamp(6)int32float64boolean               │
├──────────┼───────────────────┼─────────────────┼─────────┼──────────────────┼─────────────────────┼─────────────────────┼─────────────────────┼─────────────────────┼───────────┼───────────────────┼───────────────────────┤
│ bcpandas2.4.1            >=3.8.1         │ True    │ False            │ False               │ 2023-07-12 06:14:222023-07-12 06:14:232023-07-12 14:31:410NULL │ False                 │
│ espandas1.0.4            ~               │ False   │ False            │ False               │ 2018-12-22 20:52:302018-12-22 20:52:302023-07-12 14:58:4703.6 │ False                 │
│ fpandas 0.5              ~               │ False   │ False            │ False               │ 2020-03-09 02:35:312020-03-09 02:35:312023-07-12 15:04:230NULL │ False                 │
│ h3pandas0.2.4            >=3.6           │ False   │ False            │ False               │ 2023-03-19 17:58:162023-03-19 17:58:162023-07-12 15:10:060NULL │ False                 │
│ ipandas 0.0.1            ~               │ False   │ False            │ False               │ 2019-05-29 18:46:122019-05-29 18:46:122023-07-12 15:15:3403.6 │ False                 │
│ kpandas 0.0.1            >=3.6,<4.0      │ False   │ False            │ False               │ 2019-05-02 18:00:292019-05-02 18:00:312023-07-12 15:20:210NULL │ False                 │
│ mpandas 0.0.2.1          ~               │ False   │ False            │ False               │ 2022-07-03 16:21:212022-07-03 16:21:232023-07-12 15:30:350NULL │ False                 │
│ mtpandas1.14.202306141807>=3.6           │ False   │ False            │ False               │ 2023-06-14 18:08:012023-06-14 18:08:012023-07-12 15:31:0404.6 │ False                 │
│ mypandas0.1.6            >=3.10          │ False   │ False            │ False               │ 2022-10-24 21:01:102022-10-24 21:01:122023-07-12 15:32:040NULL │ False                 │
│ paandas 0.0.3            ~               │ False   │ False            │ False               │ 2022-11-24 06:11:152022-11-24 06:11:172023-07-12 15:43:310NULL │ False                 │
│                      │
└──────────┴───────────────────┴─────────────────┴─────────┴──────────────────┴─────────────────────┴─────────────────────┴─────────────────────┴─────────────────────┴───────────┴───────────────────┴───────────────────────┘

Defining Signatures

Sometimes the signatures of builtin functions are difficult to spell.

Consider a function that computes the length of any array: the elements in the array can be floats, integers, strings and even other arrays. Spelling that type is difficult.

Fortunately, the udf.scalar.builtin decorator only requires you to specify the type of the return value. The type of the function parameters are not required. Thus, this is adequate:

@ibis.udf.scalar.builtin(name="array_length")
def cardinality(arr) -> int:
   ...

We can pass arrays with different element types to our cardinality function:

con.execute(cardinality([1, 2, 3]))
3
con.execute(cardinality(["a", "b"]))
2

When you do not specify input types, Ibis isn’t able to catch typing errors early, and they are only caught during execution. The errors you get back are backend dependent:

con.execute(cardinality("foo"))
BinderException: Binder Error: No function matches the given name and argument types 'array_length(VARCHAR)'. You might need to add explicit type casts.
    Candidate functions:
    array_length(ANY[]) -> BIGINT
    array_length(ANY[], BIGINT) -> BIGINT

LINE 2:   ARRAY_LENGTH('foo') AS "cardinal...
          ^

Here, DuckDB is informing us that the ARRAY_LENGTH function does not accept strings as input.

Aggregate functions

Aggregate functions that aren’t exposed in ibis directly can be accessed using the @ibis.udf.agg.builtin decorator.

Ibis APIs may already exist for your function.

Builtin aggregate UDFs are designed to be an escape hatch when Ibis doesn’t have a defined API for a built-in database function.

See the reference documentation for existing APIs.

Let’s the use the DuckDB backend to demonstrate how to access an aggregate function that isn’t exposed in ibis: kurtosis.

DuckDB

First, define the builtin aggregate function:

@ibis.udf.agg.builtin
def kurtosis(x: float) -> float:
   ...
1
Both the input and return type annotations indicate the element type of the input, not the shape (column or scalar). Aggregations can only be called on column expressions.

One of the powerful features of this API is that you can define your UD(A)Fs at any point during your analysis. You don’t need to connect to the database to define your functions.

Let’s compute the kurtosis of the number of votes across all movies:

from ibis import _

expr = (
   ibis.examples.imdb_title_ratings.fetch()
   .rename("snake_case")
   .agg(kurt=lambda t: kurtosis(t.num_votes))
)
expr
┏━━━━━━━━━━━━━┓
┃ kurt        ┃
┡━━━━━━━━━━━━━┩
│ float64     │
├─────────────┤
│ 4545.349906 │
└─────────────┘

Since this is an aggregate function, it has the same capabilities as other, builtin aggregates like sum: it can be used in a group by as well as in a window function expression.

Let’s compute kurtosis for all the different types of productions (shorts, movies, TV, etc):

basics = (
   ibis.examples.imdb_title_basics.fetch()
   .rename("snake_case")
   .filter(_.is_adult == 0)
)
ratings = ibis.examples.imdb_title_ratings.fetch().rename("snake_case")

basics_ratings = ratings.join(basics, "tconst")

expr = (
   basics_ratings.group_by("title_type")
   .agg(kurt=lambda t: kurtosis(t.num_votes))
   .order_by(_.kurt.desc())
   .head()
)
expr
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ title_type    kurt        ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ stringfloat64     │
├──────────────┼─────────────┤
│ tvEpisode   8043.838209 │
│ tvSeries    4030.938238 │
│ short       3645.730119 │
│ tvMiniSeries1901.614316 │
│ tvMovie     1316.403908 │
└──────────────┴─────────────┘

Similarly for window functions:

expr = (
   basics_ratings.mutate(
      kurt=lambda t: kurtosis(t.num_votes).over(group_by="title_type")
   )
   .relocate("kurt", after="tconst")
   .filter(
      [
         _.original_title.lower().contains("godfather"),
         _.title_type == "movie",
         _.genres.contains("Crime") & _.genres.contains("Drama"),
      ]
   )
)
expr
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ tconst      kurt         average_rating  num_votes  title_type  primary_title           original_title          is_adult  start_year  end_year  runtime_minutes  genres             ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ stringfloat64float64int64stringstringstringint64int64stringint64string             │
├────────────┼─────────────┼────────────────┼───────────┼────────────┼────────────────────────┼────────────────────────┼──────────┼────────────┼──────────┼─────────────────┼────────────────────┤
│ tt0250404 1090.3638566.5244movie     Godfather             Godfather             01992NULLNULLCrime,Drama        │
│ tt0458027 1090.3638563.727movie     Mumbai Godfather      Mumbai Godfather      02005NULLNULLAction,Crime,Drama │
│ tt131303081090.3638565.27303movie     Godfather             Godfather             02022NULL157Action,Crime,Drama │
│ tt0068646 1090.3638569.21945537movie     The Godfather         The Godfather         01972NULL175Crime,Drama        │
│ tt0071562 1090.3638569.01321642movie     The Godfather Part II The Godfather Part II 01974NULL202Crime,Drama        │
│ tt0074412 1090.3638565.21733movie     Disco Godfather       Disco Godfather       01979NULL98Action,Crime,Drama │
│ tt0099674 1090.3638567.6412936movie     The Godfather Part IIIThe Godfather Part III01990NULL162Crime,Drama        │
└────────────┴─────────────┴────────────────┴───────────┴────────────┴────────────────────────┴────────────────────────┴──────────┴────────────┴──────────┴─────────────────┴────────────────────┘
Back to top