Backend agnostic arrays

arrays
bigquery
blog
cloud
duckdb
portability
Author

Phillip Cloud

Published

January 19, 2024

Introduction

This is a redux of a previous post showing Ibis’s portability in action.

Ibis is portable across complex operations and backends of very different scales and deployment models!

Results differ slightly between BigQuery and DuckDB

The datasets used in each backend are slightly different.

I opted to avoid ETL for the BigQuery backend by reusing the Google-provided IMDB dataset.

The tradeoff is the slight discrepancy in results.

Basics

We’ll start with from ibis.interactive import * for maximum convenience.

from ibis.interactive import *
1
from ibis.interactive import * imports Ibis APIs into the global namespace and enables interactive mode.

Connect to your database

ddb = ibis.connect("duckdb://")
ddb.create_table(
    "name_basics", ex.imdb_name_basics.fetch(backend=ddb).rename("snake_case")
)
ddb.create_table(
    "title_basics", ex.imdb_title_basics.fetch(backend=ddb).rename("snake_case")
)
1
Create a table called name_basics in our DuckDB database using ibis.examples data
2
Create a table called title_basics in our DuckDB database using ibis.examples data
┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ tconst     title_type  primary_title                                original_title                               is_adult  start_year  end_year  runtime_minutes  genres                   ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringstringstringstringint64int64stringint64string                   │
├───────────┼────────────┼─────────────────────────────────────────────┼─────────────────────────────────────────────┼──────────┼────────────┼──────────┼─────────────────┼──────────────────────────┤
│ tt0000001short     Carmencita                                 Carmencita                                 01894NULL1Documentary,Short        │
│ tt0000002short     Le clown et ses chiens                     Le clown et ses chiens                     01892NULL5Animation,Short          │
│ tt0000003short     Pauvre Pierrot                             Pauvre Pierrot                             01892NULL4Animation,Comedy,Romance │
│ tt0000004short     Un bon bock                                Un bon bock                                01892NULL12Animation,Short          │
│ tt0000005short     Blacksmith Scene                           Blacksmith Scene                           01893NULL1Comedy,Short             │
│ tt0000006short     Chinese Opium Den                          Chinese Opium Den                          01894NULL1Short                    │
│ tt0000007short     Corbett and Courtney Before the KinetographCorbett and Courtney Before the Kinetograph01894NULL1Short,Sport              │
│ tt0000008short     Edison Kinetoscopic Record of a Sneeze     Edison Kinetoscopic Record of a Sneeze     01894NULL1Documentary,Short        │
│ tt0000009movie     Miss Jerry                                 Miss Jerry                                 01894NULL45Romance                  │
│ tt0000010short     Leaving the Factory                        La sortie de l'usine Lumière à Lyon        01895NULL1Documentary,Short        │
│                         │
└───────────┴────────────┴─────────────────────────────────────────────┴─────────────────────────────────────────────┴──────────┴────────────┴──────────┴─────────────────┴──────────────────────────┘
bq = ibis.connect("bigquery://ibis-gbq")
bq.set_database("bigquery-public-data.imdb")
1
Google provides a public BigQuery dataset for IMDB data.

Let’s pull out the name_basics table, which contains names and metadata about people listed on IMDB. We’ll call this ents (short for entities), and remove some columns we won’t need:

ddb_ents = ddb.tables.name_basics.drop("birth_year", "death_year")
ddb_ents.order_by("nconst")
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ nconst     primary_name     primary_profession                   known_for_titles                        ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringstringstringstring                                  │
├───────────┼─────────────────┼─────────────────────────────────────┼─────────────────────────────────────────┤
│ nm0000001Fred Astaire   soundtrack,actor,miscellaneous     tt0053137,tt0072308,tt0045537,tt0050419 │
│ nm0000002Lauren Bacall  actress,soundtrack                 tt0037382,tt0117057,tt0075213,tt0038355 │
│ nm0000003Brigitte Bardotactress,soundtrack,music_departmenttt0057345,tt0054452,tt0049189,tt0056404 │
│ nm0000004John Belushi   actor,soundtrack,writer            tt0072562,tt0078723,tt0077975,tt0080455 │
│ nm0000005Ingmar Bergman writer,director,actor              tt0083922,tt0069467,tt0050976,tt0050986 │
│ nm0000006Ingrid Bergman actress,soundtrack,producer        tt0038109,tt0036855,tt0034583,tt0038787 │
│ nm0000007Humphrey Bogartactor,soundtrack,producer          tt0037382,tt0034583,tt0042593,tt0043265 │
│ nm0000008Marlon Brando  actor,soundtrack,director          tt0068646,tt0070849,tt0078788,tt0047296 │
│ nm0000009Richard Burton actor,soundtrack,producer          tt0057877,tt0059749,tt0061184,tt0087803 │
│ nm0000010James Cagney   actor,soundtrack,director          tt0042041,tt0035575,tt0029870,tt0031867 │
│                                        │
└───────────┴─────────────────┴─────────────────────────────────────┴─────────────────────────────────────────┘
bq_ents = bq.tables.name_basics.drop("birth_year", "death_year")
bq_ents.order_by("nconst")
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ nconst     primary_name     primary_profession                   known_for_titles                        ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringstringstringstring                                  │
├───────────┼─────────────────┼─────────────────────────────────────┼─────────────────────────────────────────┤
│ nm0000001Fred Astaire   soundtrack,actor,miscellaneous     tt0072308,tt0053137,tt0031983,tt0050419 │
│ nm0000002Lauren Bacall  actress,soundtrack                 tt0038355,tt0075213,tt0117057,tt0037382 │
│ nm0000003Brigitte Bardotactress,soundtrack,music_departmenttt0049189,tt0054452,tt0056404,tt0057345 │
│ nm0000004John Belushi   actor,soundtrack,writer            tt0072562,tt0078723,tt0080455,tt0077975 │
│ nm0000005Ingmar Bergman writer,director,actor              tt0050976,tt0083922,tt0069467,tt0050986 │
│ nm0000006Ingrid Bergman actress,soundtrack,producer        tt0034583,tt0038787,tt0038109,tt0036855 │
│ nm0000007Humphrey Bogartactor,soundtrack,producer          tt0037382,tt0043265,tt0034583,tt0042593 │
│ nm0000008Marlon Brando  actor,soundtrack,director          tt0068646,tt0070849,tt0047296,tt0078788 │
│ nm0000009Richard Burton actor,soundtrack,producer          tt0061184,tt0087803,tt0057877,tt0059749 │
│ nm0000010James Cagney   actor,soundtrack,director          tt0042041,tt0029870,tt0031867,tt0035575 │
│                                        │
└───────────┴─────────────────┴─────────────────────────────────────┴─────────────────────────────────────────┘

Splitting strings into arrays

We can see that known_for_titles looks sort of like an array, so let’s call the split method on that column and replace the existing column:

ddb_ents = ddb_ents.mutate(known_for_titles=_.known_for_titles.split(","))
ddb_ents.order_by("nconst")
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ nconst     primary_name     primary_profession                   known_for_titles                   ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringstringstringarray<string>                      │
├───────────┼─────────────────┼─────────────────────────────────────┼────────────────────────────────────┤
│ nm0000001Fred Astaire   soundtrack,actor,miscellaneous     ['tt0053137', 'tt0072308', ... +2] │
│ nm0000002Lauren Bacall  actress,soundtrack                 ['tt0037382', 'tt0117057', ... +2] │
│ nm0000003Brigitte Bardotactress,soundtrack,music_department['tt0057345', 'tt0054452', ... +2] │
│ nm0000004John Belushi   actor,soundtrack,writer            ['tt0072562', 'tt0078723', ... +2] │
│ nm0000005Ingmar Bergman writer,director,actor              ['tt0083922', 'tt0069467', ... +2] │
│ nm0000006Ingrid Bergman actress,soundtrack,producer        ['tt0038109', 'tt0036855', ... +2] │
│ nm0000007Humphrey Bogartactor,soundtrack,producer          ['tt0037382', 'tt0034583', ... +2] │
│ nm0000008Marlon Brando  actor,soundtrack,director          ['tt0068646', 'tt0070849', ... +2] │
│ nm0000009Richard Burton actor,soundtrack,producer          ['tt0057877', 'tt0059749', ... +2] │
│ nm0000010James Cagney   actor,soundtrack,director          ['tt0042041', 'tt0035575', ... +2] │
│                                   │
└───────────┴─────────────────┴─────────────────────────────────────┴────────────────────────────────────┘
bq_ents = bq_ents.mutate(known_for_titles=_.known_for_titles.split(","))
bq_ents.order_by("nconst")
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ nconst     primary_name     primary_profession                   known_for_titles                   ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringstringstringarray<string>                      │
├───────────┼─────────────────┼─────────────────────────────────────┼────────────────────────────────────┤
│ nm0000001Fred Astaire   soundtrack,actor,miscellaneous     ['tt0072308', 'tt0053137', ... +2] │
│ nm0000002Lauren Bacall  actress,soundtrack                 ['tt0038355', 'tt0075213', ... +2] │
│ nm0000003Brigitte Bardotactress,soundtrack,music_department['tt0049189', 'tt0054452', ... +2] │
│ nm0000004John Belushi   actor,soundtrack,writer            ['tt0072562', 'tt0078723', ... +2] │
│ nm0000005Ingmar Bergman writer,director,actor              ['tt0050976', 'tt0083922', ... +2] │
│ nm0000006Ingrid Bergman actress,soundtrack,producer        ['tt0034583', 'tt0038787', ... +2] │
│ nm0000007Humphrey Bogartactor,soundtrack,producer          ['tt0037382', 'tt0043265', ... +2] │
│ nm0000008Marlon Brando  actor,soundtrack,director          ['tt0068646', 'tt0070849', ... +2] │
│ nm0000009Richard Burton actor,soundtrack,producer          ['tt0061184', 'tt0087803', ... +2] │
│ nm0000010James Cagney   actor,soundtrack,director          ['tt0042041', 'tt0029870', ... +2] │
│                                   │
└───────────┴─────────────────┴─────────────────────────────────────┴────────────────────────────────────┘

Similarly for primary_profession, since people involved in show business often have more than one responsibility on a project:

ddb_ents = ddb_ents.mutate(primary_profession=_.primary_profession.split(","))
bq_ents = bq_ents.mutate(primary_profession=_.primary_profession.split(","))

Array length

Let’s see how many titles each entity is known for, and then show the five people with the largest number of titles they’re known for.

This is computed using the length API on array expressions:

(
    ddb_ents.select("primary_name", num_titles=_.known_for_titles.length())
    .order_by(_.num_titles.desc())
    .limit(5)
)
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ primary_name      num_titles ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ stringint64      │
├──────────────────┼────────────┤
│ Alex Koenigsmark5 │
│ Carrie Schnelker5 │
│ Henry Townsend  5 │
│ Sally Sun       5 │
│ Matthew Kavuma  5 │
└──────────────────┴────────────┘
(
    bq_ents.select("primary_name", num_titles=_.known_for_titles.length())
    .order_by(_.num_titles.desc())
    .limit(5)
)
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ primary_name         num_titles ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ stringint64      │
├─────────────────────┼────────────┤
│ José Jaime Espinosa6 │
│ Paul Winter        6 │
│ Nicolas Bernier    6 │
│ Chris Estrada      6 │
│ Tsuyotake Matsuda  5 │
└─────────────────────┴────────────┘

It seems like the length of the known_for_titles might be capped at some small number!

Index

We can see the position of "actor" or "actress" in primary_professions:

ddb_ents.primary_profession.index("actor")
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ArrayPosition(primary_profession, 'actor') ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ int64                                      │
├────────────────────────────────────────────┤
│                                          1 │
│                                         -1 │
│                                         -1 │
│                                          0 │
│                                          2 │
│                                         -1 │
│                                          0 │
│                                          0 │
│                                          0 │
│                                          0 │
│                                           │
└────────────────────────────────────────────┘
ddb_ents.primary_profession.index("actress")
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ArrayPosition(primary_profession, 'actress') ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ int64                                        │
├──────────────────────────────────────────────┤
│                                           -1 │
│                                            0 │
│                                            0 │
│                                           -1 │
│                                           -1 │
│                                            0 │
│                                           -1 │
│                                           -1 │
│                                           -1 │
│                                           -1 │
│                                             │
└──────────────────────────────────────────────┘
bq_ents.primary_profession.index("actor")
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ArrayPosition(primary_profession, 'actor') ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ int64                                      │
├────────────────────────────────────────────┤
│                                         -1 │
│                                         -1 │
│                                         -1 │
│                                         -1 │
│                                         -1 │
│                                         -1 │
│                                         -1 │
│                                         -1 │
│                                         -1 │
│                                         -1 │
│                                           │
└────────────────────────────────────────────┘
bq_ents.primary_profession.index("actress")
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ArrayPosition(primary_profession, 'actress') ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ int64                                        │
├──────────────────────────────────────────────┤
│                                           -1 │
│                                           -1 │
│                                           -1 │
│                                           -1 │
│                                           -1 │
│                                           -1 │
│                                           -1 │
│                                           -1 │
│                                           -1 │
│                                           -1 │
│                                             │
└──────────────────────────────────────────────┘

A return value of -1 indicates that "actor" is not present in the value.

Let’s look for entities that are not primarily actors.

We can do this using the index method by checking whether the positions of the strings "actor" or "actress" are both greater than 0:

actor_index = ddb_ents.primary_profession.index("actor")
actress_index = ddb_ents.primary_profession.index("actress")

ddb_not_primarily_acting = (actor_index > 0) & (actress_index > 0)
ddb_not_primarily_acting.mean()

0.0
actor_index = bq_ents.primary_profession.index("actor")
actress_index = bq_ents.primary_profession.index("actress")

bq_not_primarily_acting = (actor_index > 0) & (actress_index > 0)
bq_not_primarily_acting.mean()

0.0

Who are they?

ddb_ents[ddb_not_primarily_acting].order_by("nconst")
┏━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ nconst  primary_name  primary_profession  known_for_titles ┃
┡━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ stringstringarray<string>array<string>    │
└────────┴──────────────┴────────────────────┴──────────────────┘
bq_ents[bq_not_primarily_acting].order_by("nconst")
┏━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ nconst  primary_name  primary_profession  known_for_titles ┃
┡━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ stringstringarray<string>array<string>    │
└────────┴──────────────┴────────────────────┴──────────────────┘

It’s not 100% clear whether the order of elements in primary_profession matters here.

Containment

We can get people who are listed as actors or actresses using contains:

ddb_non_actors = bq_ents[
    ~_.primary_profession.contains("actor") & ~_.primary_profession.contains("actress")
]
ddb_non_actors.order_by("nconst")
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ nconst     primary_name      primary_profession                          known_for_titles                   ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringstringarray<string>array<string>                      │
├───────────┼──────────────────┼────────────────────────────────────────────┼────────────────────────────────────┤
│ nm0000016Georges Delerue ['composer', 'soundtrack', ... +1]['tt8847712', 'tt0091763', ... +2] │
│ nm0000025Jerry Goldsmith ['music_department', 'soundtrack', ... +1]['tt0077269', 'tt0117731', ... +2] │
│ nm0000033Alfred Hitchcock['director', 'producer', ... +1]['tt0054215', 'tt0052357', ... +2] │
│ nm0000035James Horner    ['music_department', 'soundtrack', ... +1]['tt0177971', 'tt0120338', ... +2] │
│ nm0000040Stanley Kubrick ['director', 'writer', ... +1]['tt0120663', 'tt0066921', ... +2] │
│ nm0000041Akira Kurosawa  ['writer', 'director', ... +1]['tt0080979', 'tt0089881', ... +2] │
│ nm0000049Henry Mancini   ['music_department', 'soundtrack', ... +1]['tt0383216', 'tt0054698', ... +2] │
│ nm0000055Alfred Newman   ['music_department', 'composer', ... +1]['tt0049408', 'tt0434409', ... +2] │
│ nm0000065Nino Rota       ['composer', 'soundtrack', ... +1]['tt0071562', 'tt0056801', ... +2] │
│ nm0000067Miklós Rózsa    ['music_department', 'composer', ... +1]['tt0052618', 'tt0038109', ... +2] │
│                                   │
└───────────┴──────────────────┴────────────────────────────────────────────┴────────────────────────────────────┘
bq_non_actors = bq_ents[
    ~_.primary_profession.contains("actor") & ~_.primary_profession.contains("actress")
]
bq_non_actors.order_by("nconst")
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ nconst     primary_name      primary_profession                          known_for_titles                   ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringstringarray<string>array<string>                      │
├───────────┼──────────────────┼────────────────────────────────────────────┼────────────────────────────────────┤
│ nm0000016Georges Delerue ['composer', 'soundtrack', ... +1]['tt8847712', 'tt0091763', ... +2] │
│ nm0000025Jerry Goldsmith ['music_department', 'soundtrack', ... +1]['tt0077269', 'tt0117731', ... +2] │
│ nm0000033Alfred Hitchcock['director', 'producer', ... +1]['tt0054215', 'tt0052357', ... +2] │
│ nm0000035James Horner    ['music_department', 'soundtrack', ... +1]['tt0177971', 'tt0120338', ... +2] │
│ nm0000040Stanley Kubrick ['director', 'writer', ... +1]['tt0120663', 'tt0066921', ... +2] │
│ nm0000041Akira Kurosawa  ['writer', 'director', ... +1]['tt0080979', 'tt0089881', ... +2] │
│ nm0000049Henry Mancini   ['music_department', 'soundtrack', ... +1]['tt0383216', 'tt0054698', ... +2] │
│ nm0000055Alfred Newman   ['music_department', 'composer', ... +1]['tt0049408', 'tt0434409', ... +2] │
│ nm0000065Nino Rota       ['composer', 'soundtrack', ... +1]['tt0071562', 'tt0056801', ... +2] │
│ nm0000067Miklós Rózsa    ['music_department', 'composer', ... +1]['tt0052618', 'tt0038109', ... +2] │
│                                   │
└───────────┴──────────────────┴────────────────────────────────────────────┴────────────────────────────────────┘

Element removal

We can remove elements from arrays too.

remove() does not mutate the underlying data

Let’s see who only has “actor” in the list of their primary professions:

ddb_ents.filter(
    [
        _.primary_profession.length() > 0,
        _.primary_profession.remove("actor").length() == 0,
        _.primary_profession.remove("actress").length() == 0,
    ]
).order_by("nconst")
┏━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ nconst  primary_name  primary_profession  known_for_titles ┃
┡━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ stringstringarray<string>array<string>    │
└────────┴──────────────┴────────────────────┴──────────────────┘
bq_ents.filter(
    [
        _.primary_profession.length() > 0,
        _.primary_profession.remove("actor").length() == 0,
        _.primary_profession.remove("actress").length() == 0,
    ]
).order_by("nconst")
┏━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ nconst  primary_name  primary_profession  known_for_titles ┃
┡━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ stringstringarray<string>array<string>    │
└────────┴──────────────┴────────────────────┴──────────────────┘

Slicing with square-bracket syntax

Let’s remove everyone’s first profession from the list, but only if they have more than one profession listed:

ddb_ents[_.primary_profession.length() > 1].mutate(
    primary_profession=_.primary_profession[1:],
).order_by("nconst")
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ nconst     primary_name     primary_profession                  known_for_titles                   ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringstringarray<string>array<string>                      │
├───────────┼─────────────────┼────────────────────────────────────┼────────────────────────────────────┤
│ nm0000001Fred Astaire   ['actor', 'miscellaneous']['tt0053137', 'tt0072308', ... +2] │
│ nm0000002Lauren Bacall  ['soundtrack']['tt0037382', 'tt0117057', ... +2] │
│ nm0000003Brigitte Bardot['soundtrack', 'music_department']['tt0057345', 'tt0054452', ... +2] │
│ nm0000004John Belushi   ['soundtrack', 'writer']['tt0072562', 'tt0078723', ... +2] │
│ nm0000005Ingmar Bergman ['director', 'actor']['tt0083922', 'tt0069467', ... +2] │
│ nm0000006Ingrid Bergman ['soundtrack', 'producer']['tt0038109', 'tt0036855', ... +2] │
│ nm0000007Humphrey Bogart['soundtrack', 'producer']['tt0037382', 'tt0034583', ... +2] │
│ nm0000008Marlon Brando  ['soundtrack', 'director']['tt0068646', 'tt0070849', ... +2] │
│ nm0000009Richard Burton ['soundtrack', 'producer']['tt0057877', 'tt0059749', ... +2] │
│ nm0000010James Cagney   ['soundtrack', 'director']['tt0042041', 'tt0035575', ... +2] │
│                                   │
└───────────┴─────────────────┴────────────────────────────────────┴────────────────────────────────────┘
bq_ents[_.primary_profession.length() > 1].mutate(
    primary_profession=_.primary_profession[1:],
).order_by("nconst")
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ nconst     primary_name     primary_profession                  known_for_titles                   ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringstringarray<string>array<string>                      │
├───────────┼─────────────────┼────────────────────────────────────┼────────────────────────────────────┤
│ nm0000001Fred Astaire   ['actor', 'miscellaneous']['tt0072308', 'tt0053137', ... +2] │
│ nm0000002Lauren Bacall  ['soundtrack']['tt0038355', 'tt0075213', ... +2] │
│ nm0000003Brigitte Bardot['soundtrack', 'music_department']['tt0049189', 'tt0054452', ... +2] │
│ nm0000004John Belushi   ['soundtrack', 'writer']['tt0072562', 'tt0078723', ... +2] │
│ nm0000005Ingmar Bergman ['director', 'actor']['tt0050976', 'tt0083922', ... +2] │
│ nm0000006Ingrid Bergman ['soundtrack', 'producer']['tt0034583', 'tt0038787', ... +2] │
│ nm0000007Humphrey Bogart['soundtrack', 'producer']['tt0037382', 'tt0043265', ... +2] │
│ nm0000008Marlon Brando  ['soundtrack', 'director']['tt0068646', 'tt0070849', ... +2] │
│ nm0000009Richard Burton ['soundtrack', 'producer']['tt0061184', 'tt0087803', ... +2] │
│ nm0000010James Cagney   ['soundtrack', 'director']['tt0042041', 'tt0029870', ... +2] │
│                                   │
└───────────┴─────────────────┴────────────────────────────────────┴────────────────────────────────────┘

Set operations and sorting

Treating arrays as sets is possible with the union and intersect APIs.

Let’s take a look at intersect.

Intersection

Let’s see if we can use array intersection to figure which actors share known-for titles and sort the result:

left = ddb_ents.filter(_.known_for_titles.length() > 0).limit(10_000)
right = left.view()
shared_titles = (
    left
    .join(right, left.nconst != right.nconst)
    .select(
        s.startswith("known_for_titles"),
        left_name="primary_name",
        right_name="primary_name_right",
    )
    .filter(_.known_for_titles.intersect(_.known_for_titles_right).length() > 0)
    .group_by(name="left_name")
    .agg(together_with=_.right_name.collect())
    .mutate(together_with=_.together_with.unique().sort())
)
shared_titles
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ name                  together_with                                 ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringarray<string>                                 │
├──────────────────────┼───────────────────────────────────────────────┤
│ Ava Gardner         ['Ernest Gold', 'Fred Astaire']               │
│ Cyd Charisse        ['Fred Astaire']                              │
│ John Landis         ['Dan Aykroyd', 'Dick Ziker', ... +14]        │
│ Michael Curtiz      ['Alan Hale', 'Ann Blyth', ... +19]           │
│ Francis Ford Coppola['Abe Vigoda', 'Al Pacino', ... +19]          │
│ Bernardo Bertolucci ['Armand Abplanalp', 'James Acheson', ... +3] │
│ Karl Malden         ['Abraxas Aaran', 'Alex North', ... +14]      │
│ Richard Conte       ['Abe Vigoda', 'Al Pacino', ... +9]           │
│ George Orwell       ['John Hurt', 'Richard Burton']               │
│ Joseph L. Mankiewicz['Alfred Newman', 'Anne Baxter', ... +13]     │
│                                              │
└──────────────────────┴───────────────────────────────────────────────┘
left = bq_ents.filter(_.known_for_titles.length() > 0).limit(10_000)
right = left.view()
shared_titles = (
    left
    .join(right, left.nconst != right.nconst)
    .select(
        s.startswith("known_for_titles"),
        left_name="primary_name",
        right_name="primary_name_right",
    )
    .filter(_.known_for_titles.intersect(_.known_for_titles_right).length() > 0)
    .group_by(name="left_name")
    .agg(together_with=_.right_name.collect())
    .mutate(together_with=_.together_with.unique().sort())
)
shared_titles
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ name                     together_with                                       ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringarray<string>                                       │
├─────────────────────────┼─────────────────────────────────────────────────────┤
│ Pavel Vrba             ['F.C. Lokomotiv Moscow', 'Jeffrey Bruma', ... +4]  │
│ Greg Carrolan          ['Al Cambronne', 'Alana Tornello', ... +20]         │
│ Aleksander Parzychowski['Adam Korszun', 'Grzegorz Wawrzenczyk', ... +5]    │
│ James Walt             ['Anton Testino', 'Ben Walanka', ... +10]           │
│ Ellen Dallaglio        ['Antonia Giovanazzi', 'Fra McCann', ... +10]       │
│ Catarina Martins       ['Miguel Oliveira', 'Ricardo Gordon', ... +1]       │
│ Stanislav Sesták       ['Martin Glenn', 'Miso Brecko', ... +6]             │
│ Allison Cabot          ['Brenda Beard', 'Brian Fenmore', ... +16]          │
│ Vasilis Bouzianas      ['Aggelos Kasolas', 'Christos Patriarheas', ... +3] │
│ Marie Muldoon          ['Alan Oxley', 'Andrew Raeber', ... +39]            │
│                                                    │
└─────────────────────────┴─────────────────────────────────────────────────────┘

Advanced operations

Flatten arrays into rows

Thanks to the tireless efforts of the folks working on sqlglot, as of version 7.0.0 Ibis supports unnest for BigQuery!

You can use it standalone on a column expression:

ddb_ents.primary_profession.unnest()
┏━━━━━━━━━━━━━━━━━━━━┓
┃ primary_profession ┃
┡━━━━━━━━━━━━━━━━━━━━┩
│ string             │
├────────────────────┤
│ soundtrack         │
│ actor              │
│ miscellaneous      │
│ actress            │
│ soundtrack         │
│ actress            │
│ soundtrack         │
│ music_department   │
│ actor              │
│ soundtrack         │
│                   │
└────────────────────┘
bq_ents.primary_profession.unnest()
┏━━━━━━━━━━━━━━━━━━━━┓
┃ primary_profession ┃
┡━━━━━━━━━━━━━━━━━━━━┩
│ string             │
├────────────────────┤
│ actor              │
│ actor              │
│ actor              │
│ actor              │
│ actor              │
│ actor              │
│ actor              │
│ actor              │
│ actor              │
│ actor              │
│                   │
└────────────────────┘

You can also use it in select/mutate calls to expand the table accordingly:

ddb_ents.mutate(primary_profession=_.primary_profession.unnest()).order_by("nconst")
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ nconst     primary_name     primary_profession  known_for_titles                   ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringstringstringarray<string>                      │
├───────────┼─────────────────┼────────────────────┼────────────────────────────────────┤
│ nm0000001Fred Astaire   soundtrack        ['tt0053137', 'tt0072308', ... +2] │
│ nm0000001Fred Astaire   miscellaneous     ['tt0053137', 'tt0072308', ... +2] │
│ nm0000001Fred Astaire   actor             ['tt0053137', 'tt0072308', ... +2] │
│ nm0000002Lauren Bacall  soundtrack        ['tt0037382', 'tt0117057', ... +2] │
│ nm0000002Lauren Bacall  actress           ['tt0037382', 'tt0117057', ... +2] │
│ nm0000003Brigitte Bardotmusic_department  ['tt0057345', 'tt0054452', ... +2] │
│ nm0000003Brigitte Bardotsoundtrack        ['tt0057345', 'tt0054452', ... +2] │
│ nm0000003Brigitte Bardotactress           ['tt0057345', 'tt0054452', ... +2] │
│ nm0000004John Belushi   soundtrack        ['tt0072562', 'tt0078723', ... +2] │
│ nm0000004John Belushi   writer            ['tt0072562', 'tt0078723', ... +2] │
│                                   │
└───────────┴─────────────────┴────────────────────┴────────────────────────────────────┘
bq_ents.mutate(primary_profession=_.primary_profession.unnest()).order_by("nconst")
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ nconst     primary_name     primary_profession  known_for_titles                   ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringstringstringarray<string>                      │
├───────────┼─────────────────┼────────────────────┼────────────────────────────────────┤
│ nm0000001Fred Astaire   miscellaneous     ['tt0072308', 'tt0053137', ... +2] │
│ nm0000001Fred Astaire   actor             ['tt0072308', 'tt0053137', ... +2] │
│ nm0000001Fred Astaire   soundtrack        ['tt0072308', 'tt0053137', ... +2] │
│ nm0000002Lauren Bacall  actress           ['tt0038355', 'tt0075213', ... +2] │
│ nm0000002Lauren Bacall  soundtrack        ['tt0038355', 'tt0075213', ... +2] │
│ nm0000003Brigitte Bardotmusic_department  ['tt0049189', 'tt0054452', ... +2] │
│ nm0000003Brigitte Bardotsoundtrack        ['tt0049189', 'tt0054452', ... +2] │
│ nm0000003Brigitte Bardotactress           ['tt0049189', 'tt0054452', ... +2] │
│ nm0000004John Belushi   soundtrack        ['tt0072562', 'tt0078723', ... +2] │
│ nm0000004John Belushi   actor             ['tt0072562', 'tt0078723', ... +2] │
│                                   │
└───────────┴─────────────────┴────────────────────┴────────────────────────────────────┘

Unnesting can be useful when joining nested data.

Here we use unnest to find people known for any of the godfather movies:

basics = ddb.tables.title_basics.filter(
    [
        _.title_type == "movie",
        _.original_title.lower().startswith("the godfather"),
        _.genres.lower().contains("crime"),
    ]
)

ddb_known_for_the_godfather = (
    ddb_ents.mutate(tconst=_.known_for_titles.unnest())
    .join(basics, "tconst")
    .select("primary_title", "primary_name")
    .distinct()
    .order_by(["primary_title", "primary_name"])
)
ddb_known_for_the_godfather
1
Filter the title_basics data set to only the Godfather movies
2
Unnest the known_for_titles array column
3
Join with basics to get movie titles
4
Ensure that each entity is only listed once and sort the results
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ primary_title  primary_name        ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ stringstring              │
├───────────────┼─────────────────────┤
│ The GodfatherA. Emmett Adams     │
│ The GodfatherAbe Vigoda          │
│ The GodfatherAl Lettieri         │
│ The GodfatherAl Martino          │
│ The GodfatherAl Pacino           │
│ The GodfatherAlbert S. Ruddy     │
│ The GodfatherAlex Rocco          │
│ The GodfatherAndrea Eastman      │
│ The GodfatherAngelo Infanti      │
│ The GodfatherAnna Hill Johnstone │
│                    │
└───────────────┴─────────────────────┘
basics = bq.tables.title_basics.filter(
    [
        _.title_type == "movie",
        _.original_title.lower().startswith("the godfather"),
        _.genres.lower().contains("crime"),
    ]
)

bq_known_for_the_godfather = (
    bq_ents.mutate(tconst=_.known_for_titles.unnest())
    .join(basics, "tconst")
    .select("primary_title", "primary_name")
    .distinct()
    .order_by(["primary_title", "primary_name"])
)
bq_known_for_the_godfather
1
Filter the title_basics data set to only the Godfather movies
2
Unnest the known_for_titles array column
3
Join with basics to get movie titles
4
Ensure that each entity is only listed once and sort the results
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ primary_title  primary_name        ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ stringstring              │
├───────────────┼─────────────────────┤
│ The GodfatherA. Emmett Adams     │
│ The GodfatherAbe Vigoda          │
│ The GodfatherAl Lettieri         │
│ The GodfatherAl Martino          │
│ The GodfatherAl Pacino           │
│ The GodfatherAlbert S. Ruddy     │
│ The GodfatherAlex Rocco          │
│ The GodfatherAndrea Eastman      │
│ The GodfatherAngelo Infanti      │
│ The GodfatherAnna Hill Johnstone │
│                    │
└───────────────┴─────────────────────┘

Let’s summarize by showing how many people are known for each Godfather movie:

ddb_known_for_the_godfather.primary_title.value_counts()
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ primary_title           primary_title_count ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ stringint64               │
├────────────────────────┼─────────────────────┤
│ The Godfather Part II 117 │
│ The Godfather         93 │
│ The Godfather Part III196 │
└────────────────────────┴─────────────────────┘
bq_known_for_the_godfather.primary_title.value_counts()
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ primary_title           primary_title_count ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ stringint64               │
├────────────────────────┼─────────────────────┤
│ The Godfather Part II 114 │
│ The Godfather Part III202 │
│ The Godfather         97 │
└────────────────────────┴─────────────────────┘

Filtering array elements

Filtering array elements can be done with the filter method, which applies a predicate to each array element and returns an array of elements for which the predicate returns True.

This method is similar to Python’s filter function.

Let’s show all people who are neither editors nor actors:

ddb_ents.mutate(
    primary_profession=_.primary_profession.filter(
        lambda pp: ~pp.isin(("actor", "actress", "editor"))
    )
).filter(_.primary_profession.length() > 0).order_by("nconst")
1
This filter call is applied to each array element
2
This filter call is applied to the table
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ nconst     primary_name     primary_profession                  known_for_titles                   ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringstringarray<string>array<string>                      │
├───────────┼─────────────────┼────────────────────────────────────┼────────────────────────────────────┤
│ nm0000001Fred Astaire   ['soundtrack', 'miscellaneous']['tt0053137', 'tt0072308', ... +2] │
│ nm0000002Lauren Bacall  ['soundtrack']['tt0037382', 'tt0117057', ... +2] │
│ nm0000003Brigitte Bardot['soundtrack', 'music_department']['tt0057345', 'tt0054452', ... +2] │
│ nm0000004John Belushi   ['soundtrack', 'writer']['tt0072562', 'tt0078723', ... +2] │
│ nm0000005Ingmar Bergman ['writer', 'director']['tt0083922', 'tt0069467', ... +2] │
│ nm0000006Ingrid Bergman ['soundtrack', 'producer']['tt0038109', 'tt0036855', ... +2] │
│ nm0000007Humphrey Bogart['soundtrack', 'producer']['tt0037382', 'tt0034583', ... +2] │
│ nm0000008Marlon Brando  ['soundtrack', 'director']['tt0068646', 'tt0070849', ... +2] │
│ nm0000009Richard Burton ['soundtrack', 'producer']['tt0057877', 'tt0059749', ... +2] │
│ nm0000010James Cagney   ['soundtrack', 'director']['tt0042041', 'tt0035575', ... +2] │
│                                   │
└───────────┴─────────────────┴────────────────────────────────────┴────────────────────────────────────┘
bq_ents.mutate(
    primary_profession=_.primary_profession.filter(
        lambda pp: ~pp.isin(("actor", "actress", "editor"))
    )
).filter(_.primary_profession.length() > 0).order_by("nconst")
1
This filter call is applied to each array element
2
This filter call is applied to the table
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ nconst     primary_name     primary_profession                  known_for_titles                   ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringstringarray<string>array<string>                      │
├───────────┼─────────────────┼────────────────────────────────────┼────────────────────────────────────┤
│ nm0000001Fred Astaire   ['soundtrack', 'miscellaneous']['tt0072308', 'tt0053137', ... +2] │
│ nm0000002Lauren Bacall  ['soundtrack']['tt0038355', 'tt0075213', ... +2] │
│ nm0000003Brigitte Bardot['soundtrack', 'music_department']['tt0049189', 'tt0054452', ... +2] │
│ nm0000004John Belushi   ['soundtrack', 'writer']['tt0072562', 'tt0078723', ... +2] │
│ nm0000005Ingmar Bergman ['writer', 'director']['tt0050976', 'tt0083922', ... +2] │
│ nm0000006Ingrid Bergman ['soundtrack', 'producer']['tt0034583', 'tt0038787', ... +2] │
│ nm0000007Humphrey Bogart['soundtrack', 'producer']['tt0037382', 'tt0043265', ... +2] │
│ nm0000008Marlon Brando  ['soundtrack', 'director']['tt0068646', 'tt0070849', ... +2] │
│ nm0000009Richard Burton ['soundtrack', 'producer']['tt0061184', 'tt0087803', ... +2] │
│ nm0000010James Cagney   ['soundtrack', 'director']['tt0042041', 'tt0029870', ... +2] │
│                                   │
└───────────┴─────────────────┴────────────────────────────────────┴────────────────────────────────────┘

Applying a function to array elements

You can apply a function to run an ibis expression on each element of an array using the map method.

Let’s normalize the case of primary_profession to upper case:

ddb_ents.mutate(
    primary_profession=_.primary_profession.map(lambda pp: pp.upper())
).filter(_.primary_profession.length() > 0).order_by("nconst")
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ nconst     primary_name     primary_profession                 known_for_titles                   ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringstringarray<string>array<string>                      │
├───────────┼─────────────────┼───────────────────────────────────┼────────────────────────────────────┤
│ nm0000001Fred Astaire   ['SOUNDTRACK', 'ACTOR', ... +1]['tt0053137', 'tt0072308', ... +2] │
│ nm0000002Lauren Bacall  ['ACTRESS', 'SOUNDTRACK']['tt0037382', 'tt0117057', ... +2] │
│ nm0000003Brigitte Bardot['ACTRESS', 'SOUNDTRACK', ... +1]['tt0057345', 'tt0054452', ... +2] │
│ nm0000004John Belushi   ['ACTOR', 'SOUNDTRACK', ... +1]['tt0072562', 'tt0078723', ... +2] │
│ nm0000005Ingmar Bergman ['WRITER', 'DIRECTOR', ... +1]['tt0083922', 'tt0069467', ... +2] │
│ nm0000006Ingrid Bergman ['ACTRESS', 'SOUNDTRACK', ... +1]['tt0038109', 'tt0036855', ... +2] │
│ nm0000007Humphrey Bogart['ACTOR', 'SOUNDTRACK', ... +1]['tt0037382', 'tt0034583', ... +2] │
│ nm0000008Marlon Brando  ['ACTOR', 'SOUNDTRACK', ... +1]['tt0068646', 'tt0070849', ... +2] │
│ nm0000009Richard Burton ['ACTOR', 'SOUNDTRACK', ... +1]['tt0057877', 'tt0059749', ... +2] │
│ nm0000010James Cagney   ['ACTOR', 'SOUNDTRACK', ... +1]['tt0042041', 'tt0035575', ... +2] │
│                                   │
└───────────┴─────────────────┴───────────────────────────────────┴────────────────────────────────────┘
bq_ents.mutate(
    primary_profession=_.primary_profession.map(lambda pp: pp.upper())
).filter(_.primary_profession.length() > 0).order_by("nconst")
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ nconst     primary_name     primary_profession                 known_for_titles                   ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringstringarray<string>array<string>                      │
├───────────┼─────────────────┼───────────────────────────────────┼────────────────────────────────────┤
│ nm0000001Fred Astaire   ['SOUNDTRACK', 'ACTOR', ... +1]['tt0072308', 'tt0053137', ... +2] │
│ nm0000002Lauren Bacall  ['ACTRESS', 'SOUNDTRACK']['tt0038355', 'tt0075213', ... +2] │
│ nm0000003Brigitte Bardot['ACTRESS', 'SOUNDTRACK', ... +1]['tt0049189', 'tt0054452', ... +2] │
│ nm0000004John Belushi   ['ACTOR', 'SOUNDTRACK', ... +1]['tt0072562', 'tt0078723', ... +2] │
│ nm0000005Ingmar Bergman ['WRITER', 'DIRECTOR', ... +1]['tt0050976', 'tt0083922', ... +2] │
│ nm0000006Ingrid Bergman ['ACTRESS', 'SOUNDTRACK', ... +1]['tt0034583', 'tt0038787', ... +2] │
│ nm0000007Humphrey Bogart['ACTOR', 'SOUNDTRACK', ... +1]['tt0037382', 'tt0043265', ... +2] │
│ nm0000008Marlon Brando  ['ACTOR', 'SOUNDTRACK', ... +1]['tt0068646', 'tt0070849', ... +2] │
│ nm0000009Richard Burton ['ACTOR', 'SOUNDTRACK', ... +1]['tt0061184', 'tt0087803', ... +2] │
│ nm0000010James Cagney   ['ACTOR', 'SOUNDTRACK', ... +1]['tt0042041', 'tt0029870', ... +2] │
│                                   │
└───────────┴─────────────────┴───────────────────────────────────┴────────────────────────────────────┘

Conclusion

Ibis has a sizable collection of array APIs that work with many different backends and as of version 7.0.0, Ibis supports a much larger set of those APIs for BigQuery!

Check out the API documentation for the full set of available methods.

Try it out, and let us know what you think.

Back to top