Working with arrays in Google BigQuery

blog
bigquery
arrays
cloud
Author

Phillip Cloud

Published

September 12, 2023

Introduction

Ibis and BigQuery have worked well together for years.

In Ibis 7.0.0, they work even better together with the addition of array functionality for BigQuery.

Let’s look at some examples using BigQuery’s IMDB sample data.

Basics

First we’ll connect to BigQuery and pluck out a table to work with.

We’ll start with from ibis.interactive import * for maximum convenience.

from ibis.interactive import *

con = ibis.connect("bigquery://ibis-gbq")
con.set_database("bigquery-public-data.imdb")
1
from ibis.interactive import * imports Ibis APIs into the global namespace and enables interactive mode.
2
Connect to Google BigQuery. Compute (but not storage) is billed to the project you connect to–ibis-gbq in this case.
3
Set the database to the project and dataset that we will use for analysis.

Let’s look at the tables in this dataset:

con.tables
Tables
------
- name_basics
- reviews
- title_akas
- title_basics
- title_crew
- title_episode
- title_principals
- title_ratings

Let’s pull out the name_basics table, which contains names and metadata about people listed on IMDB. We’ll call this ents (short for entities), and remove some columns we won’t need:

ents = con.tables.name_basics.drop("birth_year", "death_year")
ents
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ nconst     primary_name       primary_profession  known_for_titles    ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ stringstringstringstring              │
├───────────┼───────────────────┼────────────────────┼─────────────────────┤
│ nm7195872Amanda Goetz     NULLtt4529508           │
│ nm7204100Overload         NULLtt4828308,tt4538296 │
│ nm7206569Carl Winter      NULLNULL                │
│ nm7208626Doug Goodin      NULLNULL                │
│ nm7222505Rickard Finndahl NULLtt4519546           │
│ nm7226759Kenneth Bell     NULLtt3545908           │
│ nm7227158Savannah Gardner NULLtt4028790           │
│ nm7246216Elisabeth HofmannNULLtt4586074           │
│ nm7253303Wisda Febriyanti NULLtt4594232           │
│ nm7255948Charles Myers    NULLtt2396758           │
│                    │
└───────────┴───────────────────┴────────────────────┴─────────────────────┘

Splitting strings into arrays

We can see that known_for_titles looks sort of like an array, so let’s call the split method on that column and replace the existing column:

ents = ents.mutate(known_for_titles=_.known_for_titles.split(","))
ents
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ nconst     primary_name       primary_profession  known_for_titles           ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringstringstringarray<string>              │
├───────────┼───────────────────┼────────────────────┼────────────────────────────┤
│ nm7195872Amanda Goetz     NULL['tt4529508']              │
│ nm7204100Overload         NULL['tt4828308', 'tt4538296'] │
│ nm7206569Carl Winter      NULL[]                         │
│ nm7208626Doug Goodin      NULL[]                         │
│ nm7222505Rickard Finndahl NULL['tt4519546']              │
│ nm7226759Kenneth Bell     NULL['tt3545908']              │
│ nm7227158Savannah Gardner NULL['tt4028790']              │
│ nm7246216Elisabeth HofmannNULL['tt4586074']              │
│ nm7253303Wisda Febriyanti NULL['tt4594232']              │
│ nm7255948Charles Myers    NULL['tt2396758']              │
│                           │
└───────────┴───────────────────┴────────────────────┴────────────────────────────┘

Similarly for primary_profession, since people involved in show business often have more than one responsibility on a project:

ents = ents.mutate(primary_profession=_.primary_profession.split(","))

Array length

Let’s see how many titles each entity is known for, and then show the five people with the largest number of titles they’re known for:

This is computed using the length API on array expressions:

(
    ents.select("primary_name", num_titles=_.known_for_titles.length())
    .order_by(_.num_titles.desc())
    .limit(5)
)
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ primary_name      num_titles ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ stringint64      │
├──────────────────┼────────────┤
│ Sally Sun       5 │
│ Matthew Kavuma  5 │
│ Henry Townsend  5 │
│ Alex Koenigsmark5 │
│ Carrie Schnelker5 │
└──────────────────┴────────────┘

It seems like the length of the known_for_titles might be capped at five!

Index

We can see the position of "actor" in primary_professions:

ents.primary_profession.index("actor")
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ArrayPosition(primary_profession, 'actor') ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ int64                                      │
├────────────────────────────────────────────┤
│                                         -1 │
│                                         -1 │
│                                         -1 │
│                                         -1 │
│                                         -1 │
│                                         -1 │
│                                         -1 │
│                                         -1 │
│                                         -1 │
│                                         -1 │
│                                           │
└────────────────────────────────────────────┘

A return value of -1 indicates that "actor" is not present in the value:

Let’s look for entities that are not primarily actors:

We can do this using the index method by checking whether the position of the string "actor" is greater than zero:

actor_index = ents.primary_profession.index("actor")
not_primarily_actors = actor_index > 0
not_primarily_actors.mean()
1
The average of a bool column gives the percentage of True values

0.019474437073314168

Who are they?

ents[not_primarily_actors]
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ nconst      primary_name        primary_profession   known_for_titles                    ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringstringarray<string>array<string>                       │
├────────────┼────────────────────┼─────────────────────┼─────────────────────────────────────┤
│ nm2231782 Rene Tovar        ['legal', 'actor']['tt21996928']                      │
│ nm2250015 Stephen Clark     ['legal', 'actor']['tt1452628', 'tt14372154', ... +2] │
│ nm0352162 Brett Haber       ['legal', 'actor']['tt1720280', 'tt10928526', ... +2] │
│ nm12169237Endi Ndini        ['editor', 'actor']['tt4557810', 'tt14137514', ... +2] │
│ nm14475156Colby White       ['editor', 'actor']['tt26313337']                      │
│ nm8979480 Bartosz Strusewicz['editor', 'actor']['tt6685946', 'tt7334964', ... +2]  │
│ nm3116354 Robert Marquis    ['editor', 'actor']['tt8376014', 'tt1283513']          │
│ nm9962305 Vino Domi         ['editor', 'actor']['tt8680966', 'tt8669176']          │
│ nm10346617Lucas Oliveira    ['editor', 'actor']['tt9483226', 'tt7216954', ... +1]  │
│ nm7206820 Prince Sethi      ['editor', 'actor']['tt14396686', 'tt4219300']         │
│                                    │
└────────────┴────────────────────┴─────────────────────┴─────────────────────────────────────┘

It’s not 100% clear whether the order of elements in primary_profession matters here.

Containment

We can get people who are not actors using contains:

non_actors = ents[~ents.primary_profession.contains("actor")]
non_actors
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ nconst      primary_name           primary_profession  known_for_titles ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ stringstringarray<string>array<string>    │
├────────────┼───────────────────────┼────────────────────┼──────────────────┤
│ nm13613518Silvia Vannini       ['legal'][]               │
│ nm11482673Umit Yildirim        ['legal'][]               │
│ nm14796117Kendall Jackson      ['legal'][]               │
│ nm3922637 Michael J. Douglas   ['legal'][]               │
│ nm5249145 Christopher Addy     ['legal'][]               │
│ nm9235293 Baolu Lan            ['legal'][]               │
│ nm14560328Jean Paul S Voilleque['legal'][]               │
│ nm11250663Kelly D. Shapiro     ['legal'][]               │
│ nm11355058Sameer Oberoi        ['legal'][]               │
│ nm8655635 James Madison        ['legal'][]               │
│                 │
└────────────┴───────────────────────┴────────────────────┴──────────────────┘

Element removal

We can remove elements from arrays too.

remove() does not mutate the underlying data

Let’s see who only has “actor” in the list of their primary professions:

ents.filter(
    [
        _.primary_profession.length() > 0,
        _.primary_profession.remove("actor").length() == 0,
    ]
)
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ nconst     primary_name        primary_profession  known_for_titles                    ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringstringarray<string>array<string>                       │
├───────────┼────────────────────┼────────────────────┼─────────────────────────────────────┤
│ nm7217990Jay Isbell        ['actor']['tt4450682']                       │
│ nm7218053Eric Crowell      ['actor']['tt4500196']                       │
│ nm7218081John Wyman        ['actor']['tt4500196']                       │
│ nm7223556Daniel Hope       ['actor']['tt4558584', 'tt9089514']          │
│ nm7223623Marcus Troy       ['actor']['tt0120660', 'tt10914400', ... +2] │
│ nm7241836Havár Csongor     ['actor']['tt4580414']                       │
│ nm7242608Seigô Uetaki      ['actor']['tt4581192']                       │
│ nm7245253Mahmoud El Faituri['actor']['tt2849138']                       │
│ nm7254729Tom Keesey        ['actor']['tt0924155', 'tt0924156', ... +2]  │
│ nm7280985Gabriel Garcia    ['actor']['tt4629714']                       │
│                                    │
└───────────┴────────────────────┴────────────────────┴─────────────────────────────────────┘

Slicing with square-bracket syntax

Let’s remove everyone’s first profession from the list, but only if they have more than one profession listed:

ents[_.primary_profession.length() > 1].mutate(
    primary_profession=_.primary_profession[1:],
)
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ nconst      primary_name            primary_profession  known_for_titles           ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringstringarray<string>array<string>              │
├────────────┼────────────────────────┼────────────────────┼────────────────────────────┤
│ nm3146692 Keith Sutton          ['legal']['tt0472984']              │
│ nm2974992 Dávid Farkas          ['legal']['tt0114301']              │
│ nm6819046 Walter Batt           ['legal']['tt8592196']              │
│ nm2231544 Don Steele            ['actor']['tt0818746']              │
│ nm7267327 Navarro Gray          ['actor']['tt1718437', 'tt7945012'] │
│ nm7783929 Christopher T. Connell['editor']['tt7510258', 'tt5262988'] │
│ nm6894982 Jason Robert Moore    ['editor']['tt4177962']              │
│ nm10526935Damien Mota           ['editor']['tt9889740']              │
│ nm7151380 Chris Villa           ['editor']['tt4479468']              │
│ nm7641135 Curt Champagne        ['editor']['tt5097098']              │
│                           │
└────────────┴────────────────────────┴────────────────────┴────────────────────────────┘

Set operations and sorting

Treating arrays as sets is possible with the union and intersect APIs.

Let’s take a look at intersect:

Intersection

Let’s see if we can use array intersection to figure which actors share known-for titles and sort the result:

left = ents.filter(_.known_for_titles.length() > 0).limit(10_000)
right = left.view()
shared_titles = (
    left
    .join(right, left.nconst != right.nconst)
    .select(
        s.startswith("known_for_titles"),
        left_name="primary_name",
        right_name="primary_name_right",
    )
    .filter(_.known_for_titles.intersect(_.known_for_titles_right).length() > 0)
    .group_by(name="left_name")
    .agg(together_with=_.right_name.collect())
    .mutate(together_with=_.together_with.unique().sort())
)
shared_titles
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ name                   together_with                                       ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringarray<string>                                       │
├───────────────────────┼─────────────────────────────────────────────────────┤
│ Chief Willie Sellars ['Amy Tan', 'Bruce Williams', ... +13]              │
│ Rainer Spanagel      ['Andra Arnicane', 'Aziz Sheikh', ... +9]           │
│ Sam Cooper           ['Amy Tan', 'Bruce Williams', ... +10]              │
│ Michelle Porter      ['Amy Tan', 'Bruce Williams', ... +8]               │
│ Bryan Carter         ['Andrew Bower', 'Austin Murtha', ... +11]          │
│ Timothy Johnson      ['Betsy Blaney', 'Brendan Halko', ... +19]          │
│ Colin McLean         ['Alison Garnham', 'Ashley Fox', ... +20]           │
│ Jessica Williams     ['Austin Williams']                                 │
│ Leighann Falcon      ['Aaron Green', 'Alejandro Garza y Garza', ... +71] │
│ Oldham Tuneless Choir['Alex Matvienko', 'Andy McDonald', ... +38]        │
│                                                    │
└───────────────────────┴─────────────────────────────────────────────────────┘

Advanced operations

Flatten arrays into rows

Thanks to the tireless efforts of the folks working on sqlglot, as of version 7.0.0 Ibis supports unnest for BigQuery!

You can use it standalone on a column expression:

ents.primary_profession.unnest()
┏━━━━━━━━━━━━━━━━━━━━┓
┃ primary_profession ┃
┡━━━━━━━━━━━━━━━━━━━━┩
│ string             │
├────────────────────┤
│ actor              │
│ actor              │
│ actor              │
│ actor              │
│ actor              │
│ actor              │
│ actor              │
│ actor              │
│ actor              │
│ actor              │
│                   │
└────────────────────┘

You can also use it in select/mutate calls to expand the table accordingly:

ents.mutate(primary_profession=_.primary_profession.unnest())
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ nconst     primary_name        primary_profession  known_for_titles ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ stringstringstringarray<string>    │
├───────────┼────────────────────┼────────────────────┼──────────────────┤
│ nm7211030Josh Berry        actor             ['tt4046896']    │
│ nm7211205Alan Douglas      actor             ['tt0038449']    │
│ nm7213536Wilson Recalde    actor             ['tt2333598']    │
│ nm7214355Julian Owen       actor             ['tt3488298']    │
│ nm7215983Zach Ladouceur    actor             ['tt4546288']    │
│ nm7221941Alain Milani      actor             ['tt4548654']    │
│ nm7225536Gary Flanzer      actor             ['tt4521030']    │
│ nm7236543Bastiaan Schreuderactor             ['tt4506254']    │
│ nm7241255Jared Young       actor             ['tt4579992']    │
│ nm7241835Fagyas Alex       actor             ['tt4580414']    │
│                 │
└───────────┴────────────────────┴────────────────────┴──────────────────┘

Unnesting can be useful when joining nested data.

Here we use unnest to find people known for any of the godfather movies:

basics = con.tables.title_basics.filter(
    [
        _.title_type == "movie",
        _.original_title.lower().startswith("the godfather"),
        _.genres.lower().contains("crime"),
    ]
)

known_for_the_godfather = (
    ents.mutate(tconst=_.known_for_titles.unnest())
    .join(basics, "tconst")
    .select("primary_title", "primary_name")
    .distinct()
    .order_by(["primary_title", "primary_name"])
)
known_for_the_godfather
1
Filter the title_basics data set to only the Godfather movies
2
Unnest the known_for_titles array column
3
Join with basics to get movie titles
4
Ensure that each entity is only listed once and sort the results
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ primary_title  primary_name        ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ stringstring              │
├───────────────┼─────────────────────┤
│ The GodfatherA. Emmett Adams     │
│ The GodfatherAbe Vigoda          │
│ The GodfatherAl Lettieri         │
│ The GodfatherAl Martino          │
│ The GodfatherAl Pacino           │
│ The GodfatherAlbert S. Ruddy     │
│ The GodfatherAlex Rocco          │
│ The GodfatherAndrea Eastman      │
│ The GodfatherAngelo Infanti      │
│ The GodfatherAnna Hill Johnstone │
│                    │
└───────────────┴─────────────────────┘

Let’s summarize by showing how many people are known for each Godfather movie:

known_for_the_godfather.primary_title.value_counts()
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ primary_title           primary_title_count ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ stringint64               │
├────────────────────────┼─────────────────────┤
│ The Godfather Part III196 │
│ The Godfather Part II 117 │
│ The Godfather         96 │
└────────────────────────┴─────────────────────┘

Filtering array elements

Filtering array elements can be done with the filter method, which applies a predicate to each array element and returns an array of elements for which the predicate returns True.

This method is similar to Python’s filter function.

Let’s show all people who are neither editors nor actors:

ents.mutate(
    primary_profession=_.primary_profession.filter(
        lambda pp: ~pp.isin(("actor", "editor"))
    )
).filter(_.primary_profession.length() > 0)
1
This filter call is applied to each array element
2
This filter call is applied to the table
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ nconst      primary_name        primary_profession  known_for_titles ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ stringstringarray<string>array<string>    │
├────────────┼────────────────────┼────────────────────┼──────────────────┤
│ nm14701100Karin Roach       ['legal'][]               │
│ nm3709802 Kristin L. Holland['legal'][]               │
│ nm13336378Rok Salazar       ['legal'][]               │
│ nm7514782 Christopher Spicer['legal'][]               │
│ nm11531194J Manuel          ['legal'][]               │
│ nm9114713 Huy Nguyen        ['legal'][]               │
│ nm2230248 Jeffrey Galen     ['legal'][]               │
│ nm8479496 Fatima Amgane     ['legal'][]               │
│ nm2229345 Harold Brown      ['legal'][]               │
│ nm7383201 Ashley Silver     ['legal'][]               │
│                 │
└────────────┴────────────────────┴────────────────────┴──────────────────┘

Applying a function to array elements

You can apply a function to run an ibis expression on each element of an array using the map method.

Let’s normalize the case of primary_profession to upper case:

ents.mutate(
    primary_profession=_.primary_profession.map(lambda pp: pp.upper())
).filter(_.primary_profession.length() > 0)
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ nconst     primary_name       primary_profession  known_for_titles                    ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringstringarray<string>array<string>                       │
├───────────┼───────────────────┼────────────────────┼─────────────────────────────────────┤
│ nm7199328Renzo Castro     ['ACTOR']['tt4623856', 'tt4494580', ... +1]  │
│ nm7199362Pankaj           ['ACTOR']['tt4367318']                       │
│ nm7200119Thibault Péan    ['ACTOR']['tt4534250']                       │
│ nm7203213Tim Goodman      ['ACTOR']['tt2234701']                       │
│ nm7207130Ruupertti Arponen['ACTOR']['tt0185819', 'tt10628202']         │
│ nm7223822Federico Carghini['ACTOR']['tt1556087']                       │
│ nm7232704Ned Jackson      ['ACTOR']['tt0407361', 'tt20453990']         │
│ nm7238980Rasmus Cassanelli['ACTOR']['tt0782510']                       │
│ nm7240979Changyuan Zhou   ['ACTOR']['tt0311913']                       │
│ nm7242332James Dolbeare   ['ACTOR']['tt10151048', 'tt5769738', ... +2] │
│                                    │
└───────────┴───────────────────┴────────────────────┴─────────────────────────────────────┘

Conclusion

Ibis has a sizable collection of array APIs that work with many different backends and as of version 7.0.0, Ibis supports a much larger set of those APIs for BigQuery!

Check out the API documentation for the full set of available methods.

Try it out, and let us know what you think.

Back to top