Test 20 databases on every commit @ PyCon 2024

It’s not hyperbole

Phillip Cloud

2024-05-19

Who

Me

  • Phillip Cloud
  • Ibis project
  • Voltron Data
  • Data tools for 10+ years

What

Maybe this is you

Or this

Or maybe even this

Not earth shattering

Overview

  • What we learned about maintenance
  • Day to day of 20+ databases
  • Unique challenges

Tools

  • Docker
  • Packaging
  • CI
  • pytest plugins

Overview of Ibis

Ibis is a Python library for:

  • Exploratory data analysis (EDA)
  • Analytics
  • Data engineering
  • ML preprocessing
  • Building your own DataFrame lib

Dev to prod with the same API

One API, 20+ backends

con = ibis.connect("duckdb://")
t = con.read_parquet("penguins.parquet")
t.head(3)
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie Torgersen39.118.71813750male  2007 │
│ Adelie Torgersen39.517.41863800female2007 │
│ Adelie Torgersen40.318.01953250female2007 │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
t.group_by("species", "island").agg(count=t.count()).order_by("count")
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓
┃ species    island     count ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩
│ stringstringint64 │
├───────────┼───────────┼───────┤
│ Adelie   Biscoe   44 │
│ Adelie   Torgersen52 │
│ Adelie   Dream    56 │
│ ChinstrapDream    68 │
│ Gentoo   Biscoe   124 │
└───────────┴───────────┴───────┘
con = ibis.connect("polars://")
t = con.read_parquet("penguins.parquet")
t.head(3)
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie Torgersen39.118.71813750male  2007 │
│ Adelie Torgersen39.517.41863800female2007 │
│ Adelie Torgersen40.318.01953250female2007 │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
t.group_by("species", "island").agg(count=t.count()).order_by("count")
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓
┃ species    island     count ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩
│ stringstringint64 │
├───────────┼───────────┼───────┤
│ Adelie   Biscoe   44 │
│ Adelie   Torgersen52 │
│ Adelie   Dream    56 │
│ ChinstrapDream    68 │
│ Gentoo   Biscoe   124 │
└───────────┴───────────┴───────┘
con = ibis.connect("datafusion://")
t = con.read_parquet("penguins.parquet")
t.head(3)
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie Torgersen39.118.71813750male  2007 │
│ Adelie Torgersen39.517.41863800female2007 │
│ Adelie Torgersen40.318.01953250female2007 │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
t.group_by("species", "island").agg(count=t.count()).order_by("count")
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓
┃ species    island     count ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩
│ stringstringint64 │
├───────────┼───────────┼───────┤
│ Adelie   Biscoe   44 │
│ Adelie   Torgersen52 │
│ Adelie   Dream    56 │
│ ChinstrapDream    68 │
│ Gentoo   Biscoe   124 │
└───────────┴───────────┴───────┘
con = ibis.connect("pyspark://")
t = con.read_parquet("penguins.parquet")
t.head(3)
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie Torgersen39.118.71813750male  2007 │
│ Adelie Torgersen39.517.41863800female2007 │
│ Adelie Torgersen40.318.01953250female2007 │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
t.group_by("species", "island").agg(count=t.count()).order_by("count")
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓
┃ species    island     count ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩
│ stringstringint64 │
├───────────┼───────────┼───────┤
│ Adelie   Biscoe   44 │
│ Adelie   Torgersen52 │
│ Adelie   Dream    56 │
│ ChinstrapDream    68 │
│ Gentoo   Biscoe   124 │
└───────────┴───────────┴───────┘

How it works

What’s in an Ibis?

By the numbers

Backends

  • 17 SQL
  • 3 non-SQL
  • 2 cloud

Engines + APIs

  • 9 distributed SQL
  • 3 dataframe
  • oldest: ~45 years 👀
  • youngest: ~2 years

Other facts

  • Latency is variable
  • Deployment models vary

Feature development

Bit of a pickle

How

High level

Goal: fast iteration

  • fast env setup (dependency management)
  • fast(ish) tests (test-running library)
  • high job concurrency (ci/provider)
  • easy to run: dev speed (just)

CI must complete “quickly”

Tools: overview

  • 📦 poetry
  • 🖥️ GitHub Actions
  • 🦁 docker
  • 🐕 docker
  • 🐱 no special tx (duckdb, polars)
  • 🏃 task runner (e.g.: just up postgres)

Tools: poetry

Opinions follow

Opinions herein…

  • Env setup needs to be fast: avoid constraint solving
  • Poetry is one way; there are others
  • Get yourself a lockfile
  • Downsides?

Are you doing that now

This plot

 

We’ve added 3 or 4 new backends since the switch

Tools: docker

  • Sure, docker
  • But, do you to use it locally?
  • Use health checks; “dumb” ones are fine
  • Make it easy for devs to use

Tools: GitHub Actions

I don’t work for GitHub

…even though it might seem like it

  • Pay for the the Teams plan to get more concurrency
  • Automate dependency updates

GHA concurrency limits

Ibis CI cost

pytest

Ibis problems

  • Tests run across ALL backends
  • Backends don’t implement the same stuff
  • Need to know when backend passes
  • Answer questions like: “will it ever blend?”

Markers + hooks

@pytest.mark.never("duckdb")  # never gonna happen
@pytest.mark.notyet("impala")  # might happen
@pytest.mark.notimpl("snowflake")  # ibis devs: do some work
def test_soundex():
    ...

def pytest_ignore_collect(...):
    # pytest -m duckdb: don't collect things that aren't marked duckdb
    ...

pytest plugins you may like

pytest-

  • xdist: make this work if you can
  • randomly: break your bogus stateful assumptions
  • repeat: when randomly exposes your assumptions
  • clarity: readable test failure messages
  • snapshot: better than the giant f-string you just wrote

hypothesis 👈 that too, we don’t use it enough

Why pytest-randomly?

Summary

  • Use docker for dev and prod
  • Lock your dependencies (dev only!)
  • Auto update stuff
  • pytest probably has a thing for that
  • Spend time on dev ex
  • Track CI run durations, look at them too

Questions?