Testing 20 databases on every commit

Phillip Cloud

2024-08-14

Let’s all stand!

Sit if you work with…

  • 0 DBs ✅
  • 1 DB 😇
  • 2 DBs 😬
  • 3+ DBs 😱

I feel your pain.

Who?

Me

  • Phillip Cloud
  • Ibis project
  • Voltron Data
  • Data tools for 10+ years

Ever needed to test a complex system?

Maybe this is you

Or this

Or maybe even this

A complex system: Ibis

What’s Ibis?

  • Python library
  • Exploratory data analysis
  • Data engineering
  • ML preprocessing

dbplyr, but Python

One API, 20+ backends

con = ibis.connect("duckdb://")
t = con.read_parquet("penguins.parquet")
t.group_by("species", "island").agg(count=t.count()).order_by("count")
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓
┃ species    island     count ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩
│ stringstringint64 │
├───────────┼───────────┼───────┤
│ Adelie   Biscoe   44 │
│ Adelie   Torgersen52 │
│ Adelie   Dream    56 │
│ ChinstrapDream    68 │
│ Gentoo   Biscoe   124 │
└───────────┴───────────┴───────┘
con = ibis.connect("polars://")
t = con.read_parquet("penguins.parquet")
t.group_by("species", "island").agg(count=t.count()).order_by("count")
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓
┃ species    island     count ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩
│ stringstringint64 │
├───────────┼───────────┼───────┤
│ Adelie   Biscoe   44 │
│ Adelie   Torgersen52 │
│ Adelie   Dream    56 │
│ ChinstrapDream    68 │
│ Gentoo   Biscoe   124 │
└───────────┴───────────┴───────┘
con = ibis.connect("datafusion://")
t = con.read_parquet("penguins.parquet")
t.group_by("species", "island").agg(count=t.count()).order_by("count")
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓
┃ species    island     count ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩
│ stringstringint64 │
├───────────┼───────────┼───────┤
│ Adelie   Biscoe   44 │
│ Adelie   Torgersen52 │
│ Adelie   Dream    56 │
│ ChinstrapDream    68 │
│ Gentoo   Biscoe   124 │
└───────────┴───────────┴───────┘
con = ibis.connect("pyspark://")
t = con.read_parquet("penguins.parquet")
t.group_by("species", "island").agg(count=t.count()).order_by("count")
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓
┃ species    island     count ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩
│ stringstringint64 │
├───────────┼───────────┼───────┤
│ Adelie   Biscoe   44 │
│ Adelie   Torgersen52 │
│ Adelie   Dream    56 │
│ ChinstrapDream    68 │
│ Gentoo   Biscoe   124 │
└───────────┴───────────┴───────┘

Why is this hard to test?

By the numbers

Backends

  • 17 SQL
  • 3 non-SQL
  • 2 cloud

Engines + APIs

  • 9 distributed SQL
  • 3 dataframe
  • oldest: ~45 years 👀
  • youngest: ~2 years

Other facts

  • Latency is variable
  • Deployment models vary

Feature development

Bit of a pickle

How

High level

Goal: fast iteration

  • fast env setup (dependency management)
  • fast(ish) tests (test-running library)
  • high job concurrency (ci/provider)
  • easy to run: dev speed (just)

CI must complete “quickly”

Tools: overview

  • 📦 deps: poetry
  • 🖥️ ci: GitHub Actions
  • 🦁 “big” backends: docker
  • 🐱 “small” backends: no special tx (duckdb, polars)
  • 🏃 tasks: just (e.g.: just up postgres)

Tools: poetry

  • Env setup must be fast: no constraint solving
  • Poetry is one way; there are others
  • Get yourself a lockfile
  • Downsides?

Are you doing that now

Tools: docker

  • Do you use it locally?
  • Use health checks; “dumb” ones are fine
  • Make it easy for devs to use

Tools: GitHub Actions

  • Pay for the the Teams plan to get more concurrency
  • Automate dependency updates

GHA limits

Ibis CI cost

How does this stack up?

Terminology

Job
a set of commands
my_job:
  - run: pip install ibis-framework
  - run: just ci-check -m ${{ matrix.backend.name }}
  - run: coverage upload
Workflow
A collection of jobs, one .yml file
name: Backends

my_job:
  - run: ...
my_other_job:
  - run: ...

Job metrics

We’ve added 3 or 4 new backends since the switch

Workflow metrics

Queue time and workflow duration

Workflow metrics

Workflow metrics

  • 🟢 Queues + workflows correlated
  • 🟡 Queues slow + workflows fast: not enough concurrency
  • 🟡 Queues fast + workflows slow: jobs doing too much
  • 🔴 Queues slow + workflows slow: hard to say

Summary

  • Testing complex projects is possible
  • Use docker for dev and prod
  • Don’t SAT solve in CI
  • Track CI run durations, workflow metrics
  • Spend time on dev ex

Questions?