2024-07-24
“The Road to Composable Data Systems: Thoughts on the Last 15 Years and the Future” by Wes McKinney:
pandas solved many problems that database systems also solve, but almost no one in the data science ecosystem had the expertise to build a data frame library using database techniques. Eagerly-evaluated APIs (as opposed to “lazy” ones) make it more difficult to do efficient “query” planning and execution. Data interoperability with other systems is always going to be painful…
“The Road to Composable Data Systems: Thoughts on the Last 15 Years and the Future” by Wes McKinney:
…unless faster, more efficient “standards” for interoperability are created.
“The Composable Codex” by Voltron Data:
“The Composable Codex” by Voltron Data:
Efficiency:
Choose your stack:
UI:
Execution engine:
Storage:
Additionally, choose tools for:
Orchestration:
Ingestion:
Visualization:
Dashboarding:
Testing:
CLI:
development to production with the same API
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ female │ 2007 │ │ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ female │ 2007 │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ female │ 2007 │ │ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ female │ 2007 │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ count ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩ │ string │ string │ int64 │ ├───────────┼───────────┼───────┤ │ Adelie │ Biscoe │ 44 │ │ Adelie │ Torgersen │ 52 │ │ Adelie │ Dream │ 56 │ │ Chinstrap │ Dream │ 68 │ │ Gentoo │ Biscoe │ 124 │ └───────────┴───────────┴───────┘
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ female │ 2007 │ │ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ female │ 2007 │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ count ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩ │ string │ string │ int64 │ ├───────────┼───────────┼───────┤ │ Adelie │ Biscoe │ 44 │ │ Adelie │ Torgersen │ 52 │ │ Adelie │ Dream │ 56 │ │ Chinstrap │ Dream │ 68 │ │ Gentoo │ Biscoe │ 124 │ └───────────┴───────────┴───────┘
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ female │ 2007 │ │ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ female │ 2007 │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ count ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩ │ string │ string │ int64 │ ├───────────┼───────────┼───────┤ │ Adelie │ Biscoe │ 44 │ │ Adelie │ Torgersen │ 52 │ │ Adelie │ Dream │ 56 │ │ Chinstrap │ Dream │ 68 │ │ Gentoo │ Biscoe │ 124 │ └───────────┴───────────┴───────┘
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ female │ 2007 │ │ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ female │ 2007 │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ count ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩ │ string │ string │ int64 │ ├───────────┼───────────┼───────┤ │ Adelie │ Biscoe │ 44 │ │ Adelie │ Torgersen │ 52 │ │ Adelie │ Dream │ 56 │ │ Chinstrap │ Dream │ 68 │ │ Gentoo │ Biscoe │ 124 │ └───────────┴───────────┴───────┘
Ibis compiles down to SQL or dataframe code:
For SQL backends, inspect the SQL generated by Ibis:
For SQL backends, mix SQL and Python:
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ count ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩ │ string │ string │ int64 │ ├───────────┼───────────┼───────┤ │ Adelie │ Dream │ 56 │ │ Gentoo │ Biscoe │ 124 │ │ Chinstrap │ Dream │ 68 │ │ Adelie │ Torgersen │ 52 │ │ Adelie │ Biscoe │ 44 │ └───────────┴───────────┴───────┘
For SQL backends, mix SQL and Python:
sql = """
SELECT
species,
island,
count(*) as count
FROM penguins
GROUP BY 1, 2
"""
t.sql(sql).order_by(ibis.desc("count"))
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ count ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩ │ string │ string │ int64 │ ├───────────┼───────────┼───────┤ │ Gentoo │ Biscoe │ 124 │ │ Chinstrap │ Dream │ 68 │ │ Adelie │ Dream │ 56 │ │ Adelie │ Torgersen │ 52 │ │ Adelie │ Biscoe │ 44 │ └───────────┴───────────┴───────┘
ibis-analytics
Analyzing 10M+ rows from 4+ data sources.
Dataframes first appeared in the S
programming language (in 1991!), then evolved into the R
programming language.
Then pandas
perfected the dataframe in Python…or did it?
Since, dozens of Python dataframes libraries have come and gone…
The pandas API remains the de facto standard for dataframes in Python (alongside PySpark), but it doesn’t scale.
This leads to data scientists frequently “throwing their work over the wall” to data engineers and ML engineers.
But what if there were a new standard?
from Apache Arrow and the “10 Things I Hate About pandas” by Wes McKinney
…in 2015, I started the Ibis project…to create a pandas-friendly deferred expression system for static analysis and compilation [of] these types of [query planned, multicore execution] operations. Since an efficient multithreaded in-memory engine for pandas was not available when I started Ibis, I instead focused on building compilers for SQL engines (Impala, PostgreSQL, SQLite), similar to the R dplyr package. Phillip Cloud from the pandas core team has been actively working on Ibis with me for quite a long time.
SQL:
Python:
SQL:
Python:
SQL:
Python:
SQL:
Python:
SQL:
Python:
Python:
SQL:
Ibis bridges the gap.
All Python dataframe libraries that are not Ibis (or SQLFrame) lock you into an execution engine.
Good standards are composable and adopted by competitors.
We see three approaches:
pandas clones:
PySpark clones:
something else:
Ibis brings the best of databases to dataframes.
import ibis
con = ibis.duckdb.connect()
penguins = con.table("penguins")
penguins.group_by(["species", "island"]).agg(penguins.count().name("count"))
An embeddable, zero-dependency, C++ SQL database engine.
import ibis
con = ibis.datafusion.connect()
penguins = con.table("penguins")
penguins.group_by(["species", "island"]).agg(penguins.count().name("count"))
A Rust SQL query engine.
import ibis
con = ibis.clickhouse.connect()
penguins = con.table("penguins")
penguins.group_by(["species", "island"]).agg(penguins.count().name("count"))
A C++ column-oriented database management system.
import ibis
con = ibis.polars.connect()
penguins = con.table("penguins")
penguins.group_by(["species", "island"]).agg(penguins.count().name("count"))
A Rust DataFrame library.
import ibis
con = ibis.bigquery.connect()
penguins = con.table("penguins")
penguins.group_by(["species", "island"]).agg(penguins.count().name("count"))
A serverless, highly scalable, and cost-effective cloud data warehouse.
import ibis
con = ibis.snowflake.connect()
penguins = con.table("penguins")
penguins.group_by(["species", "island"]).agg(penguins.count().name("count"))
A cloud data platform.
import ibis
con = ibis.oracle.connect()
penguins = con.table("penguins")
penguins.group_by(["species", "island"]).agg(penguins.count().name("count"))
A relational database management system.
import ibis
con = ibis.pyspark.connect(session)
penguins = con.table("penguins")
penguins.group_by(["species", "island"]).agg(penguins.count().name("count"))
A unified analytics engine for large-scale data processing.
import ibis
con = ibis.trino.connect()
penguins = con.table("penguins")
penguins.group_by(["species", "island"]).agg(penguins.count().name("count"))
A distributed SQL query engine.
New backends are easy to add!*
*usually
Install:
Then run:
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃ ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │ ├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤ │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │ │ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ female │ 2007 │ │ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ female │ 2007 │ │ Adelie │ Torgersen │ NULL │ NULL │ NULL │ NULL │ NULL │ 2007 │ │ Adelie │ Torgersen │ 36.7 │ 19.3 │ 193 │ 3450 │ female │ 2007 │ │ Adelie │ Torgersen │ 39.3 │ 20.6 │ 190 │ 3650 │ male │ 2007 │ │ Adelie │ Torgersen │ 38.9 │ 17.8 │ 181 │ 3625 │ female │ 2007 │ │ Adelie │ Torgersen │ 39.2 │ 19.6 │ 195 │ 4675 │ male │ 2007 │ │ Adelie │ Torgersen │ 34.1 │ 18.1 │ 193 │ 3475 │ NULL │ 2007 │ │ Adelie │ Torgersen │ 42.0 │ 20.2 │ 190 │ 4250 │ NULL │ 2007 │ │ … │ … │ … │ … │ … │ … │ … │ … │ └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘