ibis-analytics
Analyzing and predicting on 10M+ rows from 4+ sources.
Dataframes first appeared in the S programming language, then evolved into the R programming language.
Then pandas perfected the dataframe in Python…or did it?
Since then, dozens of Python dataframe libraries have come and gone…
The pandas API remains the de facto standard for dataframes in Python (alongside PySpark), but it doesn’t scale.
This leads to data scientists frequently “throwing their work over the wall” to data engineers and ML engineers.
But what if there were a new standard?
From “Apache Arrow and the ‘10 Things I Hate About pandas’” by Wes McKinney:
…in 2015, I started the Ibis project…to create a pandas-friendly deferred expression system for static analysis and compilation [of] these types of [query planned, multicore execution] operations. Since an efficient multithreaded in-memory engine for pandas was not available when I started Ibis, I instead focused on building compilers for SQL engines (Impala, PostgreSQL, SQLite), similar to the R dplyr package. Phillip Cloud from the pandas core team has been actively working on Ibis with me for quite a long time.
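To make the phrase "deferred expression system" concrete, here is a minimal toy sketch in plain Python. It is not Ibis's implementation: the classes `Table`, `GroupBy`, and `CountAgg` and the function `to_sql` are hypothetical names invented for illustration. The expression is only a description of the query; nothing runs until it is compiled to SQL and handed to an engine, which here is SQLite from the standard library (one of the engines named in the quote).

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    """An unbound table: a name only, with no data attached (hypothetical)."""
    name: str

    def group_by(self, keys):
        return GroupBy(self, tuple(keys))


@dataclass(frozen=True)
class GroupBy:
    """A pending grouping; still no computation has happened."""
    table: Table
    keys: tuple

    def count(self, name="count"):
        return CountAgg(self.table, self.keys, name)


@dataclass(frozen=True)
class CountAgg:
    """A deferred 'count rows per group' expression."""
    table: Table
    keys: tuple
    name: str


def to_sql(expr: CountAgg) -> str:
    """Compile the deferred expression into a SQL string (toy compiler)."""
    cols = ", ".join(f'"{k}"' for k in expr.keys)
    return (
        f'SELECT {cols}, COUNT(*) AS "{expr.name}" '
        f'FROM "{expr.table.name}" GROUP BY {cols}'
    )


# Building the expression touches no data.
expr = Table("penguins").group_by(["species", "island"]).count()
sql = to_sql(expr)
print(sql)

# Only now is an engine involved: run the compiled SQL on in-memory SQLite.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE penguins (species TEXT, island TEXT)")
con.executemany(
    "INSERT INTO penguins VALUES (?, ?)",
    [("Adelie", "Torgersen"), ("Adelie", "Torgersen"), ("Gentoo", "Biscoe")],
)
rows = sorted(con.execute(sql).fetchall())
print(rows)
```

Because the expression is plain data, it can be inspected or compiled for a different engine before anything executes, which is the property the quote describes.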
SQL:
Python:
Ibis bridges the gap.
import ibis
con = ibis.duckdb.connect()
penguins = con.table("penguins")
penguins.group_by(["species", "island"]).agg(ibis._.count().name("count"))
DuckDB: an embeddable, zero-dependency, C++ SQL database engine.
import ibis
con = ibis.datafusion.connect()
penguins = con.table("penguins")
penguins.group_by(["species", "island"]).agg(ibis._.count().name("count"))
DataFusion: a Rust SQL query engine.
import ibis
con = ibis.clickhouse.connect()
penguins = con.table("penguins")
penguins.group_by(["species", "island"]).agg(ibis._.count().name("count"))
ClickHouse: a C++ column-oriented database management system.
import ibis
con = ibis.polars.connect()
penguins = con.table("penguins")
penguins.group_by(["species", "island"]).agg(ibis._.count().name("count"))
Polars: a Rust DataFrame library.
import ibis
con = ibis.bigquery.connect()
penguins = con.table("penguins")
penguins.group_by(["species", "island"]).agg(ibis._.count().name("count"))
BigQuery: a serverless, highly scalable, and cost-effective cloud data warehouse.
import ibis
con = ibis.snowflake.connect()
penguins = con.table("penguins")
penguins.group_by(["species", "island"]).agg(ibis._.count().name("count"))
Snowflake: a cloud data platform.
import ibis
con = ibis.oracle.connect()
penguins = con.table("penguins")
penguins.group_by(["species", "island"]).agg(ibis._.count().name("count"))
Oracle: a relational database management system.
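import ibis
from pyspark.sql import SparkSession
session = SparkSession.builder.getOrCreate()  # assumes a working local Spark setup
con = ibis.pyspark.connect(session)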
penguins = con.table("penguins")
penguins.group_by(["species", "island"]).agg(ibis._.count().name("count"))
Apache Spark (PySpark): a unified analytics engine for large-scale data processing.
import ibis
con = ibis.trino.connect()
penguins = con.table("penguins")
penguins.group_by(["species", "island"]).agg(ibis._.count().name("count"))
Trino: a distributed SQL query engine.
New backends are easy to add!*
*usually
Install: