Ibis overview

what

Ibis is a Python frontend for:

  • exploratory data analysis (EDA)
  • analytics
  • data engineering
  • machine learning

demo

ibis-analytics

Analyzing and predicting on 10M+ rows from 4+ sources.

why

dataframe lore

Dataframes first appeared in the S programming language, then evolved into the R programming language.

Then pandas perfected the dataframe in Python…or did it?

Since then, dozens of Python dataframe libraries have come and gone…

The pandas API remains the de facto standard for dataframes in Python (alongside PySpark), but it doesn’t scale.

This leads to data scientists frequently “throwing their work over the wall” to data engineers and ML engineers.

But what if there were a new standard?

Ibis origins

…in 2015, I started the Ibis project…to create a pandas-friendly deferred expression system for static analysis and compilation [of] these types of [query planned, multicore execution] operations. Since an efficient multithreaded in-memory engine for pandas was not available when I started Ibis, I instead focused on building compilers for SQL engines (Impala, PostgreSQL, SQLite), similar to the R dplyr package. Phillip Cloud from the pandas core team has been actively working on Ibis with me for quite a long time.

dataframe history

  • pandas (2008): dataframes in Python
  • Spark (2009): distributed dataframes with PySpark
  • Dask (2014): distributed dataframes with Python
  • dplyr (2014): dataframes in R with SQL-like syntax
  • Ibis (2015): dataframes in Python with SQL-like syntax
  • cuDF (2017): pandas on GPUs
  • Modin (2018): pandas on Ray/Dask
  • Koalas (2019): pandas on Spark
  • Polars (2020): multicore dataframes in Python

two world problem

SQL:

  • databases & tables
  • analytics
  • metrics
  • dashboards

Python:

  • files & dataframes
  • data science
  • statistics
  • notebooks

Ibis bridges the gap.

database history

  • they got faster

DuckDB

import ibis
con = ibis.duckdb.connect()
penguins = con.table("penguins")
penguins.group_by(["species", "island"]).agg(ibis._.count().name("count"))

An embeddable, zero-dependency, C++ SQL database engine.

DataFusion

import ibis
con = ibis.datafusion.connect()
penguins = con.table("penguins")
penguins.group_by(["species", "island"]).agg(ibis._.count().name("count"))

A Rust SQL query engine.

ClickHouse

import ibis
con = ibis.clickhouse.connect()
penguins = con.table("penguins")
penguins.group_by(["species", "island"]).agg(ibis._.count().name("count"))

A C++ column-oriented database management system.

Polars

import ibis
con = ibis.polars.connect()
penguins = con.table("penguins")
penguins.group_by(["species", "island"]).agg(ibis._.count().name("count"))

A Rust DataFrame library.

BigQuery

import ibis
con = ibis.bigquery.connect()
penguins = con.table("penguins")
penguins.group_by(["species", "island"]).agg(ibis._.count().name("count"))

A serverless, highly scalable, and cost-effective cloud data warehouse.

Snowflake

import ibis
con = ibis.snowflake.connect()
penguins = con.table("penguins")
penguins.group_by(["species", "island"]).agg(ibis._.count().name("count"))

A cloud data platform.

Oracle

import ibis
con = ibis.oracle.connect()
penguins = con.table("penguins")
penguins.group_by(["species", "island"]).agg(ibis._.count().name("count"))

A relational database management system.

Spark

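import ibis
from pyspark.sql import SparkSession
session = SparkSession.builder.getOrCreate()  # reuse or start a Spark session
con = ibis.pyspark.connect(session)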
penguins = con.table("penguins")
penguins.group_by(["species", "island"]).agg(ibis._.count().name("count"))

A unified analytics engine for large-scale data processing.

Trino

import ibis
con = ibis.trino.connect()
penguins = con.table("penguins")
penguins.group_by(["species", "island"]).agg(ibis._.count().name("count"))

A distributed SQL query engine.

and more!

  • SQLite
  • PostgreSQL
  • MySQL
  • MSSQL
  • Druid
  • pandas
  • Impala
  • Dask

New backends are easy to add!*

*usually

how

try it out now

Install:

pip install 'ibis-framework[duckdb]'

Then run:

import ibis

ibis.options.interactive = True

t = ibis.examples.penguins.fetch()

t

questions?

the end