cols = {
    c: penguins[c] - penguins[c].mean()
    for c in penguins.columns
    if penguins[c].type().is_numeric() and c != "year"
}
expr = penguins.group_by("species").mutate(**cols).head(5)
expr
ibis.to_sql(expr)
SELECT
  "t0"."species",
  "t0"."island",
  "t0"."bill_length_mm" - AVG("t0"."bill_length_mm") OVER (
    PARTITION BY "t0"."species"
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS "bill_length_mm",
  "t0"."bill_depth_mm" - AVG("t0"."bill_depth_mm") OVER (
    PARTITION BY "t0"."species"
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS "bill_depth_mm",
  "t0"."flipper_length_mm" - AVG("t0"."flipper_length_mm") OVER (
    PARTITION BY "t0"."species"
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS "flipper_length_mm",
  "t0"."body_mass_g" - AVG("t0"."body_mass_g") OVER (
    PARTITION BY "t0"."species"
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS "body_mass_g",
  "t0"."sex",
  "t0"."year"
FROM "penguins" AS "t0"
LIMIT 5
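For intuition, here is a minimal plain-Python sketch of what the grouped `mutate` and its `AVG(...) OVER (PARTITION BY ...)` window compute: each value minus the mean of its own group. The rows and the helper `center_by_group` are made up for illustration and are not part of Ibis. The sketch assumes there are no missing values.

```python
from collections import defaultdict
from statistics import fmean

# Toy rows standing in for the penguins table (values are invented).
rows = [
    {"species": "Adelie", "bill_length_mm": 39.0},
    {"species": "Adelie", "bill_length_mm": 41.0},
    {"species": "Gentoo", "bill_length_mm": 46.0},
    {"species": "Gentoo", "bill_length_mm": 50.0},
]


def center_by_group(rows, key, column):
    """Subtract each group's mean of `column` from every row in that group."""
    # Collect the values of `column` per group (the PARTITION BY step).
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[column])
    # One mean per group (the AVG over the whole partition).
    means = {group: fmean(values) for group, values in groups.items()}
    # Keep every row, replacing the column with its centered value.
    return [{**row, column: row[column] - means[row[key]]} for row in rows]


centered = center_by_group(rows, "species", "bill_length_mm")
```

As with the window function, the row count is unchanged. Only the numeric column is rewritten relative to its group mean.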
…in 2015, I started the Ibis project…to create a pandas-friendly deferred expression system for static analysis and compilation [of] these types of [query planned, multicore execution] operations. Since an efficient multithreaded in-memory engine for pandas was not available when I started Ibis, I instead focused on building compilers for SQL engines (Impala, PostgreSQL, SQLite), similar to the R dplyr package. Phillip Cloud from the pandas core team has been actively working on Ibis with me for quite a long time.
graph BT
classDef white color:white;
%% graph definition
DatabaseTable --> species
DatabaseTable --> bill_length_mm
bill_length_mm --> Mean
species --> Aggregate
Mean --> Aggregate
%% style
class DatabaseTable white;
class species white;
class bill_length_mm white;
class Mean white;
class Aggregate white;
graph BT
classDef white color:white;
DatabaseTable2[DatabaseTable] --> species2[species]
species2 --> bill_length_mm2[bill_length_mm]
bill_length_mm2 --> Mean2[Mean]
Mean2 --> Aggregate2[Aggregate]
%% style
class DatabaseTable2 white;
class species2 white;
class bill_length_mm2 white;
class Mean2 white;
class Aggregate2 white;
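The diagrams above can be read as a small expression tree that is walked to produce SQL. Below is a toy sketch of that idea, not Ibis's actual internals: the classes `DatabaseTable`, `Field`, `Mean`, and `Aggregate` only borrow the node names from the diagram, and `compile_sql` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseTable:
    name: str


@dataclass(frozen=True)
class Field:
    table: DatabaseTable
    name: str


@dataclass(frozen=True)
class Mean:
    arg: Field


@dataclass(frozen=True)
class Aggregate:
    table: DatabaseTable
    by: Field
    metric: Mean


def compile_sql(node):
    """Recursively turn a toy expression node into a SQL fragment."""
    if isinstance(node, (DatabaseTable, Field)):
        return f'"{node.name}"'
    if isinstance(node, Mean):
        return f"AVG({compile_sql(node.arg)})"
    if isinstance(node, Aggregate):
        key = compile_sql(node.by)
        metric = compile_sql(node.metric)
        alias = node.metric.arg.name
        table = compile_sql(node.table)
        return f'SELECT {key}, {metric} AS "{alias}" FROM {table} GROUP BY {key}'
    raise TypeError(f"unsupported node: {node!r}")


# Building the tree runs nothing; it only records what to compute.
penguins = DatabaseTable("penguins")
expr = Aggregate(
    table=penguins,
    by=Field(penguins, "species"),
    metric=Mean(Field(penguins, "bill_length_mm")),
)
sql = compile_sql(expr)
```

Nothing is executed until the tree is compiled, which is what makes the expression "deferred".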
Send to DB via DBAPI: cursor.execute(ibis_generated_sql)
(Heavily) massage the output
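The last two steps can be sketched with the standard-library `sqlite3` module, which follows the DBAPI. This is an illustration under assumptions, not what Ibis does internally: the table contents are invented, and `generated_sql` is a hand-written string standing in for SQL that Ibis would generate.

```python
import sqlite3

# In-memory database with a tiny stand-in for the penguins table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE penguins (species TEXT, bill_length_mm REAL)")
con.executemany(
    "INSERT INTO penguins VALUES (?, ?)",
    [("Adelie", 39.0), ("Adelie", 41.0), ("Gentoo", 46.0), ("Gentoo", 50.0)],
)

# Hand-written placeholder for compiler output (an assumption for this sketch).
generated_sql = (
    'SELECT "species", AVG("bill_length_mm") AS "bill_length_mm" '
    'FROM "penguins" GROUP BY "species" ORDER BY "species"'
)

# Send to the database via DBAPI.
cursor = con.cursor()
cursor.execute(generated_sql)

# Massage the output: pair each raw tuple with the column names
# reported by cursor.description to get one dict per row.
columns = [description[0] for description in cursor.description]
records = [dict(zip(columns, row)) for row in cursor.fetchall()]
con.close()
```

A real backend also has to map database types to in-memory types and build a dataframe, which is where the "heavily" comes from.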
Ibis + Streaming
Growth of streaming
Over 70% of Fortune 500 companies have adopted Kafka
54% of Databricks’ customers are using Spark Structured Streaming
The stream processing market is expected to grow at a compound annual growth rate (CAGR) of 21.5% from 2022 to 2028 (IDC)
Batch and streaming
graph LR
subgraph " "
direction LR
A[data] --> B[batch processing] & C[stream processing] --> D[downstream]
end
In the machine learning world…
graph TB
proddata --> sampled
model --> prodpipeline
subgraph "local env"
sampled[sampled data] --> local[local experimentation]
local <--> iterate
local --> model[finally, we have a production-ready model!]
end
subgraph "prod env"
proddata[production data] --> prodpipeline[production pipelines]
end
In the machine learning world…
graph TB
proddata --> sampled
model -- "code rewrite" --> prodpipeline
linkStyle 1 color:white;
subgraph "local env"
sampled[sampled data] --> local[local experimentation]
local <--> iterate
local --> model[finally, we have a production-ready model!]
end
subgraph "prod env"
proddata[production data] --> prodpipeline[production pipelines]
end