Ibis goes real-time! Introducing the new Flink backend for Ibis

Introduction

Ibis 8.0 marks the official release of the Apache Flink backend for Ibis. Ibis users can now manipulate data across streaming and batch contexts using the same interface. Flink is one of the most established stream-processing frameworks out there and a central part of the real-time data infrastructure at companies like DoorDash, LinkedIn, Netflix, and Uber. It is commonly applied in use cases such as fraud detection, anomaly detection, real-time recommendation, dynamic pricing, and online advertising. The Flink backend is also the first streaming backend Ibis supports. Follow along as we define and execute a simple streaming job using Ibis!

Installation prerequisites

Docker Compose: This tutorial uses Docker Compose to manage an Apache Kafka environment (including sample data generation) and a Flink cluster (for remote execution). You can download and install Docker Compose from the official website.
JDK 11: Flink requires Java 11. If you don’t already have JDK 11 installed, you can get the appropriate Eclipse Temurin release.
Python: To follow along, you need Python 3.9 or 3.10.

Installing the Flink backend for Ibis

We use a Python client to explore data in Kafka topics. You can install it, alongside the Flink backend for Ibis, with pip, conda, mamba, or pixi:

pip install ibis-framework apache-flink kafka-python

conda install -c conda-forge ibis-flink

mamba install -c conda-forge ibis-flink

pixi add ibis-flink

Spinning up the services using Docker Compose

The ibis-project/realtime-fraud-detection GitHub repository includes the relevant Docker Compose configuration for this tutorial. Clone the repository, and run docker compose up from the cloned directory to create Kafka topics, generate sample data, and launch a Flink cluster:

git clone https://github.com/claypotai/realtime-fraud-detection.git
cd realtime-fraud-detection
docker compose up

Tip

If you don’t intend to try remote execution, you can start only the Kafka-related services with docker compose up kafka init-kafka data-generator.

After a few seconds, you should see messages indicating your Kafka environment is ready:

realtime-fraud-detection-init-kafka-1      | Successfully created the following topics:
realtime-fraud-detection-init-kafka-1      | payment_msg
realtime-fraud-detection-init-kafka-1      | sink
realtime-fraud-detection-init-kafka-1 exited with code 0
realtime-fraud-detection-data-generator-1  | Connected to Kafka
realtime-fraud-detection-data-generator-1  | Producing 20000 records to Kafka topic payment_msg

This example uses mock payments data. The payment_msg Kafka topic contains messages in the following format:

{
    "createTime": "2023-09-20 22:19:02.224",
    "orderId": 1695248388,
    "payAmount": 88694.71922270155,
    "payPlatform": 0,
    "provinceId": 6
}

In a separate terminal, we can explore what these messages look like:

from itertools import islice

from kafka import KafkaConsumer

consumer = KafkaConsumer("payment_msg")
for msg in islice(consumer, 3):
    print(msg)

ConsumerRecord(topic='payment_msg', partition=0, offset=9430, timestamp=1720728902139, timestamp_type=0, key=None, value=b'{"createTime": "2024-07-11 20:15:02.139", "orderId": 1720733590, "payAmount": 21569.762559328065, "payPlatform": 0, "provinceId": 5}', headers=[], checksum=None, serialized_key_size=-1, serialized_value_size=132, serialized_header_size=-1)
ConsumerRecord(topic='payment_msg', partition=0, offset=9431, timestamp=1720728902640, timestamp_type=0, key=None, value=b'{"createTime": "2024-07-11 20:15:02.640", "orderId": 1720733591, "payAmount": 81153.99302227204, "payPlatform": 0, "provinceId": 1}', headers=[], checksum=None, serialized_key_size=-1, serialized_value_size=131, serialized_header_size=-1)
ConsumerRecord(topic='payment_msg', partition=0, offset=9432, timestamp=1720728903144, timestamp_type=0, key=None, value=b'{"createTime": "2024-07-11 20:15:03.144", "orderId": 1720733592, "payAmount": 33284.5361487847, "payPlatform": 0, "provinceId": 3}', headers=[], checksum=None, serialized_key_size=-1, serialized_value_size=130, serialized_header_size=-1)

Running the tutorial

This tutorial uses Ibis with the Flink backend to process the aforementioned payment messages. You can choose to either run it locally or submit a job to an already-running Flink cluster.

Local execution

The simpler option is to run the example using the Flink mini cluster.

Create a table environment

The table environment serves as the main entry point for interacting with the Flink runtime. The flink backend does not create TableEnvironment objects; you must create a TableEnvironment and pass that to ibis.flink.connect:

import ibis
from pyflink.table import EnvironmentSettings, TableEnvironment

env_settings = EnvironmentSettings.in_streaming_mode()
table_env = TableEnvironment.create(env_settings)
table_env.get_config().set("parallelism.default", "1")

con = ibis.flink.connect(table_env)

1: write all the data to one file

Flink’s streaming connectors aren’t part of the binary distribution. Link the Kafka connector for cluster execution by adding the JAR file from the cloned repository. Ibis exposes the raw_sql method for situations like this, where you need to run arbitrary SQL that cannot be modeled as a table expression:

con.raw_sql("ADD JAR 'flink-sql-connector-kafka-3.0.2-1.18.jar'")

Create the source and sink tables

Use create_table to register tables. Notice the new top-level ibis.watermark API for specifying a watermark strategy.

source_schema = ibis.schema(
    {
        "createTime": "timestamp(3)",
        "orderId": "int64",
        "payAmount": "float64",
        "payPlatform": "int32",
        "provinceId": "int32",
    }
)

source_configs = {
    "connector": "kafka",
    "topic": "payment_msg",
    "properties.bootstrap.servers": "localhost:9092",
    "properties.group.id": "test_3",
    "scan.startup.mode": "earliest-offset",
    "format": "json",
}

t = con.create_table(
    "payment_msg",
    schema=source_schema,
    tbl_properties=source_configs,
    watermark=ibis.watermark(
        time_col="createTime", allowed_delay=ibis.interval(seconds=15)
    ),
)

sink_schema = ibis.schema(
    {
        "province_id": "int32",
        "pay_amount": "float64",
    }
)

sink_configs = {
    "connector": "kafka",
    "topic": "sink",
    "properties.bootstrap.servers": "localhost:9092",
    "format": "json",
}

con.create_table(
    "total_amount_by_province_id", schema=sink_schema, tbl_properties=sink_configs
)

1: create source Table
2: create sink Table

DatabaseTable: total_amount_by_province_id
  province_id int32
  pay_amount  float64

Perform calculations

Compute the total pay amount per province in the past 10 seconds (as of each message, for the province in the incoming message):

agged = t.select(
    province_id=t.provinceId,
    pay_amount=t.payAmount.sum().over(
        range=(-ibis.interval(seconds=10), 0),
        group_by=t.provinceId,
        order_by=t.createTime,
    ),
)

Finally, emit the query result to the sink table:

con.insert("total_amount_by_province_id", agged)

<pyflink.table.table_result.TableResult at 0x1695446a0>

Remote execution

You can also submit the example to the remote cluster started using Docker Compose. The window_aggregation.py file in the cloned repository contains the same steps that we performed for local execution. We will use the method described in the official Flink documentation.

Tip

You can find the ./bin/flink executable with the following command:

python -c'from pathlib import Path; import pyflink; print(Path(pyflink.__spec__.origin).parent / "bin" / "flink")'

My full command looks like this:

/opt/miniconda3/envs/ibis-dev/lib/python3.10/site-packages/pyflink/bin/flink run --jobmanager localhost:8081 --python window_aggregation.py

The command will exit after displaying a submission message:

Job has been submitted with JobID b816faaf5ef9126ea5b9b6a37012cf56

Viewing the results

Similar to how we viewed messages in the payment_msg topic, we can print results from the sink topic:

consumer = KafkaConsumer("sink")
for msg in islice(consumer, 10):
    print(msg)

ConsumerRecord(topic='sink', partition=0, offset=0, timestamp=1720728912679, timestamp_type=0, key=None, value=b'{"province_id":2,"pay_amount":16364.037210374616}', headers=[], checksum=None, serialized_key_size=-1, serialized_value_size=49, serialized_header_size=-1)
ConsumerRecord(topic='sink', partition=0, offset=1, timestamp=1720728912682, timestamp_type=0, key=None, value=b'{"province_id":2,"pay_amount":101689.41546504611}', headers=[], checksum=None, serialized_key_size=-1, serialized_value_size=49, serialized_header_size=-1)
ConsumerRecord(topic='sink', partition=0, offset=2, timestamp=1720728912683, timestamp_type=0, key=None, value=b'{"province_id":5,"pay_amount":32552.15122784454}', headers=[], checksum=None, serialized_key_size=-1, serialized_value_size=48, serialized_header_size=-1)
ConsumerRecord(topic='sink', partition=0, offset=3, timestamp=1720728912684, timestamp_type=0, key=None, value=b'{"province_id":5,"pay_amount":64135.92290912925}', headers=[], checksum=None, serialized_key_size=-1, serialized_value_size=48, serialized_header_size=-1)
ConsumerRecord(topic='sink', partition=0, offset=4, timestamp=1720728912684, timestamp_type=0, key=None, value=b'{"province_id":3,"pay_amount":34294.35463951969}', headers=[], checksum=None, serialized_key_size=-1, serialized_value_size=48, serialized_header_size=-1)
ConsumerRecord(topic='sink', partition=0, offset=5, timestamp=1720728912684, timestamp_type=0, key=None, value=b'{"province_id":6,"pay_amount":96762.09000335855}', headers=[], checksum=None, serialized_key_size=-1, serialized_value_size=48, serialized_header_size=-1)
ConsumerRecord(topic='sink', partition=0, offset=6, timestamp=1720728912684, timestamp_type=0, key=None, value=b'{"province_id":6,"pay_amount":154188.41978421973}', headers=[], checksum=None, serialized_key_size=-1, serialized_value_size=49, serialized_header_size=-1)
ConsumerRecord(topic='sink', partition=0, offset=7, timestamp=1720728912684, timestamp_type=0, key=None, value=b'{"province_id":6,"pay_amount":222023.00049557863}', headers=[], checksum=None, serialized_key_size=-1, serialized_value_size=49, serialized_header_size=-1)
ConsumerRecord(topic='sink', partition=0, offset=8, timestamp=1720728912684, timestamp_type=0, key=None, value=b'{"province_id":4,"pay_amount":72968.01696673119}', headers=[], checksum=None, serialized_key_size=-1, serialized_value_size=48, serialized_header_size=-1)
ConsumerRecord(topic='sink', partition=0, offset=9, timestamp=1720728912684, timestamp_type=0, key=None, value=b'{"province_id":0,"pay_amount":81034.40856839989}', headers=[], checksum=None, serialized_key_size=-1, serialized_value_size=48, serialized_header_size=-1)

Voilà! You’ve run your first streaming application using Ibis.

Shutting down the Compose environment

Press Ctrl+C to stop the Docker Compose containers. Once stopped, run docker compose down to remove the services created for this tutorial.