Taking a random cube for a walk and making it talk

blog
duckdb
udfs
Author

Cody Peterson

Published

September 26, 2024

Synthetic data with Ibis, DuckDB, Python UDFs, and Faker.

To follow along, install the required libraries:

pip install 'ibis-framework[duckdb]' faker plotly

A random cube

We’ll generate a random cube of data with Ibis (default DuckDB backend) and visualize it as a 3D line plot:

Show me the code!
import ibis
import ibis.selectors as s
import plotly.express as px

ibis.options.interactive = True
ibis.options.repr.interactive.max_rows = 5

con = ibis.connect("duckdb://synthetic.ddb")

if "source" in con.list_tables():
    t = con.table("source")
else:
    lookback = ibis.interval(days=1)
    step = ibis.interval(seconds=1)

    t = (
        (
            ibis.range(
                ibis.now() - lookback,
                ibis.now(),
                step=step,
            )
            .unnest()
            .name("timestamp")
            .as_table()
        )
        .mutate(
            index=(ibis.row_number().over(order_by="timestamp")),
            **{col: 2 * (ibis.random() - 0.5) for col in ["a", "b", "c"]},
        )
        .mutate(color=ibis._["index"].histogram(nbins=8))
        .drop("index")
        .relocate("timestamp", "color")
        .order_by("timestamp")
    )

    t = con.create_table("source", t.to_pyarrow())

c = px.line_3d(
    t,
    x="a",
    y="b",
    z="c",
    color="color",
    hover_data=["timestamp"],
)
c
1
Import the necessary libraries.
2
Enable interactive mode for Ibis.
3
Connect to an on-disk DuckDB database.
4
Load the table if it already exists.
5
Define the time range and step for the data.
6
Create the array of timestamps.
7
Unnest the array to a column.
8
Name the column “timestamp”.
9
Convert the column into a table.
10
Create a monotonically increasing index column.
11
Create three columns of random numbers.
12
Create a color column based on the index (help visualize the time series).
13
Drop the index column.
14
Rearrange the columns.
15
Order the table by timestamp.
16
Store the table in the on-disk database.
17
Create a 3D line plot of the data.

Walking

We have a random cube of data:

t
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ timestamp                color  a          b          c         ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
│ timestamp(6)int64float64float64float64   │
├─────────────────────────┼───────┼───────────┼───────────┼───────────┤
│ 2024-07-23 23:35:06.0100-0.837407-0.6817160.692806 │
│ 2024-07-23 23:35:07.01000.307479-0.923701-0.479673 │
│ 2024-07-23 23:35:08.01000.1361950.6045830.078360 │
│ 2024-07-23 23:35:09.0100-0.2618670.9252870.339049 │
│ 2024-07-23 23:35:10.01000.8136230.255287-0.079172 │
│  │
└─────────────────────────┴───────┴───────────┴───────────┴───────────┘

But we need to make it walk. We’ll use a window function to calculate the cumulative sum of each column:

window = ibis.window(order_by="timestamp", preceding=None, following=0)
walked = t.select(
    "timestamp",
    "color",
    a=t["a"].sum().over(window),
    b=t["b"].sum().over(window),
    c=t["c"].sum().over(window),
).order_by("timestamp")
walked
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓
┃ timestamp                color  a          b          c        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩
│ timestamp(6)int64float64float64float64  │
├─────────────────────────┼───────┼───────────┼───────────┼──────────┤
│ 2024-07-23 23:35:06.0100-0.837407-0.6817160.692806 │
│ 2024-07-23 23:35:07.0100-0.529928-1.6054170.213133 │
│ 2024-07-23 23:35:08.0100-0.393733-1.0008340.291492 │
│ 2024-07-23 23:35:09.0100-0.655600-0.0755470.630542 │
│ 2024-07-23 23:35:10.01000.1580240.1797400.551369 │
│  │
└─────────────────────────┴───────┴───────────┴───────────┴──────────┘
window = ibis.window(order_by="timestamp", preceding=None, following=0)
walked = t.select(
    "timestamp",
    "color",
    s.across(
        s.c("a", "b", "c"),
        ibis._.sum().over(window),
    ),
).order_by("timestamp")
walked
1
Alternatively, you can use s.of_type(float) to select all float columns.
2
Use the ibis._ selector to reference a deferred column expression.
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓
┃ timestamp                color  a          b          c        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩
│ timestamp(6)int64float64float64float64  │
├─────────────────────────┼───────┼───────────┼───────────┼──────────┤
│ 2024-07-23 23:35:06.0100-0.837407-0.6817160.692806 │
│ 2024-07-23 23:35:07.0100-0.529928-1.6054170.213133 │
│ 2024-07-23 23:35:08.0100-0.393733-1.0008340.291492 │
│ 2024-07-23 23:35:09.0100-0.655600-0.0755470.630542 │
│ 2024-07-23 23:35:10.01000.1580240.1797400.551369 │
│  │
└─────────────────────────┴───────┴───────────┴───────────┴──────────┘

While the first few rows may look similar to the cube, the 3D line plot does not:

Show me the code!
c = px.line_3d(
    walked,
    x="a",
    y="b",
    z="c",
    color="color",
    hover_data=["timestamp"],
)
c

Talking

We’ve made our random cube and we’ve made it walk, but now we want to make it talk. At this point, you might be questioning the utility of this blog post – what are we doing and why? The purpose is to demonstrate generating synthetic data that can look realistic. We achieve this by building in randomness (e.g. a random walk can be used to simulate stock prices) and also by using that randomness to inform the generation of non-numeric synthetic data (e.g. the ticker symbol of a stock).

Faking it

Let’s demonstrate this concept by pretending we have an application where users can review a location they’re at. The user’s name, comment, location, and device info are stored in our database for their review at a given timestamp.

Faker is a commonly used Python library for generating fake data. We’ll use it to generate fake names, comments, locations, and device info for our reviews:

from faker import Faker

fake = Faker()

res = (
    fake.name(),
    fake.sentence(),
    fake.location_on_land(),
    fake.user_agent(),
    fake.ipv4(),
)
res
('Robyn Foster',
 'Employee security there meeting.',
 ('41.75338', '-86.11084', 'Granger', 'US', 'America/Indiana/Indianapolis'),
 'Mozilla/5.0 (iPod; U; CPU iPhone OS 3_2 like Mac OS X; unm-US) AppleWebKit/533.16.7 (KHTML, like Gecko) Version/3.0.5 Mobile/8B118 Safari/6533.16.7',
 '119.243.96.150')

We can use our random numbers to influence the fake data generation in a Python UDF:

Show me the code!
import ibis.expr.datatypes as dt

from datetime import datetime, timedelta

ibis.options.repr.interactive.max_length = 5

record_schema = dt.Struct(
    {
        "timestamp": datetime,
        "name": str,
        "comment": str,
        "location": list[str],
        "device": dt.Struct(
            {
                "browser": str,
                "ip": str,
            }
        ),
    }
)


@ibis.udf.scalar.python
def faked_batch(
    timestamp: datetime,
    a: float,
    b: float,
    c: float,
    batch_size: int = 8,
) -> dt.Array(record_schema):
    """
    Generate records of fake data.
    """
    value = (a + b + c) / 3

    res = [
        {
            "timestamp": timestamp + timedelta(seconds=0.1 * i),
            "name": fake.name() if value >= 0.5 else fake.first_name(),
            "comment": fake.sentence(),
            "location": fake.location_on_land(),
            "device": {
                "browser": fake.user_agent(),
                "ip": fake.ipv4() if value >= 0 else fake.ipv6(),
            },
        }
        for i in range(batch_size)
    ]

    return res


if "faked" in con.list_tables():
    faked = con.table("faked")
else:
    faked = (
        t.mutate(
            faked=faked_batch(t["timestamp"], t["a"], t["b"], t["c"]),
        )
        .select(
            "a",
            "b",
            "c",
            ibis._["faked"].unnest(),
        )
        .unpack("faked")
        .drop("a", "b", "c")
    )

    faked = con.create_table("faked", faked)

faked
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ timestamp                name    comment                                  location                                                          ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ timestamp(6)stringstringarray<string>                                                     │
├─────────────────────────┼────────┼─────────────────────────────────────────┼───────────────────────────────────────────────────────────────────┤
│ 2024-07-23 23:35:06.010GlendaThan available eye.                    ['13.65805', '102.56365', 'Paoy Paet', 'KH', 'Asia/Phnom_Penh']   │
│ 2024-07-23 23:35:06.110TrevorAbility commercial admit adult he.     ['56.9083', '60.8019', 'Beryozovsky', 'RU', 'Asia/Yekaterinburg'] │
│ 2024-07-23 23:35:06.210Janet Sign fact time against energy.         ['25.66795', '85.83636', 'Dalsingh Sarai', 'IN', 'Asia/Kolkata']  │
│ 2024-07-23 23:35:06.310AngelaHappen Democrat public office whatever.['45.78071', '12.84052', 'Portogruaro', 'IT', 'Europe/Rome']      │
│ 2024-07-23 23:35:06.410Donna Travel none coach crime within lawyer. ['28.15112', '-82.46148', 'Lutz', 'US', 'America/New_York']       │
│                                                                  │
└─────────────────────────┴────────┴─────────────────────────────────────────┴───────────────────────────────────────────────────────────────────┘

And now we have a “realistic” dataset of fake reviews matching our desired schema. You can adjust this to match the schema and expected distributions of your own data and scale it up as needed.

GenAI/LLMs

The names and locations from Faker are bland and unrealistic. The comments are nonsensical. And most importantly, we haven’t filled our quota for blogs mentioning AI. You could use language models in Ibis UDFs to generate more realistic synthetic data. We could use “open source” language models to do this locally for free, an exercise left to the reader.

Next steps

If you’ve followed along, you have a synthetic.ddb file with a couple tables:

con.list_tables()
['faked', 'source']

We can estimate the size of data generated:

import os

size_in_mbs = os.path.getsize("synthetic.ddb") / (1024 * 1024)
print(f"synthetic.ddb: {size_in_mbs:.2f} MBs")
synthetic.ddb: 54.51 MBs

You can build from here to generate realistic synthetic data at any scale for any use case.

Back to top