Using DuckDB + Ibis for RAG

blog
llms
duckdb
Author

Cody Peterson

Published

February 22, 2024

Overview

In this post, we’ll demonstrate traditional retrieval-augmented generation (RAG) with DuckDB and OpenAI via Ibis and discuss the pros and cons. Notice that because Ibis is portable, you can use (almost) any database Ibis supports for RAG!

Which databases can I use?

The database must support array types and have some form of similarity metric between arrays of numbers. Alternatively, a custom user-defined function (UDF) can be used for the similarity metric.

The performance of calculating similarity will also be much faster if the database supports fixed-sized arrays as DuckDB recently launched in version 0.10.0. We’re still using 0.9.2 in this post, but it’ll be easy to upgrade.

DuckDB is the default backend for Ibis and makes rapid prototyping locally very easy. You can deploy it to MotherDuck just by changing the connection string, or swap out DuckDB for another Ibis backend.

What is RAG?

Retrieval-augmented generation (RAG) is a technique for language model systems to retrieve relevant information from a knowledge base, augment the model’s input with that knowledge, and then generate a response. It serves to provide useful context given language models’ context length limitations and increase the quality of the response.

Is RAG dead?

With increasing language model context lengths, is RAG dead?

No. Though as we’ll discuss later, it’s unclear if traditional vector similarity search RAG is the best approach. Regardless, just because a language model can use the entire content of a textbook doesn’t mean you’ll get better performance (accuracy or efficiency) than first retrieving the relevant subsets of the textbook.

Traditional RAG embeds text into vectors of numbers and then computes a similarity metric between input text and the knowledge base to determine the most relevant context.

graph LR
    style T fill:#1E90FF,stroke:#1E90FF,color:#ffffff
    style N fill:#1E90FF,stroke:#1E90FF,color:#ffffff
    style C fill:#1E90FF,stroke:#1E90FF,color:#ffffff

    T("text") --> N("numbers (embedding vector)")
    N --> C("compute similarity metric")

Let’s jump into the code!

RAG over Ibis documentation

We’ll use the Ibis documentation as our knowledge base.

Setup code

First we need to install and import the necessary packages:

pip install 'ibis-framework[duckdb]' openai python-dotenv
import os
import ibis

from openai import OpenAI
from dotenv import load_dotenv
from pathlib import Path
1
Import the libraries we’ll use.

We need to set the OPENAI_API_KEY environment variable to our OpenAI API key. Do this by creating a .env file in the same directory as the code and adding the following line:

OPENAI_API_KEY=<your-api-key>
1
Replace with your OpenAI API key.

Then load the environment variables:

load_dotenv()
1
Load environment variables from the .env file.
True

Now we’re ready to make OpenAI calls.

Let’s also setup Ibis for interactive use:

ibis.options.interactive = True
1
Turn on Ibis interactive mode.

Getting the data into DuckDB

Let’s prepare to ingest all the Quarto markdown files into a DuckDB database called rag.ddb:

rag_con = ibis.connect("duckdb://rag.ddb")

for table in rag_con.list_tables():
    rag_con.drop_table(table)

rag_con.list_tables()
1
Create the Ibis connection to the DuckDB database.
2
Drop any existing tables from previous runs.
3
List the tables in the database to show that there are none.
[]

We’ll walk through the Ibis docs/ directory. For .qmd files, read the contents and insert them into the ibis table:

Tip

We are running this blog within the Ibis repository. To run this on your own data, adjust the code accordingly!

table_name = "ibis"

for filepath in Path(ibis.__file__).parents[1].joinpath("docs").rglob("*.qmd"):
    contents = filepath.read_text()

    data = {
        "filepath": [str(filepath).split("ibis/")[1]],
        "contents": [contents],
    }

    t = ibis.memtable(data)

    if "ibis" not in rag_con.list_tables():
        rag_con.create_table("ibis", t)
    else:
        rag_con.insert("ibis", t)

for table in rag_con.list_tables():
    if "memtable" in table:
        rag_con.drop_view(table)

rag_con.list_tables()
1
Define the table name.
2
Walk through the docs/ directory and find all .qmd files.
3
Read the contents of each file.
4
Create a dictionary with the file path and contents.
5
Create an Ibis memtable from the dictionary.
6
If the table doesn’t exist, create it and insert the data.
7
Otherwise, insert the data into the existing table.
8
Drop any memtables that were created.
9
List the tables in the database to show that the ibis table was created.
['ibis']

Let’s inspect the table:

t = rag_con.table(table_name)
t
1
Get the ibis table from the DuckDB database.
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ filepath                                contents                                                                                  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringstring                                                                                    │
├────────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ docs/why.qmd                          ---\ntitle: "Why Ibis?"\n---\n\nIbis is the portable Python dataframe library.\n\nIf …    │
│ docs/404.qmd                          # Page not found\n\nThe page you requested cannot be found (perhaps it was moved …        │
│ docs/support_matrix.qmd               ---\ntitle: "Operation support matrix"\nformat: dashboard\nhide:\n  - toc\n---\n\n## {…   │
│ docs/index.qmd                        ---\ntitle: "Ibis"\ndescription: "the portable Python dataframe library"\nrepo-act…       │
│ docs/install.qmd                      ---\nhide:\n  - toc\n  - navigation\n  - footer\n---\n\n# Installation\n\n{{< include ./… │
│ docs/getting-started.qmd              ---\ntitle: "Getting started"\nsidebar: getting-started\n---\n\nWelcome to the Ibis …     │
│ docs/posts.qmd                        ---\ntitle: "Posts"\nlisting:\n  contents: posts\n  sort: "date desc"\n  image-place…     │
│ docs/presentations.qmd                ---\ntitle: "Presentations"\nsidebar: presentations\n---\n\nIbis presentations are l…     │
│ docs/posts/bigquery-arrays/index.qmd  ---\ntitle: Working with arrays in Google BigQuery\nauthor: "Phillip Cloud"\ndate:…       │
│ docs/posts/website-to-quarto/index.qmd---\ntitle: "Migrating Ibis documentation to Quarto"\nauthor: "Cody"\ndate: "2023-…       │
│                                                                                          │
└────────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────┘

Now we have a DuckDB database with the ibis table containing all the file paths and contents of the Quarto markdown files in the ibis repository. You can adjust this code to work with any repository and any file type(s)! For instance, we may also want to ingest all the .py files in the Ibis repsoitory.

Embedding text to vectors

To search for similar text in traditional RAG fashion, we need to embed the text into vectors. We can use OpenAI’s embedding models to perform this step.

Note that the maximum length of input tokens is 8,191. A word typically consists of 1-3 tokens. To estimate the number of tokens of a string in Python, we can divide the length of that string by 4. This isn’t perfect, but will work for now. Let’s augmented our data with the estimated number of tokens and sort from largest to smallest:

t = (
    t.mutate(tokens_estimate=t["contents"].length() // 4)
    .order_by(ibis._["tokens_estimate"].desc())
    .relocate("filepath", "tokens_estimate")
)
t
1
Augment the table with the estimated number of tokens.
2
Sort the table by the estimated number of tokens in descending order.
3
Relocate the filepath and tokens_estimate columns to the front of the table.
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ filepath                                       tokens_estimate  contents                                                                             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringint64string                                                                               │
├───────────────────────────────────────────────┼─────────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
│ docs/posts/ibis-analytics/index.qmd          14401---\ntitle: Modern, hybrid, open analytics\nauthor: "Cody"\ndate: "2024-01-25"\nima… │
│ docs/backends/impala.qmd                     7677# Impala\n\n[https://impala.apache.org](https://impala.apache.org)\n\n![](https://i… │
│ docs/tutorials/ibis-for-sql-users.qmd        6777# Tutorial: Ibis for SQL users\n\n## Prerequisites\n\n{{< include ../_tabsets/insta… │
│ docs/posts/lms-for-data/index.qmd            5508---\ntitle: "Using language models for data"\nauthor: "Cody Peterson"\ndate: "2024…  │
│ docs/posts/pydata-performance/index.qmd      4801---\ntitle: "Ibis versus X: Performance across the ecosystem part 1"\nauthor: "Ph…   │
│ docs/tutorials/ibis-for-pandas-users.qmd     4455# Tutorial: Ibis for pandas users\n\n## Prerequisites\n\n{{< include ../_tabsets/in… │
│ docs/posts/campaign-finance/index.qmd        4241---\ntitle: "Exploring campaign finance data"\nauthor: "Nick Crews"\ndate: "2023-0…  │
│ docs/posts/duckdb-for-rag/index.qmd          3951---\ntitle: "Using DuckDB + Ibis for RAG"\nauthor: "Cody Peterson"\ndate: "2024-02…  │
│ docs/posts/backend-agnostic-arrays/index.qmd 3471---\ntitle: Backend agnostic arrays\nauthor: "Phillip Cloud"\ndate: 2024-01-19\ncat… │
│ docs/posts/ibis_substrait_to_duckdb/index.qmd3323---\ntitle: "Ibis + Substrait + DuckDB"\nauthor: Gil Forsyth\ndate: 2023-02-01\ncat… │
│                                                                                     │
└───────────────────────────────────────────────┴─────────────────┴──────────────────────────────────────────────────────────────────────────────────────┘

The longest text is certainly over our embedding model’s token limit, and the second longest might be as well. In practice, we’d want to chunk our text into smaller pieces. This is its own challenge, but for the purposes of this demonstration we’ll just ignore text that is too long.

Let’s define our embedding functions:

def _embed(text: str) -> list[float]:
    """Text to fixed-length array embedding."""

    model = "text-embedding-3-small"
    text = text.replace("\n", " ")
    client = OpenAI()

    try:
        return (
            client.embeddings.create(input=[text], model=model).data[0].embedding
        )
    except Exception as e:
        print(e)
    return None


@ibis.udf.scalar.python
def embed(text: str, tokens_estimate: int) -> list[float]:
    """Text to fixed-length array embedding."""

    if 0 < tokens_estimate < 8191:
        return _embed(text)
    return None
1
Define a regular Python function to embed text.
2
Define the embedding model to use.
3
Replace newlines with spaces.
4
Create an OpenAI client.
5
Since it’s an external API call, we need to handle exceptions.
6
Call the OpenAI API to embed the text.
7
Handle exceptions and return None if an error occurs.
8
Define an Ibis UDF using the Python function.
9
If the estimated number of tokens is within the limit…
10
…call the Python function to embed the text.
11
Otherwise, return None.

Let’s embed the text:

t = t.mutate(
    embedding=embed(t["contents"], t["tokens_estimate"])
).cache()
t
1
Augment the table with the embedded text and cache the result.
Error code: 400 - {'error': {'message': "This model's maximum context length is 8192 tokens, however you requested 8413 tokens (8413 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.", 'type': 'invalid_request_error', 'param': None, 'code': None}}
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ filepath                                       tokens_estimate  contents                                                                              embedding                                                   ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringint64stringarray<float64>                                              │
├───────────────────────────────────────────────┼─────────────────┼──────────────────────────────────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────────────┤
│ docs/posts/ibis-analytics/index.qmd          14401---\ntitle: Modern, hybrid, open analytics\nauthor: "Cody"\ndate: "2024-01-25"\nima…NULL                                                        │
│ docs/backends/impala.qmd                     7677# Impala\n\n[https://impala.apache.org](https://impala.apache.org)\n\n![](https://i…NULL                                                        │
│ docs/tutorials/ibis-for-sql-users.qmd        6777# Tutorial: Ibis for SQL users\n\n## Prerequisites\n\n{{< include ../_tabsets/insta…[-0.032762881368398666, 0.03934009373188019, ... +1534]     │
│ docs/posts/lms-for-data/index.qmd            5508---\ntitle: "Using language models for data"\nauthor: "Cody Peterson"\ndate: "2024… [-0.019190069288015366, 0.041595082730054855, ... +1534]    │
│ docs/posts/pydata-performance/index.qmd      4801---\ntitle: "Ibis versus X: Performance across the ecosystem part 1"\nauthor: "Ph…  [0.00992082990705967, 0.00490616587921977, ... +1534]       │
│ docs/tutorials/ibis-for-pandas-users.qmd     4455# Tutorial: Ibis for pandas users\n\n## Prerequisites\n\n{{< include ../_tabsets/in…[-0.007494321092963219, 0.028004156425595284, ... +1534]    │
│ docs/posts/campaign-finance/index.qmd        4241---\ntitle: "Exploring campaign finance data"\nauthor: "Nick Crews"\ndate: "2023-0… [-0.01551054883748293, 0.01505258958786726, ... +1534]      │
│ docs/posts/duckdb-for-rag/index.qmd          3951---\ntitle: "Using DuckDB + Ibis for RAG"\nauthor: "Cody Peterson"\ndate: "2024-02… [-0.018623681738972664, 0.05166473984718323, ... +1534]     │
│ docs/posts/backend-agnostic-arrays/index.qmd 3471---\ntitle: Backend agnostic arrays\nauthor: "Phillip Cloud"\ndate: 2024-01-19\ncat…[-0.0014422409003600478, -0.0041314102709293365, ... +1534] │
│ docs/posts/ibis_substrait_to_duckdb/index.qmd3323---\ntitle: "Ibis + Substrait + DuckDB"\nauthor: Gil Forsyth\ndate: 2023-02-01\ncat…[-0.02641780860722065, 0.06054646894335747, ... +1534]      │
│                                                            │
└───────────────────────────────────────────────┴─────────────────┴──────────────────────────────────────────────────────────────────────────────────────┴─────────────────────────────────────────────────────────────┘

Notice there was one error – that second row’s content was indeed too long! Our token estimate is a bit off, but overall useful. We accounted for this possibility in our Python UDF above, returning None (mapped to NULL in DuckDB) when the text is too long or an error occurs.

Searching with cosine similarity

Cosine similarity over vectors is a common way to measure similarity for RAG applications. We’ll explore its downsides later. We need to make Ibis aware of DuckDB’s built-in list_cosine_similarity function:

@ibis.udf.scalar.builtin
def list_cosine_similarity(x, y) -> float:
    """Compute cosine similarity between two vectors."""
1
Use an Ibis builtin UDF to define a cosine similarity function.
Tip

If using a different backend, you’ll need to use its corresponding cosine similarity function or define your own.

Now we can search for similar text in the documentation:

def search_docs(text):
    """Search documentation for similar text, returning a sorted table"""

    embedding = _embed(text)

    s = (
        t.mutate(similarity=list_cosine_similarity(t["embedding"], embedding))
        .relocate("similarity")
        .order_by(ibis._["similarity"].desc())
        .cache()
    )

    return s
1
Define a function to search the documentation.
2
Embed the input text.
3
Augment the table with the cosine similarity between the embedded text and the input text.
4
Relocate the similarity column to the front of the table.
5
Sort the table by the cosine similarity in descending order.
6
Cache the result.
text = "where can I chat with the community about Ibis?"
search_docs(text)
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ similarity  filepath                                                      tokens_estimate  contents                                                                                embedding                                                 ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ float64stringint64stringarray<float64>                                            │
├────────────┼──────────────────────────────────────────────────────────────┼─────────────────┼────────────────────────────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
│   0.661852docs/posts/zulip/index.qmd                                  440---\ntitle: "Announcing Zulip for Ibis community chat"\nauthor: "Ibis team"\ndate:…   [0.00033197790617123246, 0.05603647977113724, ... +1534]  │
│   0.578235docs/concepts/who.qmd                                       800# Who supports Ibis?\n\nIbis is an open-source project that welcomes contribution…    [-0.006300679408013821, 0.05288177356123924, ... +1534]   │
│   0.569246docs/why.qmd                                                731---\ntitle: "Why Ibis?"\n---\n\nIbis is the portable Python dataframe library.\n\nIf …[-0.00048549860366620123, 0.05761361122131348, ... +1534] │
│   0.560286docs/concepts/user-testimonials.qmd                         1079# User testimonials\n\nThis page collects user testimonials about Ibis from the c…    [-0.013335400260984898, 0.034098923206329346, ... +1534]  │
│   0.551246docs/getting-started.qmd                                    157---\ntitle: "Getting started"\nsidebar: getting-started\n---\n\nWelcome to the Ibis … [-0.035996999591588974, 0.03943954035639763, ... +1534]   │
│   0.527626docs/contribute/02_workflow.qmd                             574# Contribute to the Ibis codebase\n\n## Getting started\n\nFirst, set up a [develop…  [-0.01897418312728405, 0.05570495128631592, ... +1534]    │
│   0.522431docs/posts/ibis-version-4.0.0-release/index.qmd             743---\ntitle: Ibis v4.0.0\nauthor: Patrick Clarke\ndate: 2023-01-09\ncategories:\n[-0.017321649938821793, 0.05772090330719948, ... +1534]   │
│   0.519414docs/posts/Ibis-version-3.0.0-release/index.qmd             2038---\ntitle: Ibis v3.0.0\nauthor: Marlene Mhangami\ndate: 2022-04-25\ncategories:\n[-0.02136301063001156, 0.03890605643391609, ... +1534]    │
│   0.517491docs/tutorials/open-source-software/apache-flink/0_setup.qmd425# Getting started\n\nIn this tutorial, you will learn how to set up and use Flink…    [-0.049202971160411835, 0.037027861922979355, ... +1534]  │
│   0.510845docs/_code/setup_penguins.qmd                               85```{python}\nimport ibis  # <1>\nimport ibis.selectors as s  # <1>\n\nibis.options.…  [-0.009995625354349613, 0.014497287571430206, ... +1534]  │
│                                                                   │
└────────────┴──────────────────────────────────────────────────────────────┴─────────────────┴────────────────────────────────────────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────┘

Now that we have retrieved the most similar documentation, we can augment our language model’s input with that context prior to generating a response! In practice, we’d probably want to set a similarity threshold and take the top N results. Chunking our text into smaller pieces and selecting from those results would also be a good idea.

Let’s try a few more queries:

text = "what do users say about Ibis?"
search_docs(text)
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ similarity  filepath                                         tokens_estimate  contents                                                                                embedding                                                 ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ float64stringint64stringarray<float64>                                            │
├────────────┼─────────────────────────────────────────────────┼─────────────────┼────────────────────────────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
│   0.574682docs/concepts/user-testimonials.qmd            1079# User testimonials\n\nThis page collects user testimonials about Ibis from the c…    [-0.013335400260984898, 0.034098923206329346, ... +1534]  │
│   0.562258docs/why.qmd                                   731---\ntitle: "Why Ibis?"\n---\n\nIbis is the portable Python dataframe library.\n\nIf …[-0.00048549860366620123, 0.05761361122131348, ... +1534] │
│   0.544948docs/concepts/who.qmd                          800# Who supports Ibis?\n\nIbis is an open-source project that welcomes contribution…    [-0.006300679408013821, 0.05288177356123924, ... +1534]   │
│   0.528427docs/posts/zulip/index.qmd                     440---\ntitle: "Announcing Zulip for Ibis community chat"\nauthor: "Ibis team"\ndate:…   [0.00033197790617123246, 0.05603647977113724, ... +1534]  │
│   0.523075docs/getting-started.qmd                       157---\ntitle: "Getting started"\nsidebar: getting-started\n---\n\nWelcome to the Ibis … [-0.035996999591588974, 0.03943954035639763, ... +1534]   │
│   0.516679docs/posts/ibis-version-4.0.0-release/index.qmd743---\ntitle: Ibis v4.0.0\nauthor: Patrick Clarke\ndate: 2023-01-09\ncategories:\n[-0.017321649938821793, 0.05772090330719948, ... +1534]   │
│   0.500733docs/_tabsets/install_default.qmd              127We recommend starting with the default (DuckDB) backend for a performant, fully…      [-0.004252271726727486, 0.025945203378796577, ... +1534]  │
│   0.497463docs/how-to/configure/basics.qmd               611# Basic configuration\n\nIf you don't have your own data, you can load example da…    [-0.02095085009932518, 0.04434989020228386, ... +1534]    │
│   0.491889docs/posts/ibis-to-file/index.qmd              700---\ntitle: "Ibis sneak peek: writing to files"\nauthor: Kae Suarez\ndate: 2023-03…   [0.02339361608028412, 0.07014117389917374, ... +1534]     │
│   0.484358docs/_code/setup_penguins.qmd                  85```{python}\nimport ibis  # <1>\nimport ibis.selectors as s  # <1>\n\nibis.options.…  [-0.009995625354349613, 0.014497287571430206, ... +1534]  │
│                                                                   │
└────────────┴─────────────────────────────────────────────────┴─────────────────┴────────────────────────────────────────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────┘
text = "can Ibis complete the one billion row challenge?"
search_docs(text)
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ similarity  filepath                                         tokens_estimate  contents                                                                                embedding                                                 ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ float64stringint64stringarray<float64>                                            │
├────────────┼─────────────────────────────────────────────────┼─────────────────┼────────────────────────────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
│   0.608117docs/posts/1brc/index.qmd                      2129---\ntitle: "Using one Python dataframe API to take the billion row challenge wi…     [0.013509240001440048, -0.0003828565531875938, ... +1534] │
│   0.504276docs/concepts/who.qmd                          800# Who supports Ibis?\n\nIbis is an open-source project that welcomes contribution…    [-0.006300679408013821, 0.05288177356123924, ... +1534]   │
│   0.482052docs/presentations/overview.qmd                1697---\ntitle: "Ibis overview"\nformat:\n  revealjs:\n    footer: <https://ibis-projec…  [-0.0015606271335855126, 0.019388146698474884, ... +1534] │
│   0.473564docs/why.qmd                                   731---\ntitle: "Why Ibis?"\n---\n\nIbis is the portable Python dataframe library.\n\nIf …[-0.00048549860366620123, 0.05761361122131348, ... +1534] │
│   0.468842docs/posts/Ibis-version-3.0.0-release/index.qmd2038---\ntitle: Ibis v3.0.0\nauthor: Marlene Mhangami\ndate: 2022-04-25\ncategories:\n[-0.02136301063001156, 0.03890605643391609, ... +1534]    │
│   0.465956docs/tutorials/data-platforms/clickhouse.qmd   826---\ntitle: ClickHouse\nfreeze: auto\n---\n\n[Ibis](https://ibis-project.com) suppor… [-0.027811484411358833, 0.02419976517558098, ... +1534]   │
│   0.465920docs/concepts/user-testimonials.qmd            1079# User testimonials\n\nThis page collects user testimonials about Ibis from the c…    [-0.013335400260984898, 0.034098923206329346, ... +1534]  │
│   0.456237docs/_code/setup_penguins.qmd                  85```{python}\nimport ibis  # <1>\nimport ibis.selectors as s  # <1>\n\nibis.options.…  [-0.009995625354349613, 0.014497287571430206, ... +1534]  │
│   0.455206docs/how-to/visualization/seaborn.qmd          159# seaborn + Ibis\n\nIf you don't have data to visualize, you can load an example …    [-0.012414666824042797, 0.012007818557322025, ... +1534]  │
│   0.453313docs/how-to/input-output/duckdb-parquet.qmd    651---\ntitle: Read parquet files with Ibis\n---\n\nIn this example, we will use Ibis'…  [0.01082119531929493, 0.04467007517814636, ... +1534]     │
│                                                                   │
└────────────┴─────────────────────────────────────────────────┴─────────────────┴────────────────────────────────────────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────┘
text = "teach me about the Polars backend"
search_docs(text)
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ similarity  filepath                                         tokens_estimate  contents                                                                                embedding                                                 ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ float64stringint64stringarray<float64>                                            │
├────────────┼─────────────────────────────────────────────────┼─────────────────┼────────────────────────────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
│   0.597506docs/backends/polars.qmd                       402# Polars\n\n[https://www.pola.rs](https://www.pola.rs)\n\n{{< include /_callouts/ex…  [-0.04319469630718231, 0.030947020277380943, ... +1534]   │
│   0.396795docs/posts/pydata-performance-part2/index.qmd  1567---\ntitle: "Ibis versus X: Performance across the ecosystem part 2"\nauthor: "Ph…    [-0.0013977460330352187, 0.028734903782606125, ... +1534] │
│   0.364785docs/posts/ibis-version-4.0.0-release/index.qmd743---\ntitle: Ibis v4.0.0\nauthor: Patrick Clarke\ndate: 2023-01-09\ncategories:\n[-0.017321649938821793, 0.05772090330719948, ... +1534]   │
│   0.356085docs/_callouts/experimental_backend.qmd        26::: {.callout-warning}\nThis backend is experimental and is subject to backwards…     [0.017363416031003, 0.03505472466349602, ... +1534]       │
│   0.326378docs/how-to/input-output/multiple-backends.qmd 331# Work with multiple backends\n\nYou can work with multiple backends by creating …    [-0.014367456547915936, 0.01823212020099163, ... +1534]   │
│   0.324941docs/presentations/overview.qmd                1697---\ntitle: "Ibis overview"\nformat:\n  revealjs:\n    footer: <https://ibis-projec…  [-0.0015606271335855126, 0.019388146698474884, ... +1534] │
│   0.315046docs/backends/_templates/api.qmd               121```{python}\n#| echo: false\n#| output: asis\n\nfrom _utils import get_backend, ren…  [-0.013808815740048885, 0.03160477802157402, ... +1534]   │
│   0.311201docs/backends/postgresql.qmd                   493# PostgreSQL\n\n[https://www.postgresql.org](https://www.postgresql.org)\n\n![](htt…  [-0.046312808990478516, 0.031841542571783066, ... +1534]  │
│   0.300936docs/posts/ibis-version-8.0.0-release/index.qmd2062---\ntitle: "Ibis 8.0: streaming and more!"\nauthor: "Ibis team"\ndate: "2024-02-1…   [0.005049674306064844, 0.03530928120017052, ... +1534]    │
│   0.298417docs/why.qmd                                   731---\ntitle: "Why Ibis?"\n---\n\nIbis is the portable Python dataframe library.\n\nIf …[-0.00048549860366620123, 0.05761361122131348, ... +1534] │
│                                                                   │
└────────────┴─────────────────────────────────────────────────┴─────────────────┴────────────────────────────────────────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────┘
text = "why does Voltron Data support Ibis?"
search_docs(text)
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ similarity  filepath                                         tokens_estimate  contents                                                                                embedding                                                 ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ float64stringint64stringarray<float64>                                            │
├────────────┼─────────────────────────────────────────────────┼─────────────────┼────────────────────────────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
│   0.679104docs/concepts/who.qmd                          800# Who supports Ibis?\n\nIbis is an open-source project that welcomes contribution…    [-0.006300679408013821, 0.05288177356123924, ... +1534]   │
│   0.616018docs/posts/why-voda-supports-                  3056---\ntitle: "Why Voltron Data supports Ibis"\nauthor: "Cody Peterson + Ian Cook"\n[-0.0027232086285948753, 0.0460839718580246, ... +1534]   │
│   0.582328docs/why.qmd                                   731---\ntitle: "Why Ibis?"\n---\n\nIbis is the portable Python dataframe library.\n\nIf …[-0.00048549860366620123, 0.05761361122131348, ... +1534] │
│   0.536979docs/_code/input_output_penguins.qmd           947## Data platforms\n\nYou can connect Ibis to any supported backend to read and wr…    [-0.009215430356562138, 0.020675163716077805, ... +1534]  │
│   0.528810docs/concepts/user-testimonials.qmd            1079# User testimonials\n\nThis page collects user testimonials about Ibis from the c…    [-0.013335400260984898, 0.034098923206329346, ... +1534]  │
│   0.523802docs/presentations/overview.qmd                1697---\ntitle: "Ibis overview"\nformat:\n  revealjs:\n    footer: <https://ibis-projec…  [-0.0015606271335855126, 0.019388146698474884, ... +1534] │
│   0.516846docs/posts/ibis-version-4.0.0-release/index.qmd743---\ntitle: Ibis v4.0.0\nauthor: Patrick Clarke\ndate: 2023-01-09\ncategories:\n[-0.017321649938821793, 0.05772090330719948, ... +1534]   │
│   0.515140docs/concepts/composable-ecosystem.qmd         1490# Composable data ecosystem\n\nIbis exists in a broader composable data ecosystem…    [-0.0018551592947915196, 0.0542871318757534, ... +1534]   │
│   0.513805docs/backends/trino.qmd                        743# Trino\n\n[https://trino.io](https://trino.io)\n\n{{< include /_callouts/experimen…  [-0.04838176444172859, 0.04063660651445389, ... +1534]    │
│   0.500496docs/how-to/input-output/basics.qmd            105# Basic input/output\n\nIf you don't have your own data, you can load example dat…    [-0.00725388852879405, 0.047017041593790054, ... +1534]   │
│                                                                   │
└────────────┴─────────────────────────────────────────────────┴─────────────────┴────────────────────────────────────────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────┘
text = "why should I use Ibis?"
search_docs(text)
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ similarity  filepath                                         tokens_estimate  contents                                                                                embedding                                                 ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ float64stringint64stringarray<float64>                                            │
├────────────┼─────────────────────────────────────────────────┼─────────────────┼────────────────────────────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
│   0.621761docs/why.qmd                                   731---\ntitle: "Why Ibis?"\n---\n\nIbis is the portable Python dataframe library.\n\nIf …[-0.00048549860366620123, 0.05761361122131348, ... +1534] │
│   0.589931docs/concepts/who.qmd                          800# Who supports Ibis?\n\nIbis is an open-source project that welcomes contribution…    [-0.006300679408013821, 0.05288177356123924, ... +1534]   │
│   0.580506docs/_tabsets/install_default.qmd              127We recommend starting with the default (DuckDB) backend for a performant, fully…      [-0.004252271726727486, 0.025945203378796577, ... +1534]  │
│   0.564820docs/concepts/user-testimonials.qmd            1079# User testimonials\n\nThis page collects user testimonials about Ibis from the c…    [-0.013335400260984898, 0.034098923206329346, ... +1534]  │
│   0.563156docs/getting-started.qmd                       157---\ntitle: "Getting started"\nsidebar: getting-started\n---\n\nWelcome to the Ibis … [-0.035996999591588974, 0.03943954035639763, ... +1534]   │
│   0.549894docs/_code/setup_penguins.qmd                  85```{python}\nimport ibis  # <1>\nimport ibis.selectors as s  # <1>\n\nibis.options.…  [-0.009995625354349613, 0.014497287571430206, ... +1534]  │
│   0.545518docs/how-to/configure/basics.qmd               611# Basic configuration\n\nIf you don't have your own data, you can load example da…    [-0.02095085009932518, 0.04434989020228386, ... +1534]    │
│   0.542379docs/posts/ibis-version-4.0.0-release/index.qmd743---\ntitle: Ibis v4.0.0\nauthor: Patrick Clarke\ndate: 2023-01-09\ncategories:\n[-0.017321649938821793, 0.05772090330719948, ... +1534]   │
│   0.539636docs/posts/ibis-to-file/index.qmd              700---\ntitle: "Ibis sneak peek: writing to files"\nauthor: Kae Suarez\ndate: 2023-03…   [0.02339361608028412, 0.07014117389917374, ... +1534]     │
│   0.534821docs/_callouts/pypi_warning.qmd                59::: {.callout-warning}\nNote that the `ibis-framework` package is _not_ the same…     [0.0019332439405843616, 0.01002512127161026, ... +1534]   │
│                                                                   │
└────────────┴─────────────────────────────────────────────────┴─────────────────┴────────────────────────────────────────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────┘
text = "what is the Ibis roadmap?"
search_docs(text)
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ similarity  filepath                                         tokens_estimate  contents                                                                                embedding                                                 ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ float64stringint64stringarray<float64>                                            │
├────────────┼─────────────────────────────────────────────────┼─────────────────┼────────────────────────────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
│   0.569501docs/posts/roadmap-2024-H1/index.qmd           2958---\ntitle: "Ibis project 2024 roadmap"\nauthor: "Cody Peterson"\ndate: "2024-02-1…   [-0.011533920653164387, 0.07820627093315125, ... +1534]   │
│   0.563448docs/why.qmd                                   731---\ntitle: "Why Ibis?"\n---\n\nIbis is the portable Python dataframe library.\n\nIf …[-0.00048549860366620123, 0.05761361122131348, ... +1534] │
│   0.555909docs/concepts/who.qmd                          800# Who supports Ibis?\n\nIbis is an open-source project that welcomes contribution…    [-0.006300679408013821, 0.05288177356123924, ... +1534]   │
│   0.544127docs/posts/ibis-version-4.0.0-release/index.qmd743---\ntitle: Ibis v4.0.0\nauthor: Patrick Clarke\ndate: 2023-01-09\ncategories:\n[-0.017321649938821793, 0.05772090330719948, ... +1534]   │
│   0.524721docs/posts/zulip/index.qmd                     440---\ntitle: "Announcing Zulip for Ibis community chat"\nauthor: "Ibis team"\ndate:…   [0.00033197790617123246, 0.05603647977113724, ... +1534]  │
│   0.514110docs/concepts/user-testimonials.qmd            1079# User testimonials\n\nThis page collects user testimonials about Ibis from the c…    [-0.013335400260984898, 0.034098923206329346, ... +1534]  │
│   0.510652docs/posts/Ibis-version-3.0.0-release/index.qmd2038---\ntitle: Ibis v3.0.0\nauthor: Marlene Mhangami\ndate: 2022-04-25\ncategories:\n[-0.02136301063001156, 0.03890605643391609, ... +1534]    │
│   0.507567docs/posts/ibis-to-file/index.qmd              700---\ntitle: "Ibis sneak peek: writing to files"\nauthor: Kae Suarez\ndate: 2023-03…   [0.02339361608028412, 0.07014117389917374, ... +1534]     │
│   0.505812docs/posts/why-voda-supports-                  3056---\ntitle: "Why Voltron Data supports Ibis"\nauthor: "Cody Peterson + Ian Cook"\n[-0.0027232086285948753, 0.0460839718580246, ... +1534]   │
│   0.501638docs/getting-started.qmd                       157---\ntitle: "Getting started"\nsidebar: getting-started\n---\n\nWelcome to the Ibis … [-0.035996999591588974, 0.03943954035639763, ... +1534]   │
│                                                                   │
└────────────┴─────────────────────────────────────────────────┴─────────────────┴────────────────────────────────────────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────┘

RAG pitfalls

As in traditional software engineering and machine learning, the architecture when building systems around language models is not a universal. The best depends on the application of the system. Using cosine (or otherwise) similarity search over text transformed to numbers may not always perform best.

We can demonstrate this with the following scenario: let’s assume we’re building a chatbot that writes and runs SQL on the user’s behalf based on their questions posed in English (or any language the language model understand sufficiently). We might store past queries, the SQL generated, and some evaluation criteria (upvote/downvote on the generated SQL from the user) in a database. Then we can use the same RAG approach above to augment our queries with the user’s past queries and the generated SQL.

Take the following English query:

a = """the first 10 rows of the penguins, ordered by body_mass_g from
lightest to heaviest"""

We can compute the cosine similarity of this query against itself, resulting in 1.0 or close to it:

list_cosine_similarity(_embed(a), _embed(a))

0.9999994546885027

Now, let’s flip the ordering. This would result in an entirely different SQL query and result set:

b = """the first 10 rows of the penguins, ordered by body_mass_g from
heaviest to lightest"""

And compute the cosine similarity of a and b:

list_cosine_similarity(_embed(a), _embed(b))

0.9934337761234602

Similarly, we can construct a semantically equivalent query that would result in the same SQL as example a:

c = """of the birds retrieve the initial ten opposite-of-columns,
sorted from biggest to smallest by weight"""

That has a much lower cosine similarity in the embedding space:

list_cosine_similarity(_embed(a), _embed(c))

0.5682471739636249

Despite semantic equivalence, these two queries would not be considered similar. You need to be careful and understand what the system is doing.

Discussion

Is traditional RAG the best approach? It’s unclear, and it depends.

System architecture

Often RAG-based systems introduce yet-another-database to store the embeddings. Hopefully this post has demonstrated that traditional database options can work just fine. While we’re only demonstrating this on a few rows here, DuckDB can scale up to many millions with ease.

Cost and speed

The cost of running this demo from OpenAI is less than $0.01 USD. It also runs in about a minute, but with many rows this would be much slower. You need to consider your architecture here. One option would be using an open source embedding model that can run locally. Swap out the OpenAI calls with that and see how it goes!

Alternatives to vector search RAG

It’s clear that we should always be augmenting our language model calls with the most relevant context, even if traditional vector search RAG is not the best solution. The problem then becomes the best way to retrieve that context for a given system’s desired behavior.

Abolish RAG as terminology?

“I would like to abolish the term RAG and instead just agree that we should always try to provide models with the appropriate context to provide high quality answers.” - Hamel Husain

Some alternatives to vector search RAG include:

  • traditional search methods
  • using an external web search engine
  • using a language model for search

The last point is particularly interesting – in the above examples with a and c strings, cosine similarity in the embedding space could not tell the queries are semantically equivalent. However, a sufficiently good language model certainly could. As we demonstrated in language models for data, we could even setup a classifier and use the logit bias trick to compute similarity with a single output token.

This trick makes the language model faster, but for practical applications the cost and speed of language models for search is still a concern. We can expect this concern to be mitigated over time as language models become faster and cheaper to run.

With larger context lengths, we could also use an initial call to a language model to extract out relevant sections of text and then make a second call with that augmented context.

Next steps

Try this out yourself! All you need is an OpenAI account and the code above.

It’s never been a better time to get involved with Ibis. Join us on Zulip and introduce yourself! More Ibis + language model content coming soon.

Back to top