Using DuckDB + Ibis for RAG

Overview

In this post, we’ll demonstrate traditional retrieval-augmented generation (RAG) with DuckDB and OpenAI via Ibis and discuss the pros and cons. Notice that because Ibis is portable, you can use (almost) any database Ibis supports for RAG!

Which databases can I use?

The database must support array types and have some form of similarity metric between arrays of numbers. Alternatively, a custom user-defined function (UDF) can be used for the similarity metric.

The performance of calculating similarity will also be much faster if the database supports fixed-sized arrays as DuckDB recently launched in version 0.10.0. We’re still using 0.9.2 in this post, but it’ll be easy to upgrade.

DuckDB is the default backend for Ibis and makes rapid prototyping locally very easy. You can deploy it to MotherDuck just by changing the connection string, or swap out DuckDB for another Ibis backend.

What is RAG?

Retrieval-augmented generation (RAG) is a technique for language model systems to retrieve relevant information from a knowledge base, augment the model’s input with that knowledge, and then generate a response. It serves to provide useful context given language models’ context length limitations and increase the quality of the response.

Is RAG dead?

With increasing language model context lengths, is RAG dead?

No. Though as we’ll discuss later, it’s unclear if traditional vector similarity search RAG is the best approach. Regardless, just because a language model can use the entire content of a textbook doesn’t mean you’ll get better performance (accuracy or efficiency) than first retrieving the relevant subsets of the textbook.

Traditional RAG embeds text into vectors of numbers and then computes a similarity metric between input text and the knowledge base to determine the most relevant context.

graph LR
    style T fill:#1E90FF,stroke:#1E90FF,color:#ffffff
    style N fill:#1E90FF,stroke:#1E90FF,color:#ffffff
    style C fill:#1E90FF,stroke:#1E90FF,color:#ffffff

    T("text") --> N("numbers (embedding vector)")
    N --> C("compute similarity metric")

Let’s jump into the code!

RAG over Ibis documentation

We’ll use the Ibis documentation as our knowledge base.

Setup code

First we need to install and import the necessary packages:

pip install 'ibis-framework[duckdb]' openai python-dotenv

import os
import ibis

from openai import OpenAI
from dotenv import load_dotenv
from pathlib import Path

1: Import the libraries we’ll use.

We need to set the OPENAI_API_KEY environment variable to our OpenAI API key. Do this by creating a .env file in the same directory as the code and adding the following line:

OPENAI_API_KEY=<your-api-key>

1: Replace with your OpenAI API key.

Then load the environment variables:

load_dotenv()

1: Load environment variables from the .env file.

True

Now we’re ready to make OpenAI calls.

Let’s also setup Ibis for interactive use:

ibis.options.interactive = True

1: Turn on Ibis interactive mode.

Getting the data into DuckDB

Let’s prepare to ingest all the Quarto markdown files into a DuckDB database called rag.ddb:

rag_con = ibis.connect("duckdb://rag.ddb")

for table in rag_con.list_tables():
    rag_con.drop_table(table)

rag_con.list_tables()

1: Create the Ibis connection to the DuckDB database.
2: Drop any existing tables from previous runs.
3: List the tables in the database to show that there are none.

[]

We’ll walk through the Ibis docs/ directory. For .qmd files, read the contents and insert them into the ibis table:

Tip

We are running this blog within the Ibis repository. To run this on your own data, adjust the code accordingly!

table_name = "ibis"

for filepath in Path(ibis.__file__).parents[1].joinpath("docs").rglob("*.qmd"):
    contents = filepath.read_text()

    data = {
        "filepath": [str(filepath).split("ibis/")[1]],
        "contents": [contents],
    }

    t = ibis.memtable(data)

    if "ibis" not in rag_con.list_tables():
        rag_con.create_table("ibis", t)
    else:
        rag_con.insert("ibis", t)

for table in rag_con.list_tables():
    if "memtable" in table:
        rag_con.drop_view(table)

rag_con.list_tables()

1: Define the table name.
2: Walk through the docs/ directory and find all .qmd files.
3: Read the contents of each file.
4: Create a dictionary with the file path and contents.
5: Create an Ibis memtable from the dictionary.
6: If the table doesn’t exist, create it and insert the data.
7: Otherwise, insert the data into the existing table.
8: Drop any memtables that were created.
9: List the tables in the database to show that the ibis table was created.

['ibis']

Let’s inspect the table:

t = rag_con.table(table_name)
t

1: Get the ibis table from the DuckDB database.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ filepath                               ┃ contents                                                                                  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ string                                 │ string                                                                                    │
├────────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ docs/why.qmd                           │ ---\ntitle: "Why Ibis?"\n---\n\nIbis is the portable Python dataframe library.\n\nIf …    │
│ docs/404.qmd                           │ # Page not found\n\nThe page you requested cannot be found (perhaps it was moved …        │
│ docs/support_matrix.qmd                │ ---\ntitle: "Operation support matrix"\nformat: dashboard\nhide:\n  - toc\n---\n\n## {…   │
│ docs/index.qmd                         │ ---\ntitle: "Ibis"\ndescription: "the portable Python dataframe library"\nrepo-act…       │
│ docs/install.qmd                       │ ---\nhide:\n  - toc\n  - navigation\n  - footer\n---\n\n# Installation\n\n{{< include ./… │
│ docs/getting-started.qmd               │ ---\ntitle: "Getting started"\nsidebar: getting-started\n---\n\nWelcome to the Ibis …     │
│ docs/posts.qmd                         │ ---\ntitle: "Posts"\nlisting:\n  contents: posts\n  sort: "date desc"\n  image-place…     │
│ docs/presentations.qmd                 │ ---\ntitle: "Presentations"\nsidebar: presentations\n---\n\nIbis presentations are l…     │
│ docs/posts/bigquery-arrays/index.qmd   │ ---\ntitle: Working with arrays in Google BigQuery\nauthor: "Phillip Cloud"\ndate:…       │
│ docs/posts/website-to-quarto/index.qmd │ ---\ntitle: "Migrating Ibis documentation to Quarto"\nauthor: "Cody"\ndate: "2023-…       │
│ …                                      │ …                                                                                         │
└────────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────┘

Now we have a DuckDB database with the ibis table containing all the file paths and contents of the Quarto markdown files in the ibis repository. You can adjust this code to work with any repository and any file type(s)! For instance, we may also want to ingest all the .py files in the Ibis repsoitory.

Embedding text to vectors

To search for similar text in traditional RAG fashion, we need to embed the text into vectors. We can use OpenAI’s embedding models to perform this step.

Note that the maximum length of input tokens is 8,191. A word typically consists of 1-3 tokens. To estimate the number of tokens of a string in Python, we can divide the length of that string by 4. This isn’t perfect, but will work for now. Let’s augmented our data with the estimated number of tokens and sort from largest to smallest:

t = (
    t.mutate(tokens_estimate=t["contents"].length() // 4)
    .order_by(ibis._["tokens_estimate"].desc())
    .relocate("filepath", "tokens_estimate")
)
t

1: Augment the table with the estimated number of tokens.
2: Sort the table by the estimated number of tokens in descending order.
3: Relocate the filepath and tokens_estimate columns to the front of the table.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ filepath                                      ┃ tokens_estimate ┃ contents                                                                             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ string                                        │ int64           │ string                                                                               │
├───────────────────────────────────────────────┼─────────────────┼──────────────────────────────────────────────────────────────────────────────────────┤
│ docs/posts/ibis-analytics/index.qmd           │           14401 │ ---\ntitle: Modern, hybrid, open analytics\nauthor: "Cody"\ndate: "2024-01-25"\nima… │
│ docs/backends/impala.qmd                      │            7677 │ # Impala\n\n[https://impala.apache.org](https://impala.apache.org)\n\n![](https://i… │
│ docs/tutorials/ibis-for-sql-users.qmd         │            6777 │ # Tutorial: Ibis for SQL users\n\n## Prerequisites\n\n{{< include ../_tabsets/insta… │
│ docs/posts/lms-for-data/index.qmd             │            5508 │ ---\ntitle: "Using language models for data"\nauthor: "Cody Peterson"\ndate: "2024…  │
│ docs/posts/pydata-performance/index.qmd       │            4801 │ ---\ntitle: "Ibis versus X: Performance across the ecosystem part 1"\nauthor: "Ph…   │
│ docs/tutorials/ibis-for-pandas-users.qmd      │            4455 │ # Tutorial: Ibis for pandas users\n\n## Prerequisites\n\n{{< include ../_tabsets/in… │
│ docs/posts/campaign-finance/index.qmd         │            4241 │ ---\ntitle: "Exploring campaign finance data"\nauthor: "Nick Crews"\ndate: "2023-0…  │
│ docs/posts/duckdb-for-rag/index.qmd           │            3951 │ ---\ntitle: "Using DuckDB + Ibis for RAG"\nauthor: "Cody Peterson"\ndate: "2024-02…  │
│ docs/posts/backend-agnostic-arrays/index.qmd  │            3471 │ ---\ntitle: Backend agnostic arrays\nauthor: "Phillip Cloud"\ndate: 2024-01-19\ncat… │
│ docs/posts/ibis_substrait_to_duckdb/index.qmd │            3323 │ ---\ntitle: "Ibis + Substrait + DuckDB"\nauthor: Gil Forsyth\ndate: 2023-02-01\ncat… │
│ …                                             │               … │ …                                                                                    │
└───────────────────────────────────────────────┴─────────────────┴──────────────────────────────────────────────────────────────────────────────────────┘

The longest text is certainly over our embedding model’s token limit, and the second longest might be as well. In practice, we’d want to chunk our text into smaller pieces. This is its own challenge, but for the purposes of this demonstration we’ll just ignore text that is too long.

Let’s define our embedding functions:

def _embed(text: str) -> list[float]:
    """Text to fixed-length array embedding."""

    model = "text-embedding-3-small"
    text = text.replace("\n", " ")
    client = OpenAI()

    try:
        return (
            client.embeddings.create(input=[text], model=model).data[0].embedding
        )
    except Exception as e:
        print(e)
    return None


@ibis.udf.scalar.python
def embed(text: str, tokens_estimate: int) -> list[float]:
    """Text to fixed-length array embedding."""

    if 0 < tokens_estimate < 8191:
        return _embed(text)
    return None

1: Define a regular Python function to embed text.
2: Define the embedding model to use.
3: Replace newlines with spaces.
4: Create an OpenAI client.
5: Since it’s an external API call, we need to handle exceptions.
6: Call the OpenAI API to embed the text.
7: Handle exceptions and return None if an error occurs.
8: Define an Ibis UDF using the Python function.
9: If the estimated number of tokens is within the limit…
10: …call the Python function to embed the text.
11: Otherwise, return None.

Let’s embed the text:

t = t.mutate(
    embedding=embed(t["contents"], t["tokens_estimate"])
).cache()
t

1: Augment the table with the embedded text and cache the result.

Error code: 400 - {'error': {'message': "This model's maximum context length is 8192 tokens, however you requested 8413 tokens (8413 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.", 'type': 'invalid_request_error', 'param': None, 'code': None}}

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ filepath                                      ┃ tokens_estimate ┃ contents                                                                             ┃ embedding                                                   ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ string                                        │ int64           │ string                                                                               │ array<float64>                                              │
├───────────────────────────────────────────────┼─────────────────┼──────────────────────────────────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────────────┤
│ docs/posts/ibis-analytics/index.qmd           │           14401 │ ---\ntitle: Modern, hybrid, open analytics\nauthor: "Cody"\ndate: "2024-01-25"\nima… │ NULL                                                        │
│ docs/backends/impala.qmd                      │            7677 │ # Impala\n\n[https://impala.apache.org](https://impala.apache.org)\n\n![](https://i… │ NULL                                                        │
│ docs/tutorials/ibis-for-sql-users.qmd         │            6777 │ # Tutorial: Ibis for SQL users\n\n## Prerequisites\n\n{{< include ../_tabsets/insta… │ [-0.032762881368398666, 0.03934009373188019, ... +1534]     │
│ docs/posts/lms-for-data/index.qmd             │            5508 │ ---\ntitle: "Using language models for data"\nauthor: "Cody Peterson"\ndate: "2024…  │ [-0.019190069288015366, 0.041595082730054855, ... +1534]    │
│ docs/posts/pydata-performance/index.qmd       │            4801 │ ---\ntitle: "Ibis versus X: Performance across the ecosystem part 1"\nauthor: "Ph…   │ [0.00992082990705967, 0.00490616587921977, ... +1534]       │
│ docs/tutorials/ibis-for-pandas-users.qmd      │            4455 │ # Tutorial: Ibis for pandas users\n\n## Prerequisites\n\n{{< include ../_tabsets/in… │ [-0.007494321092963219, 0.028004156425595284, ... +1534]    │
│ docs/posts/campaign-finance/index.qmd         │            4241 │ ---\ntitle: "Exploring campaign finance data"\nauthor: "Nick Crews"\ndate: "2023-0…  │ [-0.01551054883748293, 0.01505258958786726, ... +1534]      │
│ docs/posts/duckdb-for-rag/index.qmd           │            3951 │ ---\ntitle: "Using DuckDB + Ibis for RAG"\nauthor: "Cody Peterson"\ndate: "2024-02…  │ [-0.018623681738972664, 0.05166473984718323, ... +1534]     │
│ docs/posts/backend-agnostic-arrays/index.qmd  │            3471 │ ---\ntitle: Backend agnostic arrays\nauthor: "Phillip Cloud"\ndate: 2024-01-19\ncat… │ [-0.0014422409003600478, -0.0041314102709293365, ... +1534] │
│ docs/posts/ibis_substrait_to_duckdb/index.qmd │            3323 │ ---\ntitle: "Ibis + Substrait + DuckDB"\nauthor: Gil Forsyth\ndate: 2023-02-01\ncat… │ [-0.02641780860722065, 0.06054646894335747, ... +1534]      │
│ …                                             │               … │ …                                                                                    │ …                                                           │
└───────────────────────────────────────────────┴─────────────────┴──────────────────────────────────────────────────────────────────────────────────────┴─────────────────────────────────────────────────────────────┘

Notice there was one error – that second row’s content was indeed too long! Our token estimate is a bit off, but overall useful. We accounted for this possibility in our Python UDF above, returning None (mapped to NULL in DuckDB) when the text is too long or an error occurs.

Searching with cosine similarity

Cosine similarity over vectors is a common way to measure similarity for RAG applications. We’ll explore its downsides later. We need to make Ibis aware of DuckDB’s built-in list_cosine_similarity function:

@ibis.udf.scalar.builtin
def list_cosine_similarity(x, y) -> float:
    """Compute cosine similarity between two vectors."""

1: Use an Ibis builtin UDF to define a cosine similarity function.

Tip

If using a different backend, you’ll need to use its corresponding cosine similarity function or define your own.

Now we can search for similar text in the documentation:

def search_docs(text):
    """Search documentation for similar text, returning a sorted table"""

    embedding = _embed(text)

    s = (
        t.mutate(similarity=list_cosine_similarity(t["embedding"], embedding))
        .relocate("similarity")
        .order_by(ibis._["similarity"].desc())
        .cache()
    )

    return s

1: Define a function to search the documentation.
2: Embed the input text.
3: Augment the table with the cosine similarity between the embedded text and the input text.
4: Relocate the similarity column to the front of the table.
5: Sort the table by the cosine similarity in descending order.
6: Cache the result.

text = "where can I chat with the community about Ibis?"
search_docs(text)

┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ similarity ┃ filepath                                                     ┃ tokens_estimate ┃ contents                                                                               ┃ embedding                                                 ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ float64    │ string                                                       │ int64           │ string                                                                                 │ array<float64>                                            │
├────────────┼──────────────────────────────────────────────────────────────┼─────────────────┼────────────────────────────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
│   0.661852 │ docs/posts/zulip/index.qmd                                   │             440 │ ---\ntitle: "Announcing Zulip for Ibis community chat"\nauthor: "Ibis team"\ndate:…    │ [0.00033197790617123246, 0.05603647977113724, ... +1534]  │
│   0.578235 │ docs/concepts/who.qmd                                        │             800 │ # Who supports Ibis?\n\nIbis is an open-source project that welcomes contribution…     │ [-0.006300679408013821, 0.05288177356123924, ... +1534]   │
│   0.569246 │ docs/why.qmd                                                 │             731 │ ---\ntitle: "Why Ibis?"\n---\n\nIbis is the portable Python dataframe library.\n\nIf … │ [-0.00048549860366620123, 0.05761361122131348, ... +1534] │
│   0.560286 │ docs/concepts/user-testimonials.qmd                          │            1079 │ # User testimonials\n\nThis page collects user testimonials about Ibis from the c…     │ [-0.013335400260984898, 0.034098923206329346, ... +1534]  │
│   0.551246 │ docs/getting-started.qmd                                     │             157 │ ---\ntitle: "Getting started"\nsidebar: getting-started\n---\n\nWelcome to the Ibis …  │ [-0.035996999591588974, 0.03943954035639763, ... +1534]   │
│   0.527626 │ docs/contribute/02_workflow.qmd                              │             574 │ # Contribute to the Ibis codebase\n\n## Getting started\n\nFirst, set up a [develop…   │ [-0.01897418312728405, 0.05570495128631592, ... +1534]    │
│   0.522431 │ docs/posts/ibis-version-4.0.0-release/index.qmd              │             743 │ ---\ntitle: Ibis v4.0.0\nauthor: Patrick Clarke\ndate: 2023-01-09\ncategories:\n    …  │ [-0.017321649938821793, 0.05772090330719948, ... +1534]   │
│   0.519414 │ docs/posts/Ibis-version-3.0.0-release/index.qmd              │            2038 │ ---\ntitle: Ibis v3.0.0\nauthor: Marlene Mhangami\ndate: 2022-04-25\ncategories:\n  …  │ [-0.02136301063001156, 0.03890605643391609, ... +1534]    │
│   0.517491 │ docs/tutorials/open-source-software/apache-flink/0_setup.qmd │             425 │ # Getting started\n\nIn this tutorial, you will learn how to set up and use Flink…     │ [-0.049202971160411835, 0.037027861922979355, ... +1534]  │
│   0.510845 │ docs/_code/setup_penguins.qmd                                │              85 │ ```{python}\nimport ibis  # <1>\nimport ibis.selectors as s  # <1>\n\nibis.options.…   │ [-0.009995625354349613, 0.014497287571430206, ... +1534]  │
│          … │ …                                                            │               … │ …                                                                                      │ …                                                         │
└────────────┴──────────────────────────────────────────────────────────────┴─────────────────┴────────────────────────────────────────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────┘

Now that we have retrieved the most similar documentation, we can augment our language model’s input with that context prior to generating a response! In practice, we’d probably want to set a similarity threshold and take the top N results. Chunking our text into smaller pieces and selecting from those results would also be a good idea.

Let’s try a few more queries:

text = "what do users say about Ibis?"
search_docs(text)

┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ similarity ┃ filepath                                        ┃ tokens_estimate ┃ contents                                                                               ┃ embedding                                                 ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ float64    │ string                                          │ int64           │ string                                                                                 │ array<float64>                                            │
├────────────┼─────────────────────────────────────────────────┼─────────────────┼────────────────────────────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
│   0.574682 │ docs/concepts/user-testimonials.qmd             │            1079 │ # User testimonials\n\nThis page collects user testimonials about Ibis from the c…     │ [-0.013335400260984898, 0.034098923206329346, ... +1534]  │
│   0.562258 │ docs/why.qmd                                    │             731 │ ---\ntitle: "Why Ibis?"\n---\n\nIbis is the portable Python dataframe library.\n\nIf … │ [-0.00048549860366620123, 0.05761361122131348, ... +1534] │
│   0.544948 │ docs/concepts/who.qmd                           │             800 │ # Who supports Ibis?\n\nIbis is an open-source project that welcomes contribution…     │ [-0.006300679408013821, 0.05288177356123924, ... +1534]   │
│   0.528427 │ docs/posts/zulip/index.qmd                      │             440 │ ---\ntitle: "Announcing Zulip for Ibis community chat"\nauthor: "Ibis team"\ndate:…    │ [0.00033197790617123246, 0.05603647977113724, ... +1534]  │
│   0.523075 │ docs/getting-started.qmd                        │             157 │ ---\ntitle: "Getting started"\nsidebar: getting-started\n---\n\nWelcome to the Ibis …  │ [-0.035996999591588974, 0.03943954035639763, ... +1534]   │
│   0.516679 │ docs/posts/ibis-version-4.0.0-release/index.qmd │             743 │ ---\ntitle: Ibis v4.0.0\nauthor: Patrick Clarke\ndate: 2023-01-09\ncategories:\n    …  │ [-0.017321649938821793, 0.05772090330719948, ... +1534]   │
│   0.500733 │ docs/_tabsets/install_default.qmd               │             127 │ We recommend starting with the default (DuckDB) backend for a performant, fully…       │ [-0.004252271726727486, 0.025945203378796577, ... +1534]  │
│   0.497463 │ docs/how-to/configure/basics.qmd                │             611 │ # Basic configuration\n\nIf you don't have your own data, you can load example da…     │ [-0.02095085009932518, 0.04434989020228386, ... +1534]    │
│   0.491889 │ docs/posts/ibis-to-file/index.qmd               │             700 │ ---\ntitle: "Ibis sneak peek: writing to files"\nauthor: Kae Suarez\ndate: 2023-03…    │ [0.02339361608028412, 0.07014117389917374, ... +1534]     │
│   0.484358 │ docs/_code/setup_penguins.qmd                   │              85 │ ```{python}\nimport ibis  # <1>\nimport ibis.selectors as s  # <1>\n\nibis.options.…   │ [-0.009995625354349613, 0.014497287571430206, ... +1534]  │
│          … │ …                                               │               … │ …                                                                                      │ …                                                         │
└────────────┴─────────────────────────────────────────────────┴─────────────────┴────────────────────────────────────────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────┘

text = "can Ibis complete the one billion row challenge?"
search_docs(text)

┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ similarity ┃ filepath                                        ┃ tokens_estimate ┃ contents                                                                               ┃ embedding                                                 ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ float64    │ string                                          │ int64           │ string                                                                                 │ array<float64>                                            │
├────────────┼─────────────────────────────────────────────────┼─────────────────┼────────────────────────────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
│   0.608117 │ docs/posts/1brc/index.qmd                       │            2129 │ ---\ntitle: "Using one Python dataframe API to take the billion row challenge wi…      │ [0.013509240001440048, -0.0003828565531875938, ... +1534] │
│   0.504276 │ docs/concepts/who.qmd                           │             800 │ # Who supports Ibis?\n\nIbis is an open-source project that welcomes contribution…     │ [-0.006300679408013821, 0.05288177356123924, ... +1534]   │
│   0.482052 │ docs/presentations/overview.qmd                 │            1697 │ ---\ntitle: "Ibis overview"\nformat:\n  revealjs:\n    footer: <https://ibis-projec…   │ [-0.0015606271335855126, 0.019388146698474884, ... +1534] │
│   0.473564 │ docs/why.qmd                                    │             731 │ ---\ntitle: "Why Ibis?"\n---\n\nIbis is the portable Python dataframe library.\n\nIf … │ [-0.00048549860366620123, 0.05761361122131348, ... +1534] │
│   0.468842 │ docs/posts/Ibis-version-3.0.0-release/index.qmd │            2038 │ ---\ntitle: Ibis v3.0.0\nauthor: Marlene Mhangami\ndate: 2022-04-25\ncategories:\n  …  │ [-0.02136301063001156, 0.03890605643391609, ... +1534]    │
│   0.465956 │ docs/tutorials/data-platforms/clickhouse.qmd    │             826 │ ---\ntitle: ClickHouse\nfreeze: auto\n---\n\n[Ibis](https://ibis-project.com) suppor…  │ [-0.027811484411358833, 0.02419976517558098, ... +1534]   │
│   0.465920 │ docs/concepts/user-testimonials.qmd             │            1079 │ # User testimonials\n\nThis page collects user testimonials about Ibis from the c…     │ [-0.013335400260984898, 0.034098923206329346, ... +1534]  │
│   0.456237 │ docs/_code/setup_penguins.qmd                   │              85 │ ```{python}\nimport ibis  # <1>\nimport ibis.selectors as s  # <1>\n\nibis.options.…   │ [-0.009995625354349613, 0.014497287571430206, ... +1534]  │
│   0.455206 │ docs/how-to/visualization/seaborn.qmd           │             159 │ # seaborn + Ibis\n\nIf you don't have data to visualize, you can load an example …     │ [-0.012414666824042797, 0.012007818557322025, ... +1534]  │
│   0.453313 │ docs/how-to/input-output/duckdb-parquet.qmd     │             651 │ ---\ntitle: Read parquet files with Ibis\n---\n\nIn this example, we will use Ibis'…   │ [0.01082119531929493, 0.04467007517814636, ... +1534]     │
│          … │ …                                               │               … │ …                                                                                      │ …                                                         │
└────────────┴─────────────────────────────────────────────────┴─────────────────┴────────────────────────────────────────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────┘

text = "teach me about the Polars backend"
search_docs(text)

┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ similarity ┃ filepath                                        ┃ tokens_estimate ┃ contents                                                                               ┃ embedding                                                 ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ float64    │ string                                          │ int64           │ string                                                                                 │ array<float64>                                            │
├────────────┼─────────────────────────────────────────────────┼─────────────────┼────────────────────────────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
│   0.597506 │ docs/backends/polars.qmd                        │             402 │ # Polars\n\n[https://www.pola.rs](https://www.pola.rs)\n\n{{< include /_callouts/ex…   │ [-0.04319469630718231, 0.030947020277380943, ... +1534]   │
│   0.396795 │ docs/posts/pydata-performance-part2/index.qmd   │            1567 │ ---\ntitle: "Ibis versus X: Performance across the ecosystem part 2"\nauthor: "Ph…     │ [-0.0013977460330352187, 0.028734903782606125, ... +1534] │
│   0.364785 │ docs/posts/ibis-version-4.0.0-release/index.qmd │             743 │ ---\ntitle: Ibis v4.0.0\nauthor: Patrick Clarke\ndate: 2023-01-09\ncategories:\n    …  │ [-0.017321649938821793, 0.05772090330719948, ... +1534]   │
│   0.356085 │ docs/_callouts/experimental_backend.qmd         │              26 │ ::: {.callout-warning}\nThis backend is experimental and is subject to backwards…      │ [0.017363416031003, 0.03505472466349602, ... +1534]       │
│   0.326378 │ docs/how-to/input-output/multiple-backends.qmd  │             331 │ # Work with multiple backends\n\nYou can work with multiple backends by creating …     │ [-0.014367456547915936, 0.01823212020099163, ... +1534]   │
│   0.324941 │ docs/presentations/overview.qmd                 │            1697 │ ---\ntitle: "Ibis overview"\nformat:\n  revealjs:\n    footer: <https://ibis-projec…   │ [-0.0015606271335855126, 0.019388146698474884, ... +1534] │
│   0.315046 │ docs/backends/_templates/api.qmd                │             121 │ ```{python}\n#| echo: false\n#| output: asis\n\nfrom _utils import get_backend, ren…   │ [-0.013808815740048885, 0.03160477802157402, ... +1534]   │
│   0.311201 │ docs/backends/postgresql.qmd                    │             493 │ # PostgreSQL\n\n[https://www.postgresql.org](https://www.postgresql.org)\n\n![](htt…   │ [-0.046312808990478516, 0.031841542571783066, ... +1534]  │
│   0.300936 │ docs/posts/ibis-version-8.0.0-release/index.qmd │            2062 │ ---\ntitle: "Ibis 8.0: streaming and more!"\nauthor: "Ibis team"\ndate: "2024-02-1…    │ [0.005049674306064844, 0.03530928120017052, ... +1534]    │
│   0.298417 │ docs/why.qmd                                    │             731 │ ---\ntitle: "Why Ibis?"\n---\n\nIbis is the portable Python dataframe library.\n\nIf … │ [-0.00048549860366620123, 0.05761361122131348, ... +1534] │
│          … │ …                                               │               … │ …                                                                                      │ …                                                         │
└────────────┴─────────────────────────────────────────────────┴─────────────────┴────────────────────────────────────────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────┘

text = "why does Voltron Data support Ibis?"
search_docs(text)

┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ similarity ┃ filepath                                        ┃ tokens_estimate ┃ contents                                                                               ┃ embedding                                                 ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ float64    │ string                                          │ int64           │ string                                                                                 │ array<float64>                                            │
├────────────┼─────────────────────────────────────────────────┼─────────────────┼────────────────────────────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
│   0.679104 │ docs/concepts/who.qmd                           │             800 │ # Who supports Ibis?\n\nIbis is an open-source project that welcomes contribution…     │ [-0.006300679408013821, 0.05288177356123924, ... +1534]   │
│   0.616018 │ docs/posts/why-voda-supports-                   │            3056 │ ---\ntitle: "Why Voltron Data supports Ibis"\nauthor: "Cody Peterson + Ian Cook"\n…    │ [-0.0027232086285948753, 0.0460839718580246, ... +1534]   │
│   0.582328 │ docs/why.qmd                                    │             731 │ ---\ntitle: "Why Ibis?"\n---\n\nIbis is the portable Python dataframe library.\n\nIf … │ [-0.00048549860366620123, 0.05761361122131348, ... +1534] │
│   0.536979 │ docs/_code/input_output_penguins.qmd            │             947 │ ## Data platforms\n\nYou can connect Ibis to any supported backend to read and wr…     │ [-0.009215430356562138, 0.020675163716077805, ... +1534]  │
│   0.528810 │ docs/concepts/user-testimonials.qmd             │            1079 │ # User testimonials\n\nThis page collects user testimonials about Ibis from the c…     │ [-0.013335400260984898, 0.034098923206329346, ... +1534]  │
│   0.523802 │ docs/presentations/overview.qmd                 │            1697 │ ---\ntitle: "Ibis overview"\nformat:\n  revealjs:\n    footer: <https://ibis-projec…   │ [-0.0015606271335855126, 0.019388146698474884, ... +1534] │
│   0.516846 │ docs/posts/ibis-version-4.0.0-release/index.qmd │             743 │ ---\ntitle: Ibis v4.0.0\nauthor: Patrick Clarke\ndate: 2023-01-09\ncategories:\n    …  │ [-0.017321649938821793, 0.05772090330719948, ... +1534]   │
│   0.515140 │ docs/concepts/composable-ecosystem.qmd          │            1490 │ # Composable data ecosystem\n\nIbis exists in a broader composable data ecosystem…     │ [-0.0018551592947915196, 0.0542871318757534, ... +1534]   │
│   0.513805 │ docs/backends/trino.qmd                         │             743 │ # Trino\n\n[https://trino.io](https://trino.io)\n\n{{< include /_callouts/experimen…   │ [-0.04838176444172859, 0.04063660651445389, ... +1534]    │
│   0.500496 │ docs/how-to/input-output/basics.qmd             │             105 │ # Basic input/output\n\nIf you don't have your own data, you can load example dat…     │ [-0.00725388852879405, 0.047017041593790054, ... +1534]   │
│          … │ …                                               │               … │ …                                                                                      │ …                                                         │
└────────────┴─────────────────────────────────────────────────┴─────────────────┴────────────────────────────────────────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────┘

text = "why should I use Ibis?"
search_docs(text)

┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ similarity ┃ filepath                                        ┃ tokens_estimate ┃ contents                                                                               ┃ embedding                                                 ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ float64    │ string                                          │ int64           │ string                                                                                 │ array<float64>                                            │
├────────────┼─────────────────────────────────────────────────┼─────────────────┼────────────────────────────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
│   0.621761 │ docs/why.qmd                                    │             731 │ ---\ntitle: "Why Ibis?"\n---\n\nIbis is the portable Python dataframe library.\n\nIf … │ [-0.00048549860366620123, 0.05761361122131348, ... +1534] │
│   0.589931 │ docs/concepts/who.qmd                           │             800 │ # Who supports Ibis?\n\nIbis is an open-source project that welcomes contribution…     │ [-0.006300679408013821, 0.05288177356123924, ... +1534]   │
│   0.580506 │ docs/_tabsets/install_default.qmd               │             127 │ We recommend starting with the default (DuckDB) backend for a performant, fully…       │ [-0.004252271726727486, 0.025945203378796577, ... +1534]  │
│   0.564820 │ docs/concepts/user-testimonials.qmd             │            1079 │ # User testimonials\n\nThis page collects user testimonials about Ibis from the c…     │ [-0.013335400260984898, 0.034098923206329346, ... +1534]  │
│   0.563156 │ docs/getting-started.qmd                        │             157 │ ---\ntitle: "Getting started"\nsidebar: getting-started\n---\n\nWelcome to the Ibis …  │ [-0.035996999591588974, 0.03943954035639763, ... +1534]   │
│   0.549894 │ docs/_code/setup_penguins.qmd                   │              85 │ ```{python}\nimport ibis  # <1>\nimport ibis.selectors as s  # <1>\n\nibis.options.…   │ [-0.009995625354349613, 0.014497287571430206, ... +1534]  │
│   0.545518 │ docs/how-to/configure/basics.qmd                │             611 │ # Basic configuration\n\nIf you don't have your own data, you can load example da…     │ [-0.02095085009932518, 0.04434989020228386, ... +1534]    │
│   0.542379 │ docs/posts/ibis-version-4.0.0-release/index.qmd │             743 │ ---\ntitle: Ibis v4.0.0\nauthor: Patrick Clarke\ndate: 2023-01-09\ncategories:\n    …  │ [-0.017321649938821793, 0.05772090330719948, ... +1534]   │
│   0.539636 │ docs/posts/ibis-to-file/index.qmd               │             700 │ ---\ntitle: "Ibis sneak peek: writing to files"\nauthor: Kae Suarez\ndate: 2023-03…    │ [0.02339361608028412, 0.07014117389917374, ... +1534]     │
│   0.534821 │ docs/_callouts/pypi_warning.qmd                 │              59 │ ::: {.callout-warning}\nNote that the `ibis-framework` package is _not_ the same…      │ [0.0019332439405843616, 0.01002512127161026, ... +1534]   │
│          … │ …                                               │               … │ …                                                                                      │ …                                                         │
└────────────┴─────────────────────────────────────────────────┴─────────────────┴────────────────────────────────────────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────┘

text = "what is the Ibis roadmap?"
search_docs(text)

┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ similarity ┃ filepath                                        ┃ tokens_estimate ┃ contents                                                                               ┃ embedding                                                 ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ float64    │ string                                          │ int64           │ string                                                                                 │ array<float64>                                            │
├────────────┼─────────────────────────────────────────────────┼─────────────────┼────────────────────────────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
│   0.569501 │ docs/posts/roadmap-2024-H1/index.qmd            │            2958 │ ---\ntitle: "Ibis project 2024 roadmap"\nauthor: "Cody Peterson"\ndate: "2024-02-1…    │ [-0.011533920653164387, 0.07820627093315125, ... +1534]   │
│   0.563448 │ docs/why.qmd                                    │             731 │ ---\ntitle: "Why Ibis?"\n---\n\nIbis is the portable Python dataframe library.\n\nIf … │ [-0.00048549860366620123, 0.05761361122131348, ... +1534] │
│   0.555909 │ docs/concepts/who.qmd                           │             800 │ # Who supports Ibis?\n\nIbis is an open-source project that welcomes contribution…     │ [-0.006300679408013821, 0.05288177356123924, ... +1534]   │
│   0.544127 │ docs/posts/ibis-version-4.0.0-release/index.qmd │             743 │ ---\ntitle: Ibis v4.0.0\nauthor: Patrick Clarke\ndate: 2023-01-09\ncategories:\n    …  │ [-0.017321649938821793, 0.05772090330719948, ... +1534]   │
│   0.524721 │ docs/posts/zulip/index.qmd                      │             440 │ ---\ntitle: "Announcing Zulip for Ibis community chat"\nauthor: "Ibis team"\ndate:…    │ [0.00033197790617123246, 0.05603647977113724, ... +1534]  │
│   0.514110 │ docs/concepts/user-testimonials.qmd             │            1079 │ # User testimonials\n\nThis page collects user testimonials about Ibis from the c…     │ [-0.013335400260984898, 0.034098923206329346, ... +1534]  │
│   0.510652 │ docs/posts/Ibis-version-3.0.0-release/index.qmd │            2038 │ ---\ntitle: Ibis v3.0.0\nauthor: Marlene Mhangami\ndate: 2022-04-25\ncategories:\n  …  │ [-0.02136301063001156, 0.03890605643391609, ... +1534]    │
│   0.507567 │ docs/posts/ibis-to-file/index.qmd               │             700 │ ---\ntitle: "Ibis sneak peek: writing to files"\nauthor: Kae Suarez\ndate: 2023-03…    │ [0.02339361608028412, 0.07014117389917374, ... +1534]     │
│   0.505812 │ docs/posts/why-voda-supports-                   │            3056 │ ---\ntitle: "Why Voltron Data supports Ibis"\nauthor: "Cody Peterson + Ian Cook"\n…    │ [-0.0027232086285948753, 0.0460839718580246, ... +1534]   │
│   0.501638 │ docs/getting-started.qmd                        │             157 │ ---\ntitle: "Getting started"\nsidebar: getting-started\n---\n\nWelcome to the Ibis …  │ [-0.035996999591588974, 0.03943954035639763, ... +1534]   │
│          … │ …                                               │               … │ …                                                                                      │ …                                                         │
└────────────┴─────────────────────────────────────────────────┴─────────────────┴────────────────────────────────────────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────┘

RAG pitfalls

As in traditional software engineering and machine learning, the architecture when building systems around language models is not a universal. The best depends on the application of the system. Using cosine (or otherwise) similarity search over text transformed to numbers may not always perform best.

We can demonstrate this with the following scenario: let’s assume we’re building a chatbot that writes and runs SQL on the user’s behalf based on their questions posed in English (or any language the language model understand sufficiently). We might store past queries, the SQL generated, and some evaluation criteria (upvote/downvote on the generated SQL from the user) in a database. Then we can use the same RAG approach above to augment our queries with the user’s past queries and the generated SQL.

Take the following English query:

a = """the first 10 rows of the penguins, ordered by body_mass_g from
lightest to heaviest"""

We can compute the cosine similarity of this query against itself, resulting in 1.0 or close to it:

list_cosine_similarity(_embed(a), _embed(a))

0.9999994546885027

Now, let’s flip the ordering. This would result in an entirely different SQL query and result set:

b = """the first 10 rows of the penguins, ordered by body_mass_g from
heaviest to lightest"""

And compute the cosine similarity of a and b:

list_cosine_similarity(_embed(a), _embed(b))

0.9934337761234602

Similarly, we can construct a semantically equivalent query that would result in the same SQL as example a:

c = """of the birds retrieve the initial ten opposite-of-columns,
sorted from biggest to smallest by weight"""

That has a much lower cosine similarity in the embedding space:

list_cosine_similarity(_embed(a), _embed(c))

0.5682471739636249

Despite semantic equivalence, these two queries would not be considered similar. You need to be careful and understand what the system is doing.

Discussion

Is traditional RAG the best approach? It’s unclear, and it depends.

System architecture

Often RAG-based systems introduce yet-another-database to store the embeddings. Hopefully this post has demonstrated that traditional database options can work just fine. While we’re only demonstrating this on a few rows here, DuckDB can scale up to many millions with ease.

Cost and speed

The cost of running this demo from OpenAI is less than $0.01 USD. It also runs in about a minute, but with many rows this would be much slower. You need to consider your architecture here. One option would be using an open source embedding model that can run locally. Swap out the OpenAI calls with that and see how it goes!

Alternatives to vector search RAG

It’s clear that we should always be augmenting our language model calls with the most relevant context, even if traditional vector search RAG is not the best solution. The problem then becomes the best way to retrieve that context for a given system’s desired behavior.

Abolish RAG as terminology?

“I would like to abolish the term RAG and instead just agree that we should always try to provide models with the appropriate context to provide high quality answers.” - Hamel Husain

Some alternatives to vector search RAG include:

traditional search methods
using an external web search engine
using a language model for search

The last point is particularly interesting – in the above examples with a and c strings, cosine similarity in the embedding space could not tell the queries are semantically equivalent. However, a sufficiently good language model certainly could. As we demonstrated in language models for data, we could even setup a classifier and use the logit bias trick to compute similarity with a single output token.

This trick makes the language model faster, but for practical applications the cost and speed of language models for search is still a concern. We can expect this concern to be mitigated over time as language models become faster and cheaper to run.

With larger context lengths, we could also use an initial call to a language model to extract out relevant sections of text and then make a second call with that augmented context.

Next steps

Try this out yourself! All you need is an OpenAI account and the code above.

It’s never been a better time to get involved with Ibis. Join us on Zulip and introduce yourself! More Ibis + language model content coming soon.