Varchar in a haystack

blog
data analysis
puzzle
Author

Tyler White

Published

April 12, 2024

The scenario

You’re a data analyst, and a new ticket landed in your queue.

Subject: Urgent: Data Discovery Needed for Critical Analysis

Hi Data Team,

I hope this message finds you well. I’m reaching out with an urgent request that directly impacts the company’s most critical project. We need to locate a specific value within our database but do not know which column it’s in. Unfortunately, we don’t have documentation for this particular table. We are looking for the value “NEEDLE” in the table.

We think it is in the X database, Y schema, and Z table. We appreciate your help with this urgent matter!

Whelp, let’s give this a try.

The table

To set up this particular problem, we can use pandas to create a table with 5 columns and 100 rows. We can use the at method to update a row with the value “NEEDLE” to simulate what we need to find.

Code
import pandas as pd
import random
import string
from ibis.interactive import *


def random_string(length=10):
    return "".join(
        random.choice(string.ascii_letters + string.digits) for _ in range(length)
    )


data = [[random_string() for _ in range(5)] for _ in range(100)]
column_names = [f"col{i+1}" for i in range(5)]
df = pd.DataFrame(data, columns=column_names)
df.at[42, 'col4'] = "NEEDLE"
t = ibis.memtable(df, name="Z")
t
┏━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ col1        col2        col3        col4        col5       ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ stringstringstringstringstring     │
├────────────┼────────────┼────────────┼────────────┼────────────┤
│ BYjQTOia1fpik8UBUQ06JUhXgdJGCn46O1svpUSWhr0dmVIOkU │
│ e973mLNHa1eYWTWYGQiKqfiMKFjwIuKvtiH1nDL30MXkme4lzG │
│ ZeyJBlraO89mk8NSwZBHcHndFLt80MurHnrgqtfAkbVxhOiSD0 │
│ hRodaFvAESEdTEcamHEVEUZp5YpgNS46R7bYiiqaDIV8BWncTc │
│ Rm8IdoOhlhLrrnf46MJdlNVB18l0C7fLTaFvH1WeORIdNoFqbB │
│ wUZRdFYUMooWovrqqovmloW7UhXQVEQ5dXxPfaDAd11VR54kbJ │
│ jKNEfRoDXKvOlfswJ9HkpFDTRA6tEfAqmI3RyjJWUawVq7XqaB │
│ h1RqXsGL5cvypyFxZWoPzzMYrytrb60bbc0pDxQincRR7A10sT │
│ CHCKfLzLV0eVv8jtfcV0kKrVpaLs8iHeUfW0VABs5XyaKrpYk5 │
│ 1FVJbxQUQts9p2LiMYf9two9V1WZpaY70W0uPLdkgflm7xvysr │
│           │
└────────────┴────────────┴────────────┴────────────┴────────────┘

The solution(s)

There are a few ways we could solve this.

Option 1: write SQL

We could always spell it out with SQL, including each column that we want to check in the WHERE clause. In this scenario, we know each column is a varchar, so we can check each one for the value “NEEDLE”.

SELECT *
FROM Z
WHERE col1 = 'NEEDLE'
   OR col2 = 'NEEDLE'
   OR col3 = 'NEEDLE'
   OR col4 = 'NEEDLE'
   OR col5 = 'NEEDLE';

This can be time-consuming. You might want something a little more dynamic.

Option 2: write dynamic SQL

Dynamically constructing the SQL query at runtime can be more complex, but it offers more flexibility, especially if we have more than five columns.

DO $$
DECLARE
    sql text;
    where_clause text := '';
BEGIN
    SELECT INTO where_clause
           string_agg(quote_ident(column_name) || ' = ''NEEDLE''', ' OR ')
    FROM information_schema.columns
    WHERE table_name = 'Z'
        AND table_schema = 'public'
        AND data_type IN ('character varying', 'varchar', 'text', 'char');

    sql := 'SELECT *
            FROM Z
            WHERE ' || where_clause;

    EXECUTE sql;
END $$;

This can be difficult to troubleshoot, and it is easy to get lost in the quote characters.

Option 3: use Ibis

We can make use of selectors!

expr = t.filter(s.if_any(s.of_type("string"), _ == "NEEDLE"))

expr
┏━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┓
┃ col1        col2        col3        col4    col5       ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━┩
│ stringstringstringstringstring     │
├────────────┼────────────┼────────────┼────────┼────────────┤
│ oB66pvwIS3evpFlTLzgMOqiRxZW7ExNEEDLEVBxPEgTB8j │
└────────────┴────────────┴────────────┴────────┴────────────┘

We can see the NEEDLE value hiding in col4.

The explanation

s.of_type("string") was used to select string columns, then s.if_any() builds up the ORs. The _ == "NEEDLE" part is the condition itself, checking each column for the value.

Here’s the SQL that was generated to help us find it, which is quite similar to what we would have had to write if we had gone with Option 1.

SELECT
  "t0"."col1",
  "t0"."col2",
  "t0"."col3",
  "t0"."col4",
  "t0"."col5"
FROM "Z" AS "t0"
WHERE
  (
    (
      (
        (
          "t0"."col1" = 'NEEDLE'
        ) OR (
          "t0"."col2" = 'NEEDLE'
        )
      )
      OR (
        "t0"."col3" = 'NEEDLE'
      )
    )
    OR (
      "t0"."col4" = 'NEEDLE'
    )
  )
  OR (
    "t0"."col5" = 'NEEDLE'
  )

The conclusion

Now that we’ve found the "NEEDLE" value, we can provide the information to the requester. Urgent requests like this require quick and precise responses.

Our use of Ibis demonstrates how easy it is to simplify navigating large datasets, and in this case, undocumented ones.

Please get in touch with us on GitHub or Zulip. We’d love to hear from you!

Back to top