In this example, we will use Ibis’s DuckDB backend to analyze data from a remote parquet source using ibis.read_parquet. ibis.read_parquet can also read local parquet files, and there are other ibis.read_* functions that conveniently return a table expression from a file. One such function is ibis.read_csv, which reads from local and remote CSV.
Note that we’re calling read_parquet and receiving a table expression without establishing a connection first. Ibis spins up a DuckDB connection (or whichever default backend you have) when you call ibis.read_parquet (or even ibis.read_csv).
Since our result, t, is a table expression, we can now run queries against the file using Ibis expressions. For example, we can select columns, filter the file, and then view the first five rows of the result:
We can use read_parquet to read an entire parquet file by globbing all partitions:
t = ibis.read_parquet("s3://gbif-open-data-us-east-1/occurrence/2023-04-01/occurrence.parquet/*")
Since the function returns a table expression, we can perform valid selections, filters, aggregations, and exports just as we could with any other table expression:
df = ( t.select(["gbifid", "family", "species"]) .filter(t["family"].isin(["Corvidae"]))# Here we limit by 10,000 to fetch a quick batch of results .limit(10000) .group_by("species") .count() .to_pandas())df