Read parquet files with Ibis

In this example, we will use Ibis’s DuckDB backend to analyze data from a remote parquet source using ibis.read_parquet. ibis.read_parquet can also read local parquet files, and there are other ibis.read_* functions that conveniently return a table expression from a file. One such function is ibis.read_csv, which reads from local and remote CSV.

We will be reading from the Global Biodiversity Information Facility (GBIF) Species Occurrences dataset. It is hosted on S3 at s3://gbif-open-data-us-east-1/occurrence/

Reading One Partition

We can read a single partition by specifying its path.

We do this by calling read_parquet on the partition we care about.

So to read the first partition in this dataset, we’ll call read_parquet on 00000 in that path:

import ibis

t = ibis.read_parquet(
    "s3://gbif-open-data-us-east-1/occurrence/2023-04-01/occurrence.parquet/000000"
)
t
DatabaseTable: ibis_read_parquet_kwikgcy4xjdylll2fpa7mzds4u
  gbifid                           string
  datasetkey                       string
  occurrenceid                     string
  kingdom                          string
  phylum                           string
  class                            string
  order                            string
  family                           string
  genus                            string
  species                          string
  infraspecificepithet             string
  taxonrank                        string
  scientificname                   string
  verbatimscientificname           string
  verbatimscientificnameauthorship string
  countrycode                      string
  locality                         string
  stateprovince                    string
  occurrencestatus                 string
  individualcount                  int32
  publishingorgkey                 string
  decimallatitude                  float64
  decimallongitude                 float64
  coordinateuncertaintyinmeters    float64
  coordinateprecision              float64
  elevation                        float64
  elevationaccuracy                float64
  depth                            float64
  depthaccuracy                    float64
  eventdate                        timestamp(6)
  day                              int32
  month                            int32
  year                             int32
  taxonkey                         int32
  specieskey                       int32
  basisofrecord                    string
  institutioncode                  string
  collectioncode                   string
  catalognumber                    string
  recordnumber                     string
  identifiedby                     array<string>
  dateidentified                   timestamp(6)
  license                          string
  rightsholder                     string
  recordedby                       array<string>
  typestatus                       array<string>
  establishmentmeans               string
  lastinterpreted                  timestamp(6)
  mediatype                        array<string>
  issue                            array<string>

Note that we’re calling read_parquet and receiving a table expression without establishing a connection first. Ibis spins up a DuckDB connection (or whichever default backend you have) when you call ibis.read_parquet (or even ibis.read_csv).

Since our result, t, is a table expression, we can now run queries against the file using Ibis expressions. For example, we can select columns, filter the file, and then view the first five rows of the result:

cols = [
    "gbifid",
    "datasetkey",
    "occurrenceid",
    "kingdom",
    "phylum",
    "class",
    "order",
    "family",
    "genus",
    "species",
    "day",
    "month",
    "year",
]

t.select(cols).filter(t["family"].isin(["Corvidae"])).limit(5).to_pandas()
gbifid datasetkey occurrenceid kingdom phylum class order family genus species day month year
0 2490959635 27268afd-83c0-4299-bb5e-7748e037c196 34C74047-FE31-2DF3-E053-2614A8C00F0D Animalia Chordata Aves Passeriformes Corvidae Corvus Corvus corone 27 5 2016
1 2490955607 27268afd-83c0-4299-bb5e-7748e037c196 34AFC28E-CA3C-6434-E053-2614A8C0DC99 Animalia Chordata Aves Passeriformes Corvidae Corvus Corvus corone 27 5 2016
2 2490957603 27268afd-83c0-4299-bb5e-7748e037c196 34AFC28E-CA32-6434-E053-2614A8C0DC99 Animalia Chordata Aves Passeriformes Corvidae Corvus Corvus corone 27 5 2016
3 2490955602 27268afd-83c0-4299-bb5e-7748e037c196 34AFC28E-C9E1-6434-E053-2614A8C0DC99 Animalia Chordata Aves Passeriformes Corvidae Corvus Corvus corone 27 5 2016
4 2490957602 27268afd-83c0-4299-bb5e-7748e037c196 34AFC28E-C9E8-6434-E053-2614A8C0DC99 Animalia Chordata Aves Passeriformes Corvidae Corvus Corvus corone 27 5 2016

We can count the rows in the table (partition):

t.count().to_pandas()
2081435

Reading all partitions: filter, aggregate, export

We can use read_parquet to read an entire parquet file by globbing all partitions:

t = ibis.read_parquet(
    "s3://gbif-open-data-us-east-1/occurrence/2023-04-01/occurrence.parquet/*"
)

Since the function returns a table expression, we can perform valid selections, filters, aggregations, and exports just as we could with any other table expression:

df = (
    t.select(["gbifid", "family", "species"])
    .filter(t["family"].isin(["Corvidae"]))
    # Here we limit by 10,000 to fetch a quick batch of results
    .limit(10000)
    .group_by("species")
    .count()
    .to_pandas()
)
df
species CountStar()
0 Garrulus glandarius 298
1 Corvus splendens 33
2 Perisoreus infaustus 425
3 Corvus cornix 314
4 Corvus corone 8
5 Corvus albus 3759
6 Coloeus monedula 898
7 Corvus albicollis 1337
8 Corvus capensis 1787
9 Corvus corax 446
10 Corvus frugilegus 73
11 Nucifraga caryocatactes 289
12 Pica pica 333
Back to top