Read parquet files with Ibis

In this example, we will use Ibis’s DuckDB backend to analyze data from a remote parquet source using ibis.read_parquet. ibis.read_parquet can also read local parquet files, and there are other ibis.read_* functions that conveniently return a table expression from a file. One such function is ibis.read_csv, which reads from local and remote CSV.

We will be reading from the Global Biodiversity Information Facility (GBIF) Species Occurrences dataset. It is hosted on S3 at s3://gbif-open-data-us-east-1/occurrence/

Reading One Partition

We can read a single partition by specifying its path.

We do this by calling read_parquet on the partition we care about.

So to read the first partition in this dataset, we’ll call read_parquet on 00000 in that path:

import ibis

t = ibis.read_parquet(
    "s3://gbif-open-data-us-east-1/occurrence/2023-04-01/occurrence.parquet/000000"
)
t

DatabaseTable: ibis_read_parquet_7lodnv6b4nhhbml2qitcmxicne
  gbifid                           string
  datasetkey                       string
  occurrenceid                     string
  kingdom                          string
  phylum                           string
  class                            string
  order                            string
  family                           string
  genus                            string
  species                          string
  infraspecificepithet             string
  taxonrank                        string
  scientificname                   string
  verbatimscientificname           string
  verbatimscientificnameauthorship string
  countrycode                      string
  locality                         string
  stateprovince                    string
  occurrencestatus                 string
  individualcount                  int32
  publishingorgkey                 string
  decimallatitude                  float64
  decimallongitude                 float64
  coordinateuncertaintyinmeters    float64
  coordinateprecision              float64
  elevation                        float64
  elevationaccuracy                float64
  depth                            float64
  depthaccuracy                    float64
  eventdate                        timestamp(6)
  day                              int32
  month                            int32
  year                             int32
  taxonkey                         int32
  specieskey                       int32
  basisofrecord                    string
  institutioncode                  string
  collectioncode                   string
  catalognumber                    string
  recordnumber                     string
  identifiedby                     array<string>
  dateidentified                   timestamp(6)
  license                          string
  rightsholder                     string
  recordedby                       array<string>
  typestatus                       array<string>
  establishmentmeans               string
  lastinterpreted                  timestamp(6)
  mediatype                        array<string>
  issue                            array<string>

Note that we’re calling read_parquet and receiving a table expression without establishing a connection first. Ibis spins up a DuckDB connection (or whichever default backend you have) when you call ibis.read_parquet (or even ibis.read_csv).

Since our result, t, is a table expression, we can now run queries against the file using Ibis expressions. For example, we can select columns, filter the file, and then view the first five rows of the result:

cols = [
    "gbifid",
    "datasetkey",
    "occurrenceid",
    "kingdom",
    "phylum",
    "class",
    "order",
    "family",
    "genus",
    "species",
    "day",
    "month",
    "year",
]

t.select(cols).filter(t["family"].isin(["Corvidae"])).limit(5).to_pandas()

	gbifid	datasetkey	occurrenceid	kingdom	phylum	class	order	family	genus	species	day	month	year
0	2490959635	27268afd-83c0-4299-bb5e-7748e037c196	34C74047-FE31-2DF3-E053-2614A8C00F0D	Animalia	Chordata	Aves	Passeriformes	Corvidae	Corvus	Corvus corone	27	5	2016
1	2490955607	27268afd-83c0-4299-bb5e-7748e037c196	34AFC28E-CA3C-6434-E053-2614A8C0DC99	Animalia	Chordata	Aves	Passeriformes	Corvidae	Corvus	Corvus corone	27	5	2016
2	2490957603	27268afd-83c0-4299-bb5e-7748e037c196	34AFC28E-CA32-6434-E053-2614A8C0DC99	Animalia	Chordata	Aves	Passeriformes	Corvidae	Corvus	Corvus corone	27	5	2016
3	2490955602	27268afd-83c0-4299-bb5e-7748e037c196	34AFC28E-C9E1-6434-E053-2614A8C0DC99	Animalia	Chordata	Aves	Passeriformes	Corvidae	Corvus	Corvus corone	27	5	2016
4	2490957602	27268afd-83c0-4299-bb5e-7748e037c196	34AFC28E-C9E8-6434-E053-2614A8C0DC99	Animalia	Chordata	Aves	Passeriformes	Corvidae	Corvus	Corvus corone	27	5	2016

We can count the rows in the table (partition):

t.count().to_pandas()

Reading all partitions: filter, aggregate, export

We can use read_parquet to read an entire parquet file by globbing all partitions:

t = ibis.read_parquet(
    "s3://gbif-open-data-us-east-1/occurrence/2023-04-01/occurrence.parquet/*"
)

Since the function returns a table expression, we can perform valid selections, filters, aggregations, and exports just as we could with any other table expression:

df = (
    t.select(["gbifid", "family", "species"])
    .filter(t["family"].isin(["Corvidae"]))
    # Here we limit by 10,000 to fetch a quick batch of results
    .limit(10000)
    .group_by("species")
    .count()
    .to_pandas()
)
df

	species	CountStar()
0	Corvus splendens	33
1	Pica pica	333
2	Garrulus glandarius	298
3	Corvus albicollis	1337
4	Corvus cornix	314
5	Corvus corone	8
6	Corvus albus	3759
7	Nucifraga caryocatactes	289
8	Coloeus monedula	898
9	Corvus corax	446
10	Corvus capensis	1787
11	Corvus frugilegus	73
12	Perisoreus infaustus	425