Querying every file in every release on the Python Package Index (redux)

Author: Gil Forsyth

Published: November 15, 2023

Seth Larson wrote a great blog post on querying a PyPI dataset to look for trends in the use of memory-safe languages in Python.

Check out Seth’s article for more information on the dataset (and it’s a good read!). It caught our eye because it makes use of DuckDB to clean the data for analysis.

That’s right up our alley here in Ibis land, so let’s see if we can duplicate Seth’s results (and then continue on to plot them!)

Grab the data (locations)

Seth showed (and then safely decomposed) a nested curl statement, and that's always a viable approach – but we're in Python land, so why not grab the filenames using urllib3?

import urllib3

url = "https://raw.githubusercontent.com/pypi-data/data/main/links/dataset.txt"

with urllib3.PoolManager() as http:
    resp = http.request("GET", url)

parquet_files = resp.data.decode().split()
parquet_files
['https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-0.parquet',
 'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-1.parquet',
 'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-10.parquet',
 'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-11.parquet',
 'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-12.parquet',
 'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-13.parquet',
 'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-14.parquet',
 'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-2.parquet',
 'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-3.parquet',
 'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-4.parquet',
 'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-5.parquet',
 'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-6.parquet',
 'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-7.parquet',
 'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-8.parquet',
 'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-9.parquet']

Grab the data

Now we’re ready to get started with Ibis!

DuckDB is clever enough to grab only the parquet metadata. This means we can use read_parquet to create a lazy view of the parquet files and then build up our expression without downloading everything beforehand!

import ibis
from ibis import _

ibis.options.interactive = True
(See https://ibis-project.org/how-to/analytics/chain_expressions.html for docs on the deferred operator, imported above as _!)
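As a quick illustration of the deferred operator, here's a minimal sketch using a throwaway in-memory table (the column name is made up):

t = ibis.memtable({"lines": [1, 100, 10_000]})

t.filter(t.lines > 50)  # explicit table reference
t.filter(_.lines > 50)  # deferred reference – same result, no table name needed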

Create a DuckDB connection:

con = ibis.duckdb.connect()

And load up one of the files (we can run the full query after)!

pypi = con.read_parquet(parquet_files[0], table_name="pypi")
pypi.schema()
ibis.Schema {
  project_name     string
  project_version  string
  project_release  string
  uploaded_on      timestamp
  path             string
  archive_path     string
  size             uint64
  hash             binary
  skip_reason      string
  lines            uint64
  repository       uint32
}

Query crafting

Let's break down what we're looking for. For a high-level view of compiled-language use, Seth treats file extensions as an indicator that a given language appears in a Python project.

The dataset we’re using has every file in every project – what criteria should we use?

We can follow Seth's lead and look for the following (sanity-checked with plain Python just after this list):

  1. A file extension that is one of: asm, c, cc, cpp, cxx, h, hpp, rs, go, and variants of F90, f90, etc… That is, C, C++, Assembly, Rust, Go, and Fortran.
  2. We exclude matches where the file path is within the site-packages/ directory.
  3. We exclude matches that are in directories used for testing.
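Before handing these rules to Ibis, we can check the two regexes against some sample paths (made up for illustration):

import re

ext_re = re.compile(r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$")
test_re = re.compile(r"(^|/)test(|s|ing)")

for path in [
    "pkg-1.0/src/fast.c",    # keep: C source
    "pkg-1.0/tests/fast.c",  # drop: lives in a test directory
    "pkg-1.0/solver.f90",    # keep: Fortran
    "pkg-1.0/README.md",     # drop: not a compiled-language extension
]:
    keep = bool(ext_re.search(path)) and not test_re.search(path)
    print(path, "->", "keep" if keep else "drop")

Now the same criteria expressed as an Ibis filter: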
expr = pypi.filter(
    [
        _.path.re_search(r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$"),
        ~_.path.re_search(r"(^|/)test(|s|ing)"),
        ~_.path.contains("/site-packages/"),
    ]
)
expr
(table preview: the first matches are .c and .h files from the zopyx.txng3.ext 3.3.2 release – e.g. zopyx/txng3/ext/support.c and the bundled libstemmer_c sources – along with their sizes, hashes, and line counts)

That could be right – we can peek at the filename at the end of the path column to do a quick check:

expr.path.split("/")[-1]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ArrayIndex(StringSplit(path, '/'), -1) ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ string                                 │
├────────────────────────────────────────┤
│ support.c                              │
│ stemmer.c                              │
│ stem_UTF_8_turkish.h                   │
│ stem_UTF_8_turkish.c                   │
│ stem_UTF_8_swedish.h                   │
│ stem_UTF_8_swedish.c                   │
│ stem_UTF_8_spanish.h                   │
│ stem_UTF_8_spanish.c                   │
│ stem_UTF_8_russian.h                   │
│ stem_UTF_8_russian.c                   │
│                                       │
└────────────────────────────────────────┘

Ok! Next up, we want to group the matches by:

  1. The month that the package / file was published. For this, we can use the truncate method and ask for month as our truncation window (see the quick example below).
  2. The file extension of the file used.
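To see what truncate does on its own, here's a quick scratch check (a hypothetical timestamp, just for illustration):

ibis.timestamp("2023-11-15 12:34:56").truncate("M")
# -> 2023-11-01 00:00:00, the month bucket the timestamp falls into

Back in the real query: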
expr.group_by(
    month=_.uploaded_on.truncate("M"),
    ext=_.path.re_extract(r"\.([a-z0-9]+)$", 1),
).aggregate()
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ month                ext    ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ timestampstring │
├─────────────────────┼────────┤
│ 2015-12-01 00:00:00cpp    │
│ 2015-11-01 00:00:00h      │
│ 2015-12-01 00:00:00f90    │
│ 2015-11-01 00:00:00hpp    │
│ 2010-11-01 00:00:00c      │
│ 2010-07-01 00:00:00h      │
│ 2010-12-01 00:00:00cpp    │
│ 2010-03-01 00:00:00h      │
│ 2011-08-01 00:00:00cpp    │
│ 2010-05-01 00:00:00h      │
│       │
└─────────────────────┴────────┘

That looks promising. Now we need to grab the package names that correspond to a given file extension in a given month and deduplicate them. To match Seth's results, we'll also sort by the month in descending order:

expr = (
    expr.group_by(
        month=_.uploaded_on.truncate("M"),
        ext=_.path.re_extract(r"\.([a-z0-9]+)$", 1),
    )
    .aggregate(projects=_.project_name.collect().unique())
    .order_by(_.month.desc())
)

expr
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ month                ext     projects                                            ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ timestampstringarray<string>                                       │
├─────────────────────┼────────┼─────────────────────────────────────────────────────┤
│ 2017-07-01 00:00:00c     ['newrelic', 'nuclitrack', ... +262]                │
│ 2017-07-01 00:00:00asm   ['pwntools', 'fibers', ... +6]                      │
│ 2017-07-01 00:00:00rs    ['rust-pypi-example', 'tokio', ... +2]              │
│ 2017-07-01 00:00:00f     ['okada-wrapper', 'numpy', ... +6]                  │
│ 2017-07-01 00:00:00cpp   ['pipcudemo', 'pyDEM', ... +108]                    │
│ 2017-07-01 00:00:00f90   ['pySpecData', 'numpy', ... +8]                     │
│ 2017-07-01 00:00:00cxx   ['pytetgen', 'python-libsbml-experimental', ... +8] │
│ 2017-07-01 00:00:00go    ['pre-commit', 'django-instant', ... +5]            │
│ 2017-07-01 00:00:00cc    ['nixio', 'pogeo', ... +14]                         │
│ 2017-07-01 00:00:00h     ['numba', 'p4d', ... +222]                          │
│                                                    │
└─────────────────────┴────────┴─────────────────────────────────────────────────────┘

Massage and plot

Let’s continue and see what our results look like.

We’ll do a few things:

  1. Combine all of the C and C++ extensions into a single group by renaming them all (see the scratch check below).
  2. Count the number of distinct entries in each group.
  3. Plot the results!
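Before applying the renames to the whole table, here's how the string helpers behave on single (illustrative) values:

ibis.literal("cxx").re_replace(r"cxx|cpp|cc|c|hpp|h", "C/C++")  # -> 'C/C++'
ibis.literal("f90").re_replace("^f.*$", "Fortran")  # -> 'Fortran'
ibis.literal("").nullif("")  # -> NULL, which dropna can then remove

Now for the full expression: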
collapse_names = expr.mutate(
    ext=_.ext.re_replace(r"cxx|cpp|cc|c|hpp|h", "C/C++")
    .re_replace("^f.*$", "Fortran")
    .replace("rs", "Rust")
    .replace("go", "Go")
    .replace("asm", "Assembly")
    .nullif(""),
).dropna("ext")

collapse_names
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ month                ext       projects                                            ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ timestampstringarray<string>                                       │
├─────────────────────┼──────────┼─────────────────────────────────────────────────────┤
│ 2017-07-01 00:00:00C/C++   ['pipcudemo', 'pyDEM', ... +108]                    │
│ 2017-07-01 00:00:00Fortran ['numpy', 'pySpecData', ... +8]                     │
│ 2017-07-01 00:00:00Go      ['pre-commit', 'ronin', ... +5]                     │
│ 2017-07-01 00:00:00C/C++   ['pytetgen', 'python-libsbml-experimental', ... +8] │
│ 2017-07-01 00:00:00C/C++   ['newrelic', 'nuclitrack', ... +262]                │
│ 2017-07-01 00:00:00Fortran ['okada-wrapper', 'numpy', ... +6]                  │
│ 2017-07-01 00:00:00Assembly['pwntools', 'xmldirector.plonecore', ... +6]       │
│ 2017-07-01 00:00:00Rust    ['rust-pypi-example', 'tokio', ... +2]              │
│ 2017-07-01 00:00:00C/C++   ['numba', 'numpythia', ... +222]                    │
│ 2017-07-01 00:00:00C/C++   ['pyemd', 'pogeo', ... +19]                         │
│                                                    │
└─────────────────────┴──────────┴─────────────────────────────────────────────────────┘

Note that now we need to de-duplicate again, since we might've had separate unique entries for both an h and a c file extension, and we don't want to double-count!

We could rewrite our original query and include the renames in the original group_by (this would be the smart thing to do), but let’s push on and see if we can make this work.

The projects column is now a column of string arrays, so we want to collect all of the arrays in each group. This will give us a “list of lists”; then we'll flatten that list and call unique().length() as before.
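In pure Python terms, that flattening step looks something like this (illustrative only):

from itertools import chain

nested = [["numpy", "pandas"], ["numpy", "scipy"]]  # a "list of lists" of project names
flat = list(chain.from_iterable(nested))  # ['numpy', 'pandas', 'numpy', 'scipy']
len(set(flat))  # 3 distinct projects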

DuckDB has a flatten function, but it isn’t exposed in Ibis (yet!).

We'll use a handy bit of Ibis magic to define a builtin UDF that maps directly onto the underlying DuckDB function (what!? see the Ibis UDF docs for more info):

@ibis.udf.scalar.builtin
def flatten(x: list[list[str]]) -> list[str]:
    ...  # no body needed – the signature is mapped straight onto DuckDB's flatten


collapse_names = collapse_names.group_by(["month", "ext"]).aggregate(
    projects=flatten(_.projects.collect())
)

collapse_names
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ month                ext       projects                                    ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ timestampstringarray<string>                               │
├─────────────────────┼──────────┼─────────────────────────────────────────────┤
│ 2009-07-01 00:00:00C/C++   ['gevent', 'hashlib', ... +52]              │
│ 2009-07-01 00:00:00Assembly['pycryptopp']                              │
│ 2008-10-01 00:00:00Fortran ['numscons']                                │
│ 2008-08-01 00:00:00Fortran ['numscons']                                │
│ 2008-06-01 00:00:00C/C++   ['dm.incrementalsearch', 'Cython', ... +45] │
│ 2008-05-01 00:00:00Fortran ['numscons']                                │
│ 2007-03-01 00:00:00Fortran ['Model-Builder']                           │
│ 2005-05-01 00:00:00C/C++   ['ll-xist', 'll-xist']                      │
│ 2005-03-01 00:00:00C/C++   ['pygenx', 'pygenx']                        │
│ 2011-08-01 00:00:00Fortran ['pysces', 'ffnet']                         │
│                                            │
└─────────────────────┴──────────┴─────────────────────────────────────────────┘

We could have included the unique().length() in the aggregate call, but sometimes it’s good to check that your slightly off-kilter idea has worked (and it has!).
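Had we folded the count into the aggregate above, it would have read something like:

collapse_names.group_by(["month", "ext"]).aggregate(
    project_count=flatten(_.projects.collect()).unique().length()
)

(which is exactly what the full query at the end of this post does). Pushing on with the two-step version: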

collapse_names = collapse_names.select(
    _.month, _.ext, project_count=_.projects.unique().length()
)

collapse_names
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ month                ext      project_count ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ timestampstringint64         │
├─────────────────────┼─────────┼───────────────┤
│ 2007-03-01 00:00:00C/C++  6 │
│ 2006-01-01 00:00:00C/C++  5 │
│ 2005-10-01 00:00:00C/C++  2 │
│ 2011-08-01 00:00:00C/C++  57 │
│ 2011-03-01 00:00:00C/C++  63 │
│ 2011-01-01 00:00:00Fortran2 │
│ 2010-12-01 00:00:00C/C++  48 │
│ 2010-08-01 00:00:00Fortran1 │
│ 2010-07-01 00:00:00Fortran3 │
│ 2010-03-01 00:00:00Fortran1 │
│  │
└─────────────────────┴─────────┴───────────────┘

Now that the data are tidied, we can pass our expression directly to Altair and see what it looks like!

import altair as alt

chart = (
    alt.Chart(collapse_names.to_pandas())
    .mark_line()
    .encode(x="month", y="project_count", color="ext")
    .properties(width=600, height=300)
)
chart

That looks good, but it definitely doesn’t match the plot from Seth’s post:

[figure: upstream plot from Seth's post]

Our current plot is only showing the results from a subset of the available data. Now that our expression is complete, we can re-run on the full dataset and compare.

The full run

To recap – we pulled a lazy view of a single parquet file from the pypi-data repo, filtered for all the files that contain file extensions we care about, then grouped them all together to get counts of the various filetypes used across projects by month.

Here’s the entire query chained together into a single command, now running on all of the parquet files we have access to:

pypi = con.read_parquet(parquet_files, table_name="pypi")

full_query = (
    pypi.filter(
        [
            _.path.re_search(
                r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$"
            ),
            ~_.path.re_search(r"(^|/)test(|s|ing)"),
            ~_.path.contains("/site-packages/"),
        ]
    )
    .group_by(
        month=_.uploaded_on.truncate("M"),
        ext=_.path.re_extract(r"\.([a-z0-9]+)$", 1),
    )
    .aggregate(projects=_.project_name.collect().unique())
    .order_by(_.month.desc())
    .mutate(
        ext=_.ext.re_replace(r"cxx|cpp|cc|c|hpp|h", "C/C++")
        .re_replace("^f.*$", "Fortran")
        .replace("rs", "Rust")
        .replace("go", "Go")
        .replace("asm", "Assembly")
        .nullif(""),
    )
    .dropna("ext")
    .group_by(["month", "ext"])
    .aggregate(project_count=flatten(_.projects.collect()).unique().length())
)
chart = (
    alt.Chart(full_query.to_pandas())
    .mark_line()
    .encode(x="month", y="project_count", color="ext")
    .properties(width=600, height=300)
)
chart
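If you want to keep the chart around, Altair can save it to disk (the filename below is just an example; PNG or SVG output needs an extra dependency such as vl-convert):

chart.save("pypi-compiled-languages.html")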