Complex Filtering¶
The filtering examples we've shown to this point have been pretty simple, either comparisons between columns or fixed values, or set filter functions like isin
and notin
.
Ibis supports a number of richer analytical filters that can involve one or more of:
- Aggregates computed from the same or other tables
- Conditional aggregates (in SQL-speak these are similar to "correlated subqueries")
- "Existence" set filters (equivalent to the SQL
EXISTS
andNOT EXISTS
keywords)
Setup¶
!curl -LsS -o $TEMPDIR/geography.db 'https://storage.googleapis.com/ibis-tutorial-data/geography.db'
import os
import tempfile
import ibis
ibis.options.interactive = True
connection = ibis.sqlite.connect(
os.path.join(tempfile.gettempdir(), 'geography.db')
)
/nix/store/15c5ssh54syvbiv9waav5gcb2r8n0800-python3-3.10.5-env/lib/python3.10/site-packages/pyproj/__init__.py:91: UserWarning: Valid PROJ data directory not found. Either set the path using the environmental variable PROJ_LIB or with `pyproj.datadir.set_data_dir`. warnings.warn(str(err))
Using scalar aggregates in filters¶
countries = connection.table('countries')
countries.limit(5)
iso_alpha2 | iso_alpha3 | iso_numeric | fips | name | capital | area_km2 | population | continent | |
---|---|---|---|---|---|---|---|---|---|
0 | AD | AND | 20 | AN | Andorra | Andorra la Vella | 468.0 | 84000 | EU |
1 | AE | ARE | 784 | AE | United Arab Emirates | Abu Dhabi | 82880.0 | 4975593 | AS |
2 | AF | AFG | 4 | AF | Afghanistan | Kabul | 647500.0 | 29121286 | AS |
3 | AG | ATG | 28 | AC | Antigua and Barbuda | St. Johns | 443.0 | 86754 | NA |
4 | AI | AIA | 660 | AV | Anguilla | The Valley | 102.0 | 13254 | NA |
We could always compute some aggregate value from the table and use that in another expression, or we can use a data-derived aggregate in the filter. Take the average of a column. For example the average of countries size:
countries.area_km2.mean()
595287.8555555556
You can use this expression as a substitute for a scalar value in a filter, and the execution engine will combine everything into a single query rather than having to access the database multiple times. For example, we want to filter European countries larger than the average country size in the world. See how most countries in Europe are smaller than the world average:
cond = countries.area_km2 > countries.area_km2.mean()
expr = countries[(countries.continent == 'EU') & cond]
expr
iso_alpha2 | iso_alpha3 | iso_numeric | fips | name | capital | area_km2 | population | continent | |
---|---|---|---|---|---|---|---|---|---|
0 | RU | RUS | 643 | RS | Russia | Moscow | 17100000.0 | 140702000 | EU |
1 | UA | UKR | 804 | UP | Ukraine | Kyiv | 603700.0 | 45415596 | EU |
Conditional aggregates¶
Suppose that we wish to filter using an aggregate computed conditional on some other expressions holding true.
For example, we want to filter European countries larger than the average country size, but this time of the average in Africa. African countries have an smaller size compared to the world average, and France gets into the list:
conditional_avg = countries[countries.continent == 'AF'].area_km2.mean()
countries[
(countries.continent == 'EU') & (countries.area_km2 > conditional_avg)
]
iso_alpha2 | iso_alpha3 | iso_numeric | fips | name | capital | area_km2 | population | continent | |
---|---|---|---|---|---|---|---|---|---|
0 | FR | FRA | 250 | FR | France | Paris | 547030.0 | 64768389 | EU |
1 | RU | RUS | 643 | RS | Russia | Moscow | 17100000.0 | 140702000 | EU |
2 | UA | UKR | 804 | UP | Ukraine | Kyiv | 603700.0 | 45415596 | EU |
"Existence" filters¶
Some filtering involves checking for the existence of a particular value in a column of another table, or amount the results of some value expression. This is common in many-to-many relationships, and can be performed in numerous different ways, but it's nice to be able to express it with a single concise statement and let Ibis compute it optimally.
An example could be finding all countries that had any year with a higher GDP than 3 trillion US dollars:
gdp = connection.table('gdp')
gdp
country_code | year | value | |
---|---|---|---|
0 | ABW | 1986 | 4.054634e+08 |
1 | ABW | 1987 | 4.876025e+08 |
2 | ABW | 1988 | 5.964236e+08 |
3 | ABW | 1989 | 6.953044e+08 |
4 | ABW | 1990 | 7.648871e+08 |
... | ... | ... | ... |
9995 | SVK | 2002 | 3.513034e+10 |
9996 | SVK | 2003 | 4.681659e+10 |
9997 | SVK | 2004 | 5.733202e+10 |
9998 | SVK | 2005 | 6.278531e+10 |
9999 | SVK | 2006 | 7.070810e+10 |
10000 rows × 3 columns
cond = ((gdp.country_code == countries.iso_alpha3) & (gdp.value > 3e12)).any()
countries[cond]['name']
name | |
---|---|
0 | China |
1 | Germany |
2 | United Kingdom |
3 | Japan |
4 | United States |
Note how this is different than a join between countries
and gdp
, which would return one row per year. The method .any()
is equivalent to filtering with a subquery.
Filtering in aggregations¶
Suppose that you want to compute an aggregation with a subset of the data for only one of the metrics / aggregates in question, and the complete data set with the other aggregates. Most aggregation functions are thus equipped with a where
argument. Let me show it to you in action:
arctic = countries.name.isin(
[
'United States',
'Canada',
'Finland',
'Greenland',
'Iceland',
'Norway',
'Russia',
'Sweden',
]
)
metrics = [
countries.count().name('# countries'),
countries.population.sum().name('total population'),
countries.population.sum(where=arctic).name('population arctic countries'),
]
(countries.groupby(countries.continent).aggregate(metrics))
continent | # countries | total population | population arctic countries | |
---|---|---|---|---|
0 | AF | 58 | 1021238685 | NaN |
1 | AN | 5 | 170 | NaN |
2 | AS | 51 | 4130584841 | NaN |
3 | EU | 54 | 750724554 | 161092715.0 |
4 | NA | 42 | 540204371 | 343968238.0 |
5 | OC | 28 | 36067549 | NaN |
6 | SA | 14 | 400143568 | NaN |