Advanced Topics: Additional Filtering¶
The filtering examples we’ve shown to this point have been pretty simple, either comparisons between columns or fixed values, or set filter functions like isin
and notin
.
Ibis supports a number of richer analytical filters that can involve one or more of:
Aggregates computed from the same or other tables
Conditional aggregates (in SQL-speak these are similar to “correlated subqueries”)
“Existence” set filters (equivalent to the SQL
EXISTS
andNOT EXISTS
keywords)
Setup¶
[1]:
import os
import ibis
ibis.options.interactive = True
connection = ibis.sqlite.connect(os.path.join('data', 'geography.db'))
Using scalar aggregates in filters¶
[2]:
countries = connection.table('countries')
countries.limit(5)
[2]:
iso_alpha2 iso_alpha3 iso_numeric fips name \
0 AD AND 20 AN Andorra
1 AE ARE 784 AE United Arab Emirates
2 AF AFG 4 AF Afghanistan
3 AG ATG 28 AC Antigua and Barbuda
4 AI AIA 660 AV Anguilla
capital area_km2 population continent
0 Andorra la Vella 468.0 84000 EU
1 Abu Dhabi 82880.0 4975593 AS
2 Kabul 647500.0 29121286 AS
3 St. Johns 443.0 86754 NA
4 The Valley 102.0 13254 NA
We could always compute some aggregate value from the table and use that in another expression, or we can use a data-derived aggregate in the filter. Take the average of a column. For example the average of countries size:
[3]:
countries.area_km2.mean()
[3]:
595287.8555555556
You can use this expression as a substitute for a scalar value in a filter, and the execution engine will combine everything into a single query rather than having to access the database multiple times. For example, we want to filter European countries larger than the average country size in the world. See how most countries in Europe are smaller than the world average:
[4]:
cond = countries.area_km2 > countries.area_km2.mean()
expr = countries[(countries.continent == 'EU') & cond]
expr
[4]:
iso_alpha2 iso_alpha3 iso_numeric fips name capital area_km2 \
0 RU RUS 643 RS Russia Moscow 17100000.0
1 UA UKR 804 UP Ukraine Kyiv 603700.0
population continent
0 140702000 EU
1 45415596 EU
Conditional aggregates¶
Suppose that we wish to filter using an aggregate computed conditional on some other expressions holding true.
For example, we want to filter European countries larger than the average country size, but this time of the average in Africa. African countries have an smaller size compared to the world average, and France gets into the list:
[5]:
conditional_avg = countries[countries.continent == 'AF'].area_km2.mean()
countries[(countries.continent == 'EU') & (countries.area_km2 > conditional_avg)]
[5]:
iso_alpha2 iso_alpha3 iso_numeric fips name capital area_km2 \
0 FR FRA 250 FR France Paris 547030.0
1 RU RUS 643 RS Russia Moscow 17100000.0
2 UA UKR 804 UP Ukraine Kyiv 603700.0
population continent
0 64768389 EU
1 140702000 EU
2 45415596 EU
“Existence” filters¶
Some filtering involves checking for the existence of a particular value in a column of another table, or amount the results of some value expression. This is common in many-to-many relationships, and can be performed in numerous different ways, but it’s nice to be able to express it with a single concise statement and let Ibis compute it optimally.
An example could be finding all countries that had any year with a higher GDP than 3 trillion US dollars:
[6]:
gdp = connection.table('gdp')
gdp
[6]:
country_code year value
0 ABW 1986 4.054634e+08
1 ABW 1987 4.876025e+08
2 ABW 1988 5.964236e+08
3 ABW 1989 6.953044e+08
4 ABW 1990 7.648871e+08
... ... ... ...
9995 SVK 2002 3.513034e+10
9996 SVK 2003 4.681659e+10
9997 SVK 2004 5.733202e+10
9998 SVK 2005 6.278531e+10
9999 SVK 2006 7.070810e+10
[10000 rows x 3 columns]
[7]:
cond = ((gdp.country_code == countries.iso_alpha3) &
(gdp.value > 3e12)).any()
countries[cond]['name']
[7]:
0 China
1 Germany
2 United Kingdom
3 Japan
4 United States
Name: name, dtype: object
Note how this is different than a join between countries
and gdp
, which would return one row per year. The method .any()
is equivalent to filtering with a subquery.
Filtering in aggregations¶
Suppose that you want to compute an aggregation with a subset of the data for only one of the metrics / aggregates in question, and the complete data set with the other aggregates. Most aggregation functions are thus equipped with a where
argument. Let me show it to you in action:
[8]:
arctic = countries.name.isin(['United States',
'Canada',
'Finland',
'Greenland',
'Iceland',
'Norway',
'Russia',
'Sweden'])
metrics = [countries.count().name('# countries'),
countries.population.sum().name('total population'),
countries.population.sum(where=arctic).name('population arctic countries')]
(countries.groupby(countries.continent)
.aggregate(metrics))
[8]:
continent # countries total population population arctic countries
0 AF 58 1021238685 NaN
1 AN 5 170 NaN
2 AS 51 4130584841 NaN
3 EU 54 750724554 161092715.0
4 NA 42 540204371 343968238.0
5 OC 28 36067549 NaN
6 SA 14 400143568 NaN