Advanced Topics: Additional Filtering

The filtering examples we’ve shown to this point have been pretty simple, either comparisons between columns or fixed values, or set filter functions like isin and notin.

Ibis supports a number of richer analytical filters that can involve one or more of:

  • Aggregates computed from the same or other tables

  • Conditional aggregates (in SQL-speak these are similar to “correlated subqueries”)

  • “Existence” set filters (equivalent to the SQL EXISTS and NOT EXISTS keywords)

Setup

[1]:
import os
import ibis

ibis.options.interactive = True

connection = ibis.sqlite.connect(os.path.join('data', 'geography.db'))

Using scalar aggregates in filters

[2]:
countries = connection.table('countries')
countries.limit(5)
[2]:
  iso_alpha2 iso_alpha3  iso_numeric fips                  name  \
0         AD        AND           20   AN               Andorra
1         AE        ARE          784   AE  United Arab Emirates
2         AF        AFG            4   AF           Afghanistan
3         AG        ATG           28   AC   Antigua and Barbuda
4         AI        AIA          660   AV              Anguilla

            capital  area_km2  population continent
0  Andorra la Vella     468.0       84000        EU
1         Abu Dhabi   82880.0     4975593        AS
2             Kabul  647500.0    29121286        AS
3         St. Johns     443.0       86754        NA
4        The Valley     102.0       13254        NA

We could always compute some aggregate value from the table and use that in another expression, or we can use a data-derived aggregate in the filter. Take the average of a column. For example the average of countries size:

[3]:
countries.area_km2.mean()
[3]:
595287.8555555556

You can use this expression as a substitute for a scalar value in a filter, and the execution engine will combine everything into a single query rather than having to access the database multiple times. For example, we want to filter European countries larger than the average country size in the world. See how most countries in Europe are smaller than the world average:

[4]:
cond = countries.area_km2 > countries.area_km2.mean()
expr = countries[(countries.continent == 'EU') & cond]
expr
[4]:
  iso_alpha2 iso_alpha3  iso_numeric fips     name capital    area_km2  \
0         RU        RUS          643   RS   Russia  Moscow  17100000.0
1         UA        UKR          804   UP  Ukraine    Kyiv    603700.0

   population continent
0   140702000        EU
1    45415596        EU

Conditional aggregates

Suppose that we wish to filter using an aggregate computed conditional on some other expressions holding true.

For example, we want to filter European countries larger than the average country size, but this time of the average in Africa. African countries have an smaller size compared to the world average, and France gets into the list:

[5]:
conditional_avg = countries[countries.continent == 'AF'].area_km2.mean()
countries[(countries.continent == 'EU') & (countries.area_km2 > conditional_avg)]
[5]:
  iso_alpha2 iso_alpha3  iso_numeric fips     name capital    area_km2  \
0         FR        FRA          250   FR   France   Paris    547030.0
1         RU        RUS          643   RS   Russia  Moscow  17100000.0
2         UA        UKR          804   UP  Ukraine    Kyiv    603700.0

   population continent
0    64768389        EU
1   140702000        EU
2    45415596        EU

“Existence” filters

Some filtering involves checking for the existence of a particular value in a column of another table, or amount the results of some value expression. This is common in many-to-many relationships, and can be performed in numerous different ways, but it’s nice to be able to express it with a single concise statement and let Ibis compute it optimally.

An example could be finding all countries that had any year with a higher GDP than 3 trillion US dollars:

[6]:
gdp = connection.table('gdp')
gdp
[6]:
     country_code  year         value
0             ABW  1986  4.054634e+08
1             ABW  1987  4.876025e+08
2             ABW  1988  5.964236e+08
3             ABW  1989  6.953044e+08
4             ABW  1990  7.648871e+08
...           ...   ...           ...
9995          SVK  2002  3.513034e+10
9996          SVK  2003  4.681659e+10
9997          SVK  2004  5.733202e+10
9998          SVK  2005  6.278531e+10
9999          SVK  2006  7.070810e+10

[10000 rows x 3 columns]
[7]:
cond = ((gdp.country_code == countries.iso_alpha3) &
        (gdp.value > 3e12)).any()

countries[cond]['name']
[7]:
0             China
1           Germany
2    United Kingdom
3             Japan
4     United States
Name: name, dtype: object

Note how this is different than a join between countries and gdp, which would return one row per year. The method .any() is equivalent to filtering with a subquery.

Filtering in aggregations

Suppose that you want to compute an aggregation with a subset of the data for only one of the metrics / aggregates in question, and the complete data set with the other aggregates. Most aggregation functions are thus equipped with a where argument. Let me show it to you in action:

[8]:
arctic = countries.name.isin(['United States',
                              'Canada',
                              'Finland',
                              'Greenland',
                              'Iceland',
                              'Norway',
                              'Russia',
                              'Sweden'])

metrics = [countries.count().name('# countries'),
           countries.population.sum().name('total population'),
           countries.population.sum(where=arctic).name('population arctic countries')]

(countries.groupby(countries.continent)
          .aggregate(metrics))
[8]:
  continent  # countries  total population  population arctic countries
0        AF           58        1021238685                          NaN
1        AN            5               170                          NaN
2        AS           51        4130584841                          NaN
3        EU           54         750724554                  161092715.0
4        NA           42         540204371                  343968238.0
5        OC           28          36067549                          NaN
6        SA           14         400143568                          NaN