Aggregating and joining data¶
This is the second introductory tutorial to Ibis. If you are new to Ibis, you may want to start by the first tutorial, 01-Introduction-to-Ibis.
In the first tutorial, we saw how to operate on the data of a table. We will work again with
the countries
table as we did previously.
!curl -LsS -o $TEMPDIR/geography.db 'https://storage.googleapis.com/ibis-tutorial-data/geography.db'
import os
import tempfile
import ibis
ibis.options.interactive = True
connection = ibis.sqlite.connect(
os.path.join(tempfile.gettempdir(), 'geography.db')
)
countries = connection.table('countries')
countries['name', 'continent', 'area_km2', 'population']
/nix/store/15c5ssh54syvbiv9waav5gcb2r8n0800-python3-3.10.5-env/lib/python3.10/site-packages/pyproj/__init__.py:91: UserWarning: Valid PROJ data directory not found. Either set the path using the environmental variable PROJ_LIB or with `pyproj.datadir.set_data_dir`. warnings.warn(str(err))
name | continent | area_km2 | population | |
---|---|---|---|---|
0 | Andorra | EU | 468.0 | 84000 |
1 | United Arab Emirates | AS | 82880.0 | 4975593 |
2 | Afghanistan | AS | 647500.0 | 29121286 |
3 | Antigua and Barbuda | NA | 443.0 | 86754 |
4 | Anguilla | NA | 102.0 | 13254 |
... | ... | ... | ... | ... |
247 | Yemen | AS | 527970.0 | 23495361 |
248 | Mayotte | AF | 374.0 | 159042 |
249 | South Africa | AF | 1219912.0 | 49000000 |
250 | Zambia | AF | 752614.0 | 13460305 |
251 | Zimbabwe | AF | 390580.0 | 13061000 |
252 rows × 4 columns
Expressions¶
We will continue by exploring the data by continent. We will start by creating an expression with the continent names, since our table only contains the abbreviations.
An expression is one or more operations performed over the data. They can be used to retrieve the data or to build more complex operations.
In this case we will use a case
conditional statement to replace values depending on a condition.
A case
expression will return a case builder, and must be followed by one or more when
calls,
optionally an else_
call, and must end with a call to end
, to complete the full expression.
The expression where case
is called (countries['continent']
in this case)
is evaluated to see if it's equal to any of the first arguments of the calls to when
. And the second
argument is returned. If the value does not match any of the when
values, the value of else_
is returned.
continent_name = (
countries['continent']
.case()
.when('NA', 'North America')
.when('SA', 'South America')
.when('EU', 'Europe')
.when('AF', 'Africa')
.when('AS', 'Asia')
.when('OC', 'Oceania')
.when('AN', 'Antarctica')
.else_('Unknown continent')
.end()
.name('continent_name')
)
continent_name
continent_name | |
---|---|
0 | Europe |
1 | Asia |
2 | Asia |
3 | North America |
4 | North America |
... | ... |
247 | Asia |
248 | Africa |
249 | Africa |
250 | Africa |
251 | Africa |
252 rows × 1 columns
What we did is take the values of the column countries['continent']
, and we created a calculated
column with the names of the continents, as defined in the when
methods.
This calculated column is an expression. The computations didn't happen when defining the continent_name
variable, and the results are not stored. They have been computed when we printed its content.
We can see that by checking the type of continent_name
:
type(continent_name)
ibis.expr.types.strings.StringColumn
In the next tutorial we will see more about eager and lazy mode, and when operations are being executed. For now we can think that the query to the database happens only when we want to see the results.
The important part is that now we can use our continent_name
expression in other expressions.
For example, since this is a column (a StringColumn
to be specific), we can use it as a column
to query the countries table.
Note that when we created the expression we added .name('continent_name')
to it, so the column
has a name when being returned.
countries['name', continent_name, 'area_km2', 'population']
name | continent_name | area_km2 | population | |
---|---|---|---|---|
0 | Andorra | Europe | 468.0 | 84000 |
1 | United Arab Emirates | Asia | 82880.0 | 4975593 |
2 | Afghanistan | Asia | 647500.0 | 29121286 |
3 | Antigua and Barbuda | North America | 443.0 | 86754 |
4 | Anguilla | North America | 102.0 | 13254 |
... | ... | ... | ... | ... |
247 | Yemen | Asia | 527970.0 | 23495361 |
248 | Mayotte | Africa | 374.0 | 159042 |
249 | South Africa | Africa | 1219912.0 | 49000000 |
250 | Zambia | Africa | 752614.0 | 13460305 |
251 | Zimbabwe | Africa | 390580.0 | 13061000 |
252 rows × 4 columns
Just for illustration, let's repeat the same query, but renaming the expression to continent
when using it in the list of columns to fetch.
countries['name', continent_name.name('continent'), 'area_km2', 'population']
name | continent | area_km2 | population | |
---|---|---|---|---|
0 | Andorra | Europe | 468.0 | 84000 |
1 | United Arab Emirates | Asia | 82880.0 | 4975593 |
2 | Afghanistan | Asia | 647500.0 | 29121286 |
3 | Antigua and Barbuda | North America | 443.0 | 86754 |
4 | Anguilla | North America | 102.0 | 13254 |
... | ... | ... | ... | ... |
247 | Yemen | Asia | 527970.0 | 23495361 |
248 | Mayotte | Africa | 374.0 | 159042 |
249 | South Africa | Africa | 1219912.0 | 49000000 |
250 | Zambia | Africa | 752614.0 | 13460305 |
251 | Zimbabwe | Africa | 390580.0 | 13061000 |
252 rows × 4 columns
Aggregating data¶
Now, let's group our data by continent, and let's find the total population of each.
countries.group_by(continent_name).aggregate(
countries['population'].sum().name('total_population')
)
continent_name | total_population | |
---|---|---|
0 | Africa | 1021238685 |
1 | Antarctica | 170 |
2 | Asia | 4130584841 |
3 | Europe | 750724554 |
4 | North America | 540204371 |
5 | Oceania | 36067549 |
6 | South America | 400143568 |
We can see how Asia is the most populated country, followed by Africa. Antarctica is the least populated, as we would expect.
The code to aggregate has two main parts:
- The
group_by
method, that receive the column, expression or list of them to group by - The
aggregate
method, that receives an expression with the reduction we want to apply
To make things a bit clearer, let's first save the reduction.
total_population = countries['population'].sum().name('total_population')
total_population
6878963738
As we can see, if we perform the operation directly, we will get the sum of the total in the column.
But if we take the total_population
expression as the parameter of the aggregate
method, then the total is computed
over every group defined by the group_by
method.
countries.group_by(continent_name).aggregate(total_population)
continent_name | total_population | |
---|---|---|
0 | Africa | 1021238685 |
1 | Antarctica | 170 |
2 | Asia | 4130584841 |
3 | Europe | 750724554 |
4 | North America | 540204371 |
5 | Oceania | 36067549 |
6 | South America | 400143568 |
If we want to compute two aggregates at the same time, we can pass a list to the aggregate
method.
For illustration, we use the continent
column, instead of the continent_names
expression. We can
use both column names and expressions, and also a list with any of them (e.g. [continent_names, 'name']
.
countries.group_by('continent').aggregate(
[total_population, countries['area_km2'].mean().name('average_area')]
)
continent | total_population | average_area | |
---|---|---|---|
0 | AF | 1021238685 | 5.234534e+05 |
1 | AN | 170 | 2.802439e+06 |
2 | AS | 4130584841 | 6.196685e+05 |
3 | EU | 750724554 | 4.293017e+05 |
4 | NA | 540204371 | 5.836313e+05 |
5 | OC | 36067549 | 3.044157e+05 |
6 | SA | 400143568 | 1.272751e+06 |
Joining data¶
Now we are going to get the total gross domestic product (GDP) for each continent. In this case, the GDP data
is not in the same table countries
, but in a table gdp
.
gdp = connection.table('gdp')
gdp
country_code | year | value | |
---|---|---|---|
0 | ABW | 1986 | 4.054634e+08 |
1 | ABW | 1987 | 4.876025e+08 |
2 | ABW | 1988 | 5.964236e+08 |
3 | ABW | 1989 | 6.953044e+08 |
4 | ABW | 1990 | 7.648871e+08 |
... | ... | ... | ... |
9995 | SVK | 2002 | 3.513034e+10 |
9996 | SVK | 2003 | 4.681659e+10 |
9997 | SVK | 2004 | 5.733202e+10 |
9998 | SVK | 2005 | 6.278531e+10 |
9999 | SVK | 2006 | 7.070810e+10 |
10000 rows × 3 columns
The table contains information for different years, we can easily check the range with:
gdp['year'].min(), gdp['year'].max()
(1960, 2017)
Now, we are going to join this data with the countries
table so we can obtain the continent
of each country. The countries
table has several different codes for the countries. Let's find out which
one matches the three letter code in the gdp
table.
countries['iso_alpha2', 'iso_alpha3', 'iso_numeric', 'fips', 'name']
iso_alpha2 | iso_alpha3 | iso_numeric | fips | name | |
---|---|---|---|---|---|
0 | AD | AND | 20 | AN | Andorra |
1 | AE | ARE | 784 | AE | United Arab Emirates |
2 | AF | AFG | 4 | AF | Afghanistan |
3 | AG | ATG | 28 | AC | Antigua and Barbuda |
4 | AI | AIA | 660 | AV | Anguilla |
... | ... | ... | ... | ... | ... |
247 | YE | YEM | 887 | YM | Yemen |
248 | YT | MYT | 175 | MF | Mayotte |
249 | ZA | ZAF | 710 | SF | South Africa |
250 | ZM | ZMB | 894 | ZA | Zambia |
251 | ZW | ZWE | 716 | ZI | Zimbabwe |
252 rows × 5 columns
The country_code
in gdp
corresponds to iso_alpha2
in the countries
table. We can also see
how the gdp
table has 10,000
rows, while countries
has 252
. We will start joining the
two tables by the codes that match, discarding the codes that do not exist in both tables.
This is called an inner join.
countries_and_gdp = countries.inner_join(
gdp, predicates=countries['iso_alpha3'] == gdp['country_code']
)
countries_and_gdp[countries, gdp]
iso_alpha2 | iso_alpha3 | iso_numeric | fips | name | capital | area_km2 | population | continent | country_code | year | value | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AD | AND | 20 | AN | Andorra | Andorra la Vella | 468.0 | 84000 | EU | AND | 1970 | 7.861921e+07 |
1 | AD | AND | 20 | AN | Andorra | Andorra la Vella | 468.0 | 84000 | EU | AND | 1971 | 8.940982e+07 |
2 | AD | AND | 20 | AN | Andorra | Andorra la Vella | 468.0 | 84000 | EU | AND | 1972 | 1.134082e+08 |
3 | AD | AND | 20 | AN | Andorra | Andorra la Vella | 468.0 | 84000 | EU | AND | 1973 | 1.508201e+08 |
4 | AD | AND | 20 | AN | Andorra | Andorra la Vella | 468.0 | 84000 | EU | AND | 1974 | 1.865587e+08 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9482 | ZW | ZWE | 716 | ZI | Zimbabwe | Harare | 390580.0 | 13061000 | AF | ZWE | 2013 | 1.909102e+10 |
9483 | ZW | ZWE | 716 | ZI | Zimbabwe | Harare | 390580.0 | 13061000 | AF | ZWE | 2014 | 1.949552e+10 |
9484 | ZW | ZWE | 716 | ZI | Zimbabwe | Harare | 390580.0 | 13061000 | AF | ZWE | 2015 | 1.996312e+10 |
9485 | ZW | ZWE | 716 | ZI | Zimbabwe | Harare | 390580.0 | 13061000 | AF | ZWE | 2016 | 2.054868e+10 |
9486 | ZW | ZWE | 716 | ZI | Zimbabwe | Harare | 390580.0 | 13061000 | AF | ZWE | 2017 | 2.281301e+10 |
9487 rows × 12 columns
We joined the table with the information for all years. Now we are going to just take the information about the last available year, 2017.
gdp_2017 = gdp.filter(gdp['year'] == 2017)
gdp_2017
country_code | year | value | |
---|---|---|---|
0 | ABW | 2017 | 2.700559e+09 |
1 | AFG | 2017 | 2.019176e+10 |
2 | AGO | 2017 | 1.221238e+11 |
3 | ALB | 2017 | 1.302506e+10 |
4 | AND | 2017 | 3.013387e+09 |
... | ... | ... | ... |
242 | XKX | 2017 | 7.227700e+09 |
243 | YEM | 2017 | 2.681870e+10 |
244 | ZAF | 2017 | 3.495541e+11 |
245 | ZMB | 2017 | 2.586814e+10 |
246 | ZWE | 2017 | 2.281301e+10 |
247 rows × 3 columns
Joining with the new expression we get:
countries_and_gdp = countries.inner_join(
gdp_2017, predicates=countries['iso_alpha3'] == gdp_2017['country_code']
)
countries_and_gdp[countries, gdp_2017]
iso_alpha2 | iso_alpha3 | iso_numeric | fips | name | capital | area_km2 | population | continent | country_code | year | value | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AW | ABW | 533 | AA | Aruba | Oranjestad | 193.0 | 71566 | NA | ABW | 2017 | 2.700559e+09 |
1 | AF | AFG | 4 | AF | Afghanistan | Kabul | 647500.0 | 29121286 | AS | AFG | 2017 | 2.019176e+10 |
2 | AO | AGO | 24 | AO | Angola | Luanda | 1246700.0 | 13068161 | AF | AGO | 2017 | 1.221238e+11 |
3 | AL | ALB | 8 | AL | Albania | Tirana | 28748.0 | 2986952 | EU | ALB | 2017 | 1.302506e+10 |
4 | AD | AND | 20 | AN | Andorra | Andorra la Vella | 468.0 | 84000 | EU | AND | 2017 | 3.013387e+09 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
196 | XK | XKX | 0 | KV | Kosovo | Pristina | 10908.0 | 1800000 | EU | XKX | 2017 | 7.227700e+09 |
197 | YE | YEM | 887 | YM | Yemen | Sanaa | 527970.0 | 23495361 | AS | YEM | 2017 | 2.681870e+10 |
198 | ZA | ZAF | 710 | SF | South Africa | Pretoria | 1219912.0 | 49000000 | AF | ZAF | 2017 | 3.495541e+11 |
199 | ZM | ZMB | 894 | ZA | Zambia | Lusaka | 752614.0 | 13460305 | AF | ZMB | 2017 | 2.586814e+10 |
200 | ZW | ZWE | 716 | ZI | Zimbabwe | Harare | 390580.0 | 13061000 | AF | ZWE | 2017 | 2.281301e+10 |
201 rows × 12 columns
We have called the inner_join
method of the countries
table and passed
the gdp
table as a parameter. The method receives a second parameter, predicates
, that is used to specify
how the join will be performed. In this case we want the iso_alpha3
column in countries
to
match the country_code
column in gdp
. This is specified with the expression
countries['iso_alpha3'] == gdp['country_code']
.