{
“cells”: [
{

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“# Aggregating and joining datan”, “n”, “This is the second introductory tutorial to Ibis. If you are new to Ibis, you may want to startn”, “by the first tutorial, _01-Introduction-to-Ibis_.n”, “n”, “In the first tutorial, we saw how to operate on the data of a table. We will work again withn”, “the countries table as we did previously.”

]

}, {

“cell_type”: “code”, “execution_count”: 1, “metadata”: {}, “outputs”: [

{
“data”: {
“text/plain”: [

” name continent area_km2 populationn”, “0 Andorra EU 468.0 84000n”, “1 United Arab Emirates AS 82880.0 4975593n”, “2 Afghanistan AS 647500.0 29121286n”, “3 Antigua and Barbuda NA 443.0 86754n”, “4 Anguilla NA 102.0 13254n”, “.. … … … …n”, “247 Yemen AS 527970.0 23495361n”, “248 Mayotte AF 374.0 159042n”, “249 South Africa AF 1219912.0 49000000n”, “250 Zambia AF 752614.0 13460305n”, “251 Zimbabwe AF 390580.0 13061000n”, “n”, “[252 rows x 4 columns]”

]

}, “execution_count”: 1, “metadata”: {}, “output_type”: “execute_result”

}

], “source”: [

“import osn”, “import ibisn”, “n”, “n”, “ibis.options.interactive = Truen”, “n”, “connection = ibis.sqlite.connect(os.path.join(‘data’, ‘geography.db’))n”, “countries = connection.table(‘countries’)n”, “n”, “countries[‘name’, ‘continent’, ‘area_km2’, ‘population’]”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“## Expressionsn”, “n”, “We will continue by exploring the data by continent. We will start by creating an expressionn”, “with the continent names, since our table only contains the abbreviations.n”, “n”, “An expression is one or more operations performed over the data. They can be used to retrieve then”, “data or to build more complex operations.n”, “n”, “In this case we will use a case conditional statement to replace values depending on a condition.n”, “A case expression will return a case builder, and must be followed by one or more when calls,n”, “optionally an else_ call, and must end with a call to end, to complete the full expression.n”, “The expression where case is called (countries[‘continent’] in this case)n”, “is evaluated to see if it’s equal to any of the first arguments of the calls to when. And the secondn”, “argument is returned. If the value does not match any of the when values, the value of else_ is returned.”

]

}, {

“cell_type”: “code”, “execution_count”: 2, “metadata”: {}, “outputs”: [

{
“data”: {
“text/plain”: [

“0 Europen”, “1 Asian”, “2 Asian”, “3 North American”, “4 North American”, ” … n”, “247 Asian”, “248 African”, “249 African”, “250 African”, “251 African”, “Name: tmp, Length: 252, dtype: object”

]

}, “execution_count”: 2, “metadata”: {}, “output_type”: “execute_result”

}

], “source”: [

“continent_name = (countries[‘continent’].case()n”, ” .when(‘NA’, ‘North America’)n”, ” .when(‘SA’, ‘South America’)n”, ” .when(‘EU’, ‘Europe’)n”, ” .when(‘AF’, ‘Africa’)n”, ” .when(‘AS’, ‘Asia’)n”, ” .when(‘OC’, ‘Oceania’)n”, ” .when(‘AN’, ‘Anctartica’)n”, ” .else_(‘Unknown continent’)n”, ” .end()n”, ” .name(‘continent_name’))n”, “continent_name”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“What we did is take the values of the column countries[‘continent’], and we created a calculatedn”, “column with the names of the continents, as defined in the when methods.n”, “n”, “This calculated column is an expression. The computations didn’t happen when defining the continent_namen”, “variable, and the results are not stored. They have been computed when we printed its content.n”, “n”, “We can see that by checking the type of continent_name:”

]

}, {

“cell_type”: “code”, “execution_count”: 3, “metadata”: {}, “outputs”: [

{
“data”: {
“text/plain”: [

“ibis.expr.types.StringColumn”

]

}, “execution_count”: 3, “metadata”: {}, “output_type”: “execute_result”

}

], “source”: [

“type(continent_name)”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“In the next tutorial we will see more about eager and lazy mode, and when operations are beingn”, “executed. For now we can think that the query to the database happens only when we want to seen”, “the results.n”, “n”, “The important part is that now we can use our continent_name expression in other expressions.n”, “For example, since this is a column (a StringColumn to be specific), we can use it as a columnn”, “to query the countries table.n”, “n”, “Note that when we created the expression we added .name(‘continent_name’) to it, so the columnn”, “has a name when being returned.”

]

}, {

“cell_type”: “code”, “execution_count”: 4, “metadata”: {}, “outputs”: [

{
“data”: {
“text/plain”: [

” name continent_name area_km2 populationn”, “0 Andorra Europe 468.0 84000n”, “1 United Arab Emirates Asia 82880.0 4975593n”, “2 Afghanistan Asia 647500.0 29121286n”, “3 Antigua and Barbuda North America 443.0 86754n”, “4 Anguilla North America 102.0 13254n”, “.. … … … …n”, “247 Yemen Asia 527970.0 23495361n”, “248 Mayotte Africa 374.0 159042n”, “249 South Africa Africa 1219912.0 49000000n”, “250 Zambia Africa 752614.0 13460305n”, “251 Zimbabwe Africa 390580.0 13061000n”, “n”, “[252 rows x 4 columns]”

]

}, “execution_count”: 4, “metadata”: {}, “output_type”: “execute_result”

}

], “source”: [

“countries[‘name’, continent_name, ‘area_km2’, ‘population’]”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“Just for illustration, let’s repeat the same query, but renaming the expression to continentn”, “when using it in the list of columns to fetch.”

]

}, {

“cell_type”: “code”, “execution_count”: 5, “metadata”: {}, “outputs”: [

{
“data”: {
“text/plain”: [

” name continent area_km2 populationn”, “0 Andorra Europe 468.0 84000n”, “1 United Arab Emirates Asia 82880.0 4975593n”, “2 Afghanistan Asia 647500.0 29121286n”, “3 Antigua and Barbuda North America 443.0 86754n”, “4 Anguilla North America 102.0 13254n”, “.. … … … …n”, “247 Yemen Asia 527970.0 23495361n”, “248 Mayotte Africa 374.0 159042n”, “249 South Africa Africa 1219912.0 49000000n”, “250 Zambia Africa 752614.0 13460305n”, “251 Zimbabwe Africa 390580.0 13061000n”, “n”, “[252 rows x 4 columns]”

]

}, “execution_count”: 5, “metadata”: {}, “output_type”: “execute_result”

}

], “source”: [

“countries[‘name’, continent_name.name(‘continent’), ‘area_km2’, ‘population’]”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“## Aggregating datan”, “n”, “Now, let’s group our data by continent, and let’s find the total population of each.”

]

}, {

“cell_type”: “code”, “execution_count”: 6, “metadata”: {}, “outputs”: [

{
“data”: {
“text/plain”: [

” continent_name total_populationn”, “0 Africa 1021238685n”, “1 Anctartica 170n”, “2 Asia 4130584841n”, “3 Europe 750724554n”, “4 North America 540204371n”, “5 Oceania 36067549n”, “6 South America 400143568”

]

}, “execution_count”: 6, “metadata”: {}, “output_type”: “execute_result”

}

], “source”: [

“countries.group_by(continent_name).aggregate(countries[‘population’].sum().name(‘total_population’))”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“We can see how Asia is the most populated country, followed by Africa. Antarctica is the least populated,n”, “as we would expect.n”, “n”, “The code to aggregate has two main parts:n”, “- The group_by method, that receive the column, expression or list of them to group byn”, “- The aggregate method, that receives an expression with the reduction we want to applyn”, “n”, “To make things a bit clearer, let’s first save the reduction.”

]

}, {

“cell_type”: “code”, “execution_count”: 7, “metadata”: {}, “outputs”: [

{
“data”: {
“text/plain”: [

“6878963738”

]

}, “execution_count”: 7, “metadata”: {}, “output_type”: “execute_result”

}

], “source”: [

“total_population = countries[‘population’].sum().name(‘total_population’)n”, “total_population”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“As we can see, if we perform the operation directly, we will get the sum of the total in the column.n”, “n”, “But if we take the total_population expression as the parameter of the aggregate method, then the total is computedn”, “over every group defined by the group_by method.”

]

}, {

“cell_type”: “code”, “execution_count”: 8, “metadata”: {}, “outputs”: [

{
“data”: {
“text/plain”: [

” continent_name total_populationn”, “0 Africa 1021238685n”, “1 Anctartica 170n”, “2 Asia 4130584841n”, “3 Europe 750724554n”, “4 North America 540204371n”, “5 Oceania 36067549n”, “6 South America 400143568”

]

}, “execution_count”: 8, “metadata”: {}, “output_type”: “execute_result”

}

], “source”: [

“countries.group_by(continent_name).aggregate(total_population)”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“If we want to compute two aggregates at the same time, we can pass a list to the aggregate method.n”, “n”, “For illustration, we use the continent column, instead of the continent_names expression. We cann”, “use both column names and expressions, and also a list with any of them (e.g. [continent_names, ‘name’].”

]

}, {

“cell_type”: “code”, “execution_count”: 9, “metadata”: {}, “outputs”: [

{
“data”: {
“text/plain”: [

” continent total_population average_arean”, “0 AF 1021238685 5.234534e+05n”, “1 AN 170 2.802439e+06n”, “2 AS 4130584841 6.196685e+05n”, “3 EU 750724554 4.293017e+05n”, “4 NA 540204371 5.836313e+05n”, “5 OC 36067549 3.044157e+05n”, “6 SA 400143568 1.272751e+06”

]

}, “execution_count”: 9, “metadata”: {}, “output_type”: “execute_result”

}

], “source”: [

“countries.group_by(‘continent’).aggregate([total_population,n”, ” countries[‘area_km2’].mean().name(‘average_area’)])”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“## Joining datan”, “n”, “Now we are going to get the total gross domestic product (GDP) for each continent. In this case, the GDP datan”, “is not in the same table countries, but in a table gdp.”

]

}, {

“cell_type”: “code”, “execution_count”: 10, “metadata”: {}, “outputs”: [

{
“data”: {
“text/plain”: [

” country_code year valuen”, “0 ABW 1986 4.054634e+08n”, “1 ABW 1987 4.876025e+08n”, “2 ABW 1988 5.964236e+08n”, “3 ABW 1989 6.953044e+08n”, “4 ABW 1990 7.648871e+08n”, “… … … …n”, “9995 SVK 2002 3.513034e+10n”, “9996 SVK 2003 4.681659e+10n”, “9997 SVK 2004 5.733202e+10n”, “9998 SVK 2005 6.278531e+10n”, “9999 SVK 2006 7.070810e+10n”, “n”, “[10000 rows x 3 columns]”

]

}, “execution_count”: 10, “metadata”: {}, “output_type”: “execute_result”

}

], “source”: [

“gdp = connection.table(‘gdp’)n”, “gdp”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“The table contains information for different years, we can easily check the range with:”

]

}, {

“cell_type”: “code”, “execution_count”: 11, “metadata”: {}, “outputs”: [

{
“data”: {
“text/plain”: [

“(1960, 2017)”

]

}, “execution_count”: 11, “metadata”: {}, “output_type”: “execute_result”

}

], “source”: [

“gdp[‘year’].min(), gdp[‘year’].max()”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“Now, we are going to join this data with the countries table so we can obtain the continentn”, “of each country. The countries table has several different codes for the countries. Let’s find out whichn”, “one matches the three letter code in the gdp table.”

]

}, {

“cell_type”: “code”, “execution_count”: 12, “metadata”: {}, “outputs”: [

{
“data”: {
“text/plain”: [

” iso_alpha2 iso_alpha3 iso_numeric fips namen”, “0 AD AND 20 AN Andorran”, “1 AE ARE 784 AE United Arab Emiratesn”, “2 AF AFG 4 AF Afghanistann”, “3 AG ATG 28 AC Antigua and Barbudan”, “4 AI AIA 660 AV Anguillan”, “.. … … … … …n”, “247 YE YEM 887 YM Yemenn”, “248 YT MYT 175 MF Mayotten”, “249 ZA ZAF 710 SF South African”, “250 ZM ZMB 894 ZA Zambian”, “251 ZW ZWE 716 ZI Zimbabwen”, “n”, “[252 rows x 5 columns]”

]

}, “execution_count”: 12, “metadata”: {}, “output_type”: “execute_result”

}

], “source”: [

“countries[‘iso_alpha2’, ‘iso_alpha3’, ‘iso_numeric’, ‘fips’, ‘name’]”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“The country_code in gdp corresponds to iso_alpha2 in the countries table. We can also seen”, “how the gdp table has 10,000 rows, while countries has 252. We will start joining then”, “two tables by the codes that match, discarding the codes that do not exist in both tables.n”, “This is called an inner join.”

]

}, {

“cell_type”: “code”, “execution_count”: 13, “metadata”: {}, “outputs”: [

{
“data”: {
“text/plain”: [

” iso_alpha2 iso_alpha3 iso_numeric fips name capital \n”, “0 AD AND 20 AN Andorra Andorra la Vella n”, “1 AD AND 20 AN Andorra Andorra la Vella n”, “2 AD AND 20 AN Andorra Andorra la Vella n”, “3 AD AND 20 AN Andorra Andorra la Vella n”, “4 AD AND 20 AN Andorra Andorra la Vella n”, “… … … … … … … n”, “9482 ZW ZWE 716 ZI Zimbabwe Harare n”, “9483 ZW ZWE 716 ZI Zimbabwe Harare n”, “9484 ZW ZWE 716 ZI Zimbabwe Harare n”, “9485 ZW ZWE 716 ZI Zimbabwe Harare n”, “9486 ZW ZWE 716 ZI Zimbabwe Harare n”, “n”, ” area_km2 population continent country_code year value n”, “0 468.0 84000 EU AND 1970 7.861921e+07 n”, “1 468.0 84000 EU AND 1971 8.940982e+07 n”, “2 468.0 84000 EU AND 1972 1.134082e+08 n”, “3 468.0 84000 EU AND 1973 1.508201e+08 n”, “4 468.0 84000 EU AND 1974 1.865587e+08 n”, “… … … … … … … n”, “9482 390580.0 13061000 AF ZWE 2013 1.909102e+10 n”, “9483 390580.0 13061000 AF ZWE 2014 1.949552e+10 n”, “9484 390580.0 13061000 AF ZWE 2015 1.996312e+10 n”, “9485 390580.0 13061000 AF ZWE 2016 2.054868e+10 n”, “9486 390580.0 13061000 AF ZWE 2017 2.281301e+10 n”, “n”, “[9487 rows x 12 columns]”

]

}, “execution_count”: 13, “metadata”: {}, “output_type”: “execute_result”

}

], “source”: [

“countries_and_gdp = countries.inner_join(gdp,n”, ” predicates=countries[‘iso_alpha3’] == gdp[‘country_code’])n”, “countries_and_gdp[countries, gdp]”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“We joined the table with the information for all years. Now we are going to just take the information about the last available year, 2017.”

]

}, {

“cell_type”: “code”, “execution_count”: 14, “metadata”: {}, “outputs”: [

{
“data”: {
“text/plain”: [

” country_code year valuen”, “0 ABW 2017 2.700559e+09n”, “1 AFG 2017 2.019176e+10n”, “2 AGO 2017 1.221238e+11n”, “3 ALB 2017 1.302506e+10n”, “4 AND 2017 3.013387e+09n”, “.. … … …n”, “242 XKX 2017 7.227700e+09n”, “243 YEM 2017 2.681870e+10n”, “244 ZAF 2017 3.495541e+11n”, “245 ZMB 2017 2.586814e+10n”, “246 ZWE 2017 2.281301e+10n”, “n”, “[247 rows x 3 columns]”

]

}, “execution_count”: 14, “metadata”: {}, “output_type”: “execute_result”

}

], “source”: [

“gdp_2017 = gdp.filter(gdp[‘year’] == 2017)n”, “gdp_2017”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“Joining with the new expression we get:”

]

}, {

“cell_type”: “code”, “execution_count”: 15, “metadata”: {}, “outputs”: [

{
“data”: {
“text/plain”: [

” iso_alpha2 iso_alpha3 iso_numeric fips name capital \n”, “0 AW ABW 533 AA Aruba Oranjestad n”, “1 AF AFG 4 AF Afghanistan Kabul n”, “2 AO AGO 24 AO Angola Luanda n”, “3 AL ALB 8 AL Albania Tirana n”, “4 AD AND 20 AN Andorra Andorra la Vella n”, “.. … … … … … … n”, “196 XK XKX 0 KV Kosovo Pristina n”, “197 YE YEM 887 YM Yemen Sanaa n”, “198 ZA ZAF 710 SF South Africa Pretoria n”, “199 ZM ZMB 894 ZA Zambia Lusaka n”, “200 ZW ZWE 716 ZI Zimbabwe Harare n”, “n”, ” area_km2 population continent country_code year value n”, “0 193.0 71566 NA ABW 2017 2.700559e+09 n”, “1 647500.0 29121286 AS AFG 2017 2.019176e+10 n”, “2 1246700.0 13068161 AF AGO 2017 1.221238e+11 n”, “3 28748.0 2986952 EU ALB 2017 1.302506e+10 n”, “4 468.0 84000 EU AND 2017 3.013387e+09 n”, “.. … … … … … … n”, “196 10908.0 1800000 EU XKX 2017 7.227700e+09 n”, “197 527970.0 23495361 AS YEM 2017 2.681870e+10 n”, “198 1219912.0 49000000 AF ZAF 2017 3.495541e+11 n”, “199 752614.0 13460305 AF ZMB 2017 2.586814e+10 n”, “200 390580.0 13061000 AF ZWE 2017 2.281301e+10 n”, “n”, “[201 rows x 12 columns]”

]

}, “execution_count”: 15, “metadata”: {}, “output_type”: “execute_result”

}

], “source”: [

“countries_and_gdp = countries.inner_join(gdp_2017,n”, ” predicates=countries[‘iso_alpha3’] == gdp_2017[‘country_code’])n”, “countries_and_gdp[countries, gdp_2017]”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“We have called the inner_join method of the countries table and passedn”, “the gdp table as a parameter. The method receives a second parameter, predicates, that is used to specifyn”, “how the join will be performed. In this case we want the iso_alpha3 column in countries ton”, “match the country_code column in gdp. This is specified with the expressionn”, “countries[‘iso_alpha3’] == gdp[‘country_code’].n”, “n”, “In the example countries_and_gdp is not an expression that can be executed directly. The result of a joinn”, “can cause conflicts if both tables have column names in common. Before we can execute the query, we needn”, “to materialize the join, by calling the .materialize() method, or by selecting the columns to be returned.”

]

}

], “metadata”: {

“kernelspec”: {

“display_name”: “Python 3”, “language”: “python”, “name”: “python3”

}, “language_info”: {

“codemirror_mode”: {

“name”: “ipython”, “version”: 3

}, “file_extension”: “.py”, “mimetype”: “text/x-python”, “name”: “python”, “nbconvert_exporter”: “python”, “pygments_lexer”: “ipython3”, “version”: “3.8.5”

}

}, “nbformat”: 4, “nbformat_minor”: 4

}