Introduction to Ibis

Ibis is a Python framework to access data and perform analytical computations from different sources, in a standard way.

In a way, you can think of Ibis as writing SQL in Python, with a focus on analytics, more than simply accessing data. And aside from SQL databases, you can use it with other backends, including big data systems.

Why not simply use SQL instead? SQL is great and widely used. However, SQL has different flavors for different database engines, and SQL is very difficult to maintain when your queries are very complex. Ibis solves both problems by standardizing your code across backends and making it maintainable. Since Ibis is Python, you can structure your code in different files, functions, name variables, write tests, etc.

This tutorial will guide you through Ibis features and provide practical examples. Some knowledge of Python is assumed and knowledge of SQL will be helpful but not required.

Ibis is open source - if anything can be improved in this tutorial, or in Ibis itself, please open an issue in the Ibis GitHub repository or open a pull request with the fix.

Getting started

To start using Ibis, you need a Python environment with Ibis installed. If you don’t know how to create an environment, we recommend following the setup instructions in the Ibis website.

Once you have your environment ready, to start using Ibis simply import the ibis module:

[1]:
import ibis

To make it things easier in this tutorial, we will be using Ibis interactive mode. For production code, that will rarely be the case. More details on Ibis non-interactive (aka lazy) mode are covered in the third tutorial, Expressions, lazy mode and logging queries.

To set the interactive mode, use:

[2]:
ibis.options.interactive = True

Next thing we need is to create a connection object. The connection defines where the data is stored and where the computations will be performed.

For a comparison to pandas, this is not the same as where the data is imported from (e.g. pandas.read_sql). pandas loads data into memory and performs the computations itself. Ibis won’t load the data and perform any computation, but instead will leave the data in the backend defined in the connection, and will ask the backend to perform the computations.

In this tutorial we will be using a SQLite connection for its simplicity (no installation is needed). But Ibis can work with many different backends, including big data systems, or GPU-accelerated analytical databases. As well as most common relational databases (PostgreSQL, MySQL,…).

To create a SQL connection to our example SQLite database data/geography.db, use:

[3]:
import os

database_file_path = os.path.join('data', 'geography.db')

connection = ibis.sqlite.connect(database_file_path)

Note that if you installed Ibis with pip instead of conda, you may need to install the SQLite backend separately with pip install ibis-framework[sqlite].

Exploring the data

To list the tables in the connection object, we can use the .list_tables() method. If you are using Jupyter, you can see all the methods and attributes of the connection object by writing connection. and pressing the <TAB> key.

[4]:
connection.list_tables()
[4]:
['countries', 'gdp']

These two tables include data about countries, and about GDP by country and year.

The data from countries has been obtained from GeoNames. The GDP table will be used in the next tutorial, and the data has been obtained from the World Bank website.

Next, we want to access a specific table in the database. We can create a handler to the countries table with:

[5]:
countries = connection.table('countries')

To list the columns of the countries table, we can use the columns attribute.

Again, Jupyter users can see all the methods and attributes of the countries object by typing countries. and pressing <TAB>.

[6]:
countries.columns
[6]:
['iso_alpha2',
 'iso_alpha3',
 'iso_numeric',
 'fips',
 'name',
 'capital',
 'area_km2',
 'population',
 'continent']

We can now access a sample of the data. Let’s focus on the name, continent and population columns to start with. We can visualize the values of the columns with:

[7]:
countries['name', 'continent', 'population']
[7]:
                     name continent  population
0                 Andorra        EU       84000
1    United Arab Emirates        AS     4975593
2             Afghanistan        AS    29121286
3     Antigua and Barbuda        NA       86754
4                Anguilla        NA       13254
..                    ...       ...         ...
247                 Yemen        AS    23495361
248               Mayotte        AF      159042
249          South Africa        AF    49000000
250                Zambia        AF    13460305
251              Zimbabwe        AF    13061000

[252 rows x 3 columns]

The table is too big for all the results to be displayed, and we probably don’t want to see all of them at once anyway. For this reason, just the beginning and the end of the results is displayed. Often, the number of rows will be so large that this operation could take a long time.

To check how many rows a table has, we can use the .count() method:

[8]:
countries.count()
[8]:
252

To fetch just a subset of the rows, we can use the .limit(n) method, where n is the number of samples we want. In this case we will fetch the first 3 countries from the table:

[9]:
countries['name', 'continent', 'population'].limit(3)
[9]:
                   name continent  population
0               Andorra        EU       84000
1  United Arab Emirates        AS     4975593
2           Afghanistan        AS    29121286

Filters and order

Now that we’ve got an intuition of the data available in the table countries, we will extract some information from it by applying filters and sorting the data.

Let’s focus on a single continent. We can see a list of unique continents in the table using the .distinct() method:

[10]:
countries['continent'].distinct()
[10]:
0    EU
1    AS
2    NA
3    AF
4    AN
5    SA
6    OC
Name: continent, dtype: object

We will focus on Asia (AS in the table). We can identify which rows belong to Asian countries using the standard Python == operator:

[11]:
countries['continent'] == 'AS'
[11]:
0      False
1       True
2       True
3      False
4      False
       ...
247     True
248    False
249    False
250    False
251    False
Name: tmp, Length: 252, dtype: bool

The result has a value True for rows where the condition is true, and the value False when it’s not.

We can provide this expression to the method .filter(), and save the result in the variable asian_countries for future use.

[12]:
asian_countries = countries['name', 'continent', 'population'].filter(countries['continent'] == 'AS')
asian_countries
[12]:
                              name continent  population
0             United Arab Emirates        AS     4975593
1                      Afghanistan        AS    29121286
2                          Armenia        AS     2968000
3                       Azerbaijan        AS     8303512
4                       Bangladesh        AS   156118464
5                          Bahrain        AS      738004
6                           Brunei        AS      395027
7                           Bhutan        AS      699847
8          Cocos [Keeling] Islands        AS         628
9                            China        AS  1330044000
10                         Georgia        AS     4630000
11                       Hong Kong        AS     6898686
12                       Indonesia        AS   242968342
13                          Israel        AS     7353985
14                           India        AS  1173108018
15  British Indian Ocean Territory        AS        4000
16                            Iraq        AS    29671605
17                            Iran        AS    76923300
18                          Jordan        AS     6407085
19                           Japan        AS   127288000
20                      Kyrgyzstan        AS     5776500
21                        Cambodia        AS    14453680
22                     North Korea        AS    22912177
23                     South Korea        AS    48422644
24                          Kuwait        AS     2789132
25                      Kazakhstan        AS    15340000
26                            Laos        AS     6368162
27                         Lebanon        AS     4125247
28                       Sri Lanka        AS    21513990
29                         Myanmar        AS    53414374
30                        Mongolia        AS     3086918
31                           Macao        AS      449198
32                        Maldives        AS      395650
33                        Malaysia        AS    28274729
34                           Nepal        AS    28951852
35                            Oman        AS     2967717
36                     Philippines        AS    99900177
37                        Pakistan        AS   184404791
38                       Palestine        AS     3800000
39                           Qatar        AS      840926
40                    Saudi Arabia        AS    25731776
41                       Singapore        AS     4701069
42                           Syria        AS    22198110
43                        Thailand        AS    67089500
44                      Tajikistan        AS     7487489
45                    Turkmenistan        AS     4940916
46                          Turkey        AS    77804122
47                          Taiwan        AS    22894384
48                      Uzbekistan        AS    27865738
49                         Vietnam        AS    89571130
50                           Yemen        AS    23495361

We can check how many countries exist in Asia (based on the information in the database) by using the .count() method we’ve already seen:

[13]:
asian_countries.count()
[13]:
51

Next, we want to find the most populated countries in Asia. To obtain them, we are going to sort the countries by the column population, and just fetch the first 10. To sort by a column in Ibis, we can use the .sort_by() method:

[14]:
asian_countries.sort_by('population').limit(10)
[14]:
                             name continent  population
0         Cocos [Keeling] Islands        AS         628
1  British Indian Ocean Territory        AS        4000
2                          Brunei        AS      395027
3                        Maldives        AS      395650
4                           Macao        AS      449198
5                          Bhutan        AS      699847
6                         Bahrain        AS      738004
7                           Qatar        AS      840926
8                          Kuwait        AS     2789132
9                            Oman        AS     2967717

This will return the least populated countries, since .sort_by will by default order in ascending order (ascending order like in 1, 2, 3, 4). This behavior is consistent with SQL ORDER BY.

To order in descending order we can use ibis.desc():

[15]:
asian_countries.sort_by(ibis.desc('population')).limit(10)
[15]:
          name continent  population
0        China        AS  1330044000
1        India        AS  1173108018
2    Indonesia        AS   242968342
3     Pakistan        AS   184404791
4   Bangladesh        AS   156118464
5        Japan        AS   127288000
6  Philippines        AS    99900177
7      Vietnam        AS    89571130
8       Turkey        AS    77804122
9         Iran        AS    76923300

This is the list of the 10 most populated countries based on the data from GeoNames.

To learn more about Ibis, continue to the next tutorial.