Introduction to Ibis¶
Ibis is a Python framework to access data and perform analytical computations from different sources, in a standard way.
In a way, you can think of Ibis as writing SQL in Python, with a focus on analytics, more than simply accessing data. And aside from SQL databases, you can use it with other backends, including big data systems.
Why not simply use SQL instead? SQL is great and widely used. However, SQL has different flavors for different database engines, and SQL is very difficult to maintain when your queries are very complex. Ibis solves both problems by standardizing your code across backends and making it maintainable. Since Ibis is Python, you can structure your code in different files, functions, name variables, write tests, etc.
This tutorial will guide you through Ibis features and provide practical examples. Some knowledge of Python is assumed and knowledge of SQL will be helpful but not required.
Ibis is open source - if anything can be improved in this tutorial, or in Ibis itself, please open an issue in the Ibis GitHub repository or open a pull request with the fix.
Getting started¶
To start using Ibis, you need a Python environment with Ibis installed. If you don’t know how to create an environment, we recommend following the setup instructions in the Ibis website.
Once you have your environment ready, to start using Ibis simply import the ibis
module:
[1]:
import ibis
To make it things easier in this tutorial, we will be using Ibis interactive mode. For production code, that will rarely be the case. More details on Ibis non-interactive (aka lazy) mode are covered in the third tutorial, Expressions, lazy mode and logging queries.
To set the interactive mode, use:
[2]:
ibis.options.interactive = True
Next thing we need is to create a connection object. The connection defines where the data is stored and where the computations will be performed.
For a comparison to pandas, this is not the same as where the data is imported from (e.g. pandas.read_sql
). pandas loads data into memory and performs the computations itself. Ibis won’t load the data and perform any computation, but instead will leave the data in the backend defined in the connection, and will ask the backend to perform the computations.
In this tutorial we will be using a SQLite connection for its simplicity (no installation is needed). But Ibis can work with many different backends, including big data systems, or GPU-accelerated analytical databases. As well as most common relational databases (PostgreSQL, MySQL,…).
To create a SQL connection to our example SQLite database data/geography.db
, use:
[3]:
import os
database_file_path = os.path.join('data', 'geography.db')
connection = ibis.sqlite.connect(database_file_path)
Note that if you installed Ibis with pip
instead of conda
, you may need to install the SQLite backend separately with pip install ibis-framework[sqlite]
.
Exploring the data¶
To list the tables in the connection
object, we can use the .list_tables()
method. If you are using Jupyter, you can see all the methods and attributes of the connection
object by writing connection.
and pressing the <TAB>
key.
[4]:
connection.list_tables()
[4]:
['countries', 'gdp']
These two tables include data about countries, and about GDP by country and year.
The data from countries has been obtained from GeoNames. The GDP table will be used in the next tutorial, and the data has been obtained from the World Bank website.
Next, we want to access a specific table in the database. We can create a handler to the countries
table with:
[5]:
countries = connection.table('countries')
To list the columns of the countries
table, we can use the columns
attribute.
Again, Jupyter users can see all the methods and attributes of the countries
object by typing countries.
and pressing <TAB>
.
[6]:
countries.columns
[6]:
['iso_alpha2',
'iso_alpha3',
'iso_numeric',
'fips',
'name',
'capital',
'area_km2',
'population',
'continent']
We can now access a sample of the data. Let’s focus on the name
, continent
and population
columns to start with. We can visualize the values of the columns with:
[7]:
countries['name', 'continent', 'population']
[7]:
name continent population
0 Andorra EU 84000
1 United Arab Emirates AS 4975593
2 Afghanistan AS 29121286
3 Antigua and Barbuda NA 86754
4 Anguilla NA 13254
.. ... ... ...
247 Yemen AS 23495361
248 Mayotte AF 159042
249 South Africa AF 49000000
250 Zambia AF 13460305
251 Zimbabwe AF 13061000
[252 rows x 3 columns]
The table is too big for all the results to be displayed, and we probably don’t want to see all of them at once anyway. For this reason, just the beginning and the end of the results is displayed. Often, the number of rows will be so large that this operation could take a long time.
To check how many rows a table has, we can use the .count()
method:
[8]:
countries.count()
[8]:
252
To fetch just a subset of the rows, we can use the .limit(n)
method, where n
is the number of samples we want. In this case we will fetch the first 3
countries from the table:
[9]:
countries['name', 'continent', 'population'].limit(3)
[9]:
name continent population
0 Andorra EU 84000
1 United Arab Emirates AS 4975593
2 Afghanistan AS 29121286
Filters and order¶
Now that we’ve got an intuition of the data available in the table countries
, we will extract some information from it by applying filters and sorting the data.
Let’s focus on a single continent. We can see a list of unique continents in the table using the .distinct()
method:
[10]:
countries['continent'].distinct()
[10]:
0 EU
1 AS
2 NA
3 AF
4 AN
5 SA
6 OC
Name: continent, dtype: object
We will focus on Asia (AS
in the table). We can identify which rows belong to Asian countries using the standard Python ==
operator:
[11]:
countries['continent'] == 'AS'
[11]:
0 False
1 True
2 True
3 False
4 False
...
247 True
248 False
249 False
250 False
251 False
Name: tmp, Length: 252, dtype: bool
The result has a value True
for rows where the condition is true, and the value False
when it’s not.
We can provide this expression to the method .filter()
, and save the result in the variable asian_countries
for future use.
[12]:
asian_countries = countries['name', 'continent', 'population'].filter(countries['continent'] == 'AS')
asian_countries
[12]:
name continent population
0 United Arab Emirates AS 4975593
1 Afghanistan AS 29121286
2 Armenia AS 2968000
3 Azerbaijan AS 8303512
4 Bangladesh AS 156118464
5 Bahrain AS 738004
6 Brunei AS 395027
7 Bhutan AS 699847
8 Cocos [Keeling] Islands AS 628
9 China AS 1330044000
10 Georgia AS 4630000
11 Hong Kong AS 6898686
12 Indonesia AS 242968342
13 Israel AS 7353985
14 India AS 1173108018
15 British Indian Ocean Territory AS 4000
16 Iraq AS 29671605
17 Iran AS 76923300
18 Jordan AS 6407085
19 Japan AS 127288000
20 Kyrgyzstan AS 5776500
21 Cambodia AS 14453680
22 North Korea AS 22912177
23 South Korea AS 48422644
24 Kuwait AS 2789132
25 Kazakhstan AS 15340000
26 Laos AS 6368162
27 Lebanon AS 4125247
28 Sri Lanka AS 21513990
29 Myanmar AS 53414374
30 Mongolia AS 3086918
31 Macao AS 449198
32 Maldives AS 395650
33 Malaysia AS 28274729
34 Nepal AS 28951852
35 Oman AS 2967717
36 Philippines AS 99900177
37 Pakistan AS 184404791
38 Palestine AS 3800000
39 Qatar AS 840926
40 Saudi Arabia AS 25731776
41 Singapore AS 4701069
42 Syria AS 22198110
43 Thailand AS 67089500
44 Tajikistan AS 7487489
45 Turkmenistan AS 4940916
46 Turkey AS 77804122
47 Taiwan AS 22894384
48 Uzbekistan AS 27865738
49 Vietnam AS 89571130
50 Yemen AS 23495361
We can check how many countries exist in Asia (based on the information in the database) by using the .count()
method we’ve already seen:
[13]:
asian_countries.count()
[13]:
51
Next, we want to find the most populated countries in Asia. To obtain them, we are going to sort the countries by the column population
, and just fetch the first 10. To sort by a column in Ibis, we can use the .sort_by()
method:
[14]:
asian_countries.sort_by('population').limit(10)
[14]:
name continent population
0 Cocos [Keeling] Islands AS 628
1 British Indian Ocean Territory AS 4000
2 Brunei AS 395027
3 Maldives AS 395650
4 Macao AS 449198
5 Bhutan AS 699847
6 Bahrain AS 738004
7 Qatar AS 840926
8 Kuwait AS 2789132
9 Oman AS 2967717
This will return the least populated countries, since .sort_by
will by default order in ascending order (ascending order like in 1, 2, 3, 4
). This behavior is consistent with SQL ORDER BY
.
To order in descending order we can use ibis.desc()
:
[15]:
asian_countries.sort_by(ibis.desc('population')).limit(10)
[15]:
name continent population
0 China AS 1330044000
1 India AS 1173108018
2 Indonesia AS 242968342
3 Pakistan AS 184404791
4 Bangladesh AS 156118464
5 Japan AS 127288000
6 Philippines AS 99900177
7 Vietnam AS 89571130
8 Turkey AS 77804122
9 Iran AS 76923300
This is the list of the 10 most populated countries based on the data from GeoNames.
To learn more about Ibis, continue to the next tutorial.