Introduction to Ibis¶
Ibis is a Python framework to access data and perform analytical computations from different sources, in a standard way.
In a way, you can think of Ibis as writing SQL in Python, with a focus on analytics, more than simply accessing data. And aside from SQL databases, you can use it with other backends, including big data systems.
Why not simply use SQL instead? SQL is great and widely used. However, SQL has different flavors for different database engines, and SQL is very difficult to maintain when your queries are very complex. Ibis solves both problems by standardizing your code across backends and making it maintainable. Since Ibis is Python, you can structure your code in different files, functions, name variables, write tests, etc.
This tutorial will guide you through Ibis features and provide practical examples. Some knowledge of Python is assumed and knowledge of SQL will be helpful but not required.
Ibis is open source - if anything can be improved in this tutorial, or in Ibis itself, please open an issue in the Ibis GitHub repository or open a pull request with the fix.
Getting started¶
To start using Ibis, you need a Python environment with Ibis installed. Follow the installation instructions for SQLite to setup an environment.
Once you have your environment ready, to start using Ibis simply import the ibis
module:
import ibis
To make it things easier in this tutorial, we will be using Ibis interactive mode. For production code, that will rarely be the case. More details on Ibis non-interactive (aka lazy) mode are covered in the third tutorial, Expressions, lazy mode and logging queries.
To set the interactive mode, use:
ibis.options.interactive = True
Next thing we need is to create a connection object. The connection defines where the data is stored and where the computations will be performed.
For a comparison to pandas, this is not the same as where the data is imported from (e.g. pandas.read_sql
). pandas loads data into memory and performs the computations itself. Ibis won't load the data and perform any computation, but instead will leave the data in the backend defined in the connection, and will ask the backend to perform the computations.
In this tutorial we will be using a SQLite connection for its simplicity (no installation is needed). But Ibis can work with many different backends, including big data systems, or GPU-accelerated analytical databases. As well as most common relational databases (PostgreSQL, MySQL,...).
Let's download the SQLite database from the ibis-tutorial-data
GCS (Google Cloud Storage) bucket, then connect to it using ibis
.
!curl -LsS -o $TEMPDIR/geography.db 'https://storage.googleapis.com/ibis-tutorial-data/geography.db'
import os
import tempfile
connection = ibis.sqlite.connect(
os.path.join(tempfile.gettempdir(), 'geography.db')
)
/nix/store/15c5ssh54syvbiv9waav5gcb2r8n0800-python3-3.10.5-env/lib/python3.10/site-packages/pyproj/__init__.py:91: UserWarning: Valid PROJ data directory not found. Either set the path using the environmental variable PROJ_LIB or with `pyproj.datadir.set_data_dir`. warnings.warn(str(err))
Note that if you installed Ibis with pip
instead of conda
, you may need to install the SQLite backend separately with pip install 'ibis-framework[sqlite]'
.
Exploring the data¶
To list the tables in the connection
object, we can use the .list_tables()
method. If you are using Jupyter, you can see all the methods and attributes of the connection
object by writing connection.
and pressing the <TAB>
key.
connection.list_tables()
['countries', 'gdp', 'independence']
These two tables include data about countries, and about GDP by country and year.
The data from countries has been obtained from GeoNames. The GDP table will be used in the next tutorial, and the data has been obtained from the World Bank website.
Next, we want to access a specific table in the database. We can create a handler to the countries
table with:
countries = connection.table('countries')
To list the columns of the countries
table, we can use the columns
attribute.
Again, Jupyter users can see all the methods and attributes of the countries
object by typing countries.
and pressing <TAB>
.
countries.columns
['iso_alpha2', 'iso_alpha3', 'iso_numeric', 'fips', 'name', 'capital', 'area_km2', 'population', 'continent']
We can now access a sample of the data. Let's focus on the name
, continent
and population
columns to start with. We can visualize the values of the columns with:
countries['name', 'continent', 'population']
name | continent | population | |
---|---|---|---|
0 | Andorra | EU | 84000 |
1 | United Arab Emirates | AS | 4975593 |
2 | Afghanistan | AS | 29121286 |
3 | Antigua and Barbuda | NA | 86754 |
4 | Anguilla | NA | 13254 |
... | ... | ... | ... |
247 | Yemen | AS | 23495361 |
248 | Mayotte | AF | 159042 |
249 | South Africa | AF | 49000000 |
250 | Zambia | AF | 13460305 |
251 | Zimbabwe | AF | 13061000 |
252 rows × 3 columns
The table is too big for all the results to be displayed, and we probably don't want to see all of them at once anyway. For this reason, just the beginning and the end of the results is displayed. Often, the number of rows will be so large that this operation could take a long time.
To check how many rows a table has, we can use the .count()
method:
countries.count()
252
To fetch just a subset of the rows, we can use the .limit(n)
method, where n
is the number of samples we want. In this case we will fetch the first 3
countries from the table:
countries['name', 'continent', 'population'].limit(3)
name | continent | population | |
---|---|---|---|
0 | Andorra | EU | 84000 |
1 | United Arab Emirates | AS | 4975593 |
2 | Afghanistan | AS | 29121286 |
Filters and order¶
Now that we've got an intuition of the data available in the table countries
, we will extract some information from it by applying filters and sorting the data.
Let's focus on a single continent. We can see a list of unique continents in the table using the .distinct()
method:
countries[['continent']].distinct()
continent | |
---|---|
0 | EU |
1 | AS |
2 | NA |
3 | AF |
4 | AN |
5 | SA |
6 | OC |
We will focus on Asia (AS
in the table). We can identify which rows belong to Asian countries using the standard Python ==
operator:
countries['continent'] == 'AS'
tmp | |
---|---|
0 | False |
1 | True |
2 | True |
3 | False |
4 | False |
... | ... |
247 | True |
248 | False |
249 | False |
250 | False |
251 | False |
252 rows × 1 columns
The result has a value True
for rows where the condition is true, and the value False
when it's not.
We can provide this expression to the method .filter()
, and save the result in the variable asian_countries
for future use.
asian_countries = countries['name', 'continent', 'population'].filter(
countries['continent'] == 'AS'
)
asian_countries
name | continent | population | |
---|---|---|---|
0 | United Arab Emirates | AS | 4975593 |
1 | Afghanistan | AS | 29121286 |
2 | Armenia | AS | 2968000 |
3 | Azerbaijan | AS | 8303512 |
4 | Bangladesh | AS | 156118464 |
5 | Bahrain | AS | 738004 |
6 | Brunei | AS | 395027 |
7 | Bhutan | AS | 699847 |
8 | Cocos [Keeling] Islands | AS | 628 |
9 | China | AS | 1330044000 |
10 | Georgia | AS | 4630000 |
11 | Hong Kong | AS | 6898686 |
12 | Indonesia | AS | 242968342 |
13 | Israel | AS | 7353985 |
14 | India | AS | 1173108018 |
15 | British Indian Ocean Territory | AS | 4000 |
16 | Iraq | AS | 29671605 |
17 | Iran | AS | 76923300 |
18 | Jordan | AS | 6407085 |
19 | Japan | AS | 127288000 |
20 | Kyrgyzstan | AS | 5776500 |
21 | Cambodia | AS | 14453680 |
22 | North Korea | AS | 22912177 |
23 | South Korea | AS | 48422644 |
24 | Kuwait | AS | 2789132 |
25 | Kazakhstan | AS | 15340000 |
26 | Laos | AS | 6368162 |
27 | Lebanon | AS | 4125247 |
28 | Sri Lanka | AS | 21513990 |
29 | Myanmar | AS | 53414374 |
30 | Mongolia | AS | 3086918 |
31 | Macao | AS | 449198 |
32 | Maldives | AS | 395650 |
33 | Malaysia | AS | 28274729 |
34 | Nepal | AS | 28951852 |
35 | Oman | AS | 2967717 |
36 | Philippines | AS | 99900177 |
37 | Pakistan | AS | 184404791 |
38 | Palestine | AS | 3800000 |
39 | Qatar | AS | 840926 |
40 | Saudi Arabia | AS | 25731776 |
41 | Singapore | AS | 4701069 |
42 | Syria | AS | 22198110 |
43 | Thailand | AS | 67089500 |
44 | Tajikistan | AS | 7487489 |
45 | Turkmenistan | AS | 4940916 |
46 | Turkey | AS | 77804122 |
47 | Taiwan | AS | 22894384 |
48 | Uzbekistan | AS | 27865738 |
49 | Vietnam | AS | 89571130 |
50 | Yemen | AS | 23495361 |
We can check how many countries exist in Asia (based on the information in the database) by using the .count()
method we've already seen:
asian_countries.count()
51
Next, we want to find the most populated countries in Asia. To obtain them, we are going to sort the countries by the column population
, and just fetch the first 10. To sort by a column in Ibis, we can use the .sort_by()
method:
asian_countries.sort_by('population').limit(10)
name | continent | population | |
---|---|---|---|
0 | Cocos [Keeling] Islands | AS | 628 |
1 | British Indian Ocean Territory | AS | 4000 |
2 | Brunei | AS | 395027 |
3 | Maldives | AS | 395650 |
4 | Macao | AS | 449198 |
5 | Bhutan | AS | 699847 |
6 | Bahrain | AS | 738004 |
7 | Qatar | AS | 840926 |
8 | Kuwait | AS | 2789132 |
9 | Oman | AS | 2967717 |
This will return the least populated countries, since .sort_by
will by default order in ascending order (ascending order like in 1, 2, 3, 4
). This behavior is consistent with SQL ORDER BY
.
To order in descending order we can use ibis.desc()
:
asian_countries.sort_by(ibis.desc('population')).limit(10)
name | continent | population | |
---|---|---|---|
0 | China | AS | 1330044000 |
1 | India | AS | 1173108018 |
2 | Indonesia | AS | 242968342 |
3 | Pakistan | AS | 184404791 |
4 | Bangladesh | AS | 156118464 |
5 | Japan | AS | 127288000 |
6 | Philippines | AS | 99900177 |
7 | Vietnam | AS | 89571130 |
8 | Turkey | AS | 77804122 |
9 | Iran | AS | 76923300 |
This is the list of the 10 most populated countries based on the data from GeoNames.
To learn more about Ibis, continue to the next tutorial.