Introduction to Ibis¶
Ibis is a Python framework to access data and perform analytical computations from different sources, in a standard way.
In a way, you can think of Ibis as writing SQL in Python, with a focus on analytics, more than simply accessing data. And aside from SQL databases, you can use it with other backends, including big data systems.
Why not simply use SQL instead? SQL is great and widely used. However, SQL has different flavors for different database engines, and SQL is very difficult to maintain when your queries are very complex. Ibis solves both problems by standardizing your code across backends and making it maintainable. Since Ibis is Python, you can structure your code in different files, functions, name variables, write tests, etc.
This tutorial will guide you through Ibis features and provide practical examples. Some knowledge of Python is assumed and knowledge of SQL will be helpful but not required.
Ibis is open source - if anything can be improved in this tutorial, or in Ibis itself, please open an issue in the Ibis GitHub repository or open a pull request with the fix.
To start using Ibis, you need a Python environment with Ibis installed. If you don’t know how to create an environment, we recommend following the setup instructions in the Ibis website.
Once you have your environment ready, to start using Ibis simply import the
To make it things easier in this tutorial, we will be using Ibis interactive mode. For production code, that will rarely be the case. More details on Ibis non-interactive (aka lazy) mode are covered in the third tutorial, Expressions, lazy mode and logging queries.
To set the interactive mode, use:
ibis.options.interactive = True
Next thing we need is to create a connection object. The connection defines where the data is stored and where the computations will be performed.
For a comparison to pandas, this is not the same as where the data is imported from (e.g.
pandas.read_sql). pandas loads data into memory and performs the computations itself. Ibis won’t load the data and perform any computation, but instead will leave the data in the backend defined in the connection, and will ask the backend to perform the computations.
In this tutorial we will be using a SQLite connection for its simplicity (no installation is needed). But Ibis can work with many different backends, including big data systems, or GPU-accelerated analytical databases. As well as most common relational databases (PostgreSQL, MySQL,…).
To create a SQL connection to our example SQLite database
import os database_file_path = os.path.join('data', 'geography.db') connection = ibis.sqlite.connect(database_file_path)
Note that if you installed Ibis with
pip instead of
conda, you may need to install the SQLite backend separately with
pip install ibis-framework[sqlite].
Exploring the data¶
To list the tables in the
connection object, we can use the
.list_tables() method. If you are using Jupyter, you can see all the methods and attributes of the
connection object by writing
connection. and pressing the
These two tables include data about countries, and about GDP by country and year.
Next, we want to access a specific table in the database. We can create a handler to the
countries table with:
countries = connection.table('countries')
To list the columns of the
countries table, we can use the
Again, Jupyter users can see all the methods and attributes of the
countries object by typing
countries. and pressing
['iso_alpha2', 'iso_alpha3', 'iso_numeric', 'fips', 'name', 'capital', 'area_km2', 'population', 'continent']
We can now access a sample of the data. Let’s focus on the
population columns to start with. We can visualize the values of the columns with:
countries['name', 'continent', 'population']
name continent population 0 Andorra EU 84000 1 United Arab Emirates AS 4975593 2 Afghanistan AS 29121286 3 Antigua and Barbuda NA 86754 4 Anguilla NA 13254 .. ... ... ... 247 Yemen AS 23495361 248 Mayotte AF 159042 249 South Africa AF 49000000 250 Zambia AF 13460305 251 Zimbabwe AF 13061000 [252 rows x 3 columns]
The table is too big for all the results to be displayed, and we probably don’t want to see all of them at once anyway. For this reason, just the beginning and the end of the results is displayed. Often, the number of rows will be so large that this operation could take a long time.
To check how many rows a table has, we can use the
To fetch just a subset of the rows, we can use the
.limit(n) method, where
n is the number of samples we want. In this case we will fetch the first
3 countries from the table:
countries['name', 'continent', 'population'].limit(3)
name continent population 0 Andorra EU 84000 1 United Arab Emirates AS 4975593 2 Afghanistan AS 29121286
Filters and order¶
Now that we’ve got an intuition of the data available in the table
countries, we will extract some information from it by applying filters and sorting the data.
Let’s focus on a single continent. We can see a list of unique continents in the table using the
0 EU 1 AS 2 NA 3 AF 4 AN 5 SA 6 OC Name: continent, dtype: object
We will focus on Asia (
AS in the table). We can identify which rows belong to Asian countries using the standard Python
countries['continent'] == 'AS'
0 False 1 True 2 True 3 False 4 False ... 247 True 248 False 249 False 250 False 251 False Name: tmp, Length: 252, dtype: bool
The result has a value
True for rows where the condition is true, and the value
False when it’s not.
We can provide this expression to the method
.filter(), and save the result in the variable
asian_countries for future use.
asian_countries = countries['name', 'continent', 'population'].filter(countries['continent'] == 'AS') asian_countries
name continent population 0 United Arab Emirates AS 4975593 1 Afghanistan AS 29121286 2 Armenia AS 2968000 3 Azerbaijan AS 8303512 4 Bangladesh AS 156118464 5 Bahrain AS 738004 6 Brunei AS 395027 7 Bhutan AS 699847 8 Cocos [Keeling] Islands AS 628 9 China AS 1330044000 10 Georgia AS 4630000 11 Hong Kong AS 6898686 12 Indonesia AS 242968342 13 Israel AS 7353985 14 India AS 1173108018 15 British Indian Ocean Territory AS 4000 16 Iraq AS 29671605 17 Iran AS 76923300 18 Jordan AS 6407085 19 Japan AS 127288000 20 Kyrgyzstan AS 5776500 21 Cambodia AS 14453680 22 North Korea AS 22912177 23 South Korea AS 48422644 24 Kuwait AS 2789132 25 Kazakhstan AS 15340000 26 Laos AS 6368162 27 Lebanon AS 4125247 28 Sri Lanka AS 21513990 29 Myanmar AS 53414374 30 Mongolia AS 3086918 31 Macao AS 449198 32 Maldives AS 395650 33 Malaysia AS 28274729 34 Nepal AS 28951852 35 Oman AS 2967717 36 Philippines AS 99900177 37 Pakistan AS 184404791 38 Palestine AS 3800000 39 Qatar AS 840926 40 Saudi Arabia AS 25731776 41 Singapore AS 4701069 42 Syria AS 22198110 43 Thailand AS 67089500 44 Tajikistan AS 7487489 45 Turkmenistan AS 4940916 46 Turkey AS 77804122 47 Taiwan AS 22894384 48 Uzbekistan AS 27865738 49 Vietnam AS 89571130 50 Yemen AS 23495361
We can check how many countries exist in Asia (based on the information in the database) by using the
.count() method we’ve already seen:
Next, we want to find the most populated countries in Asia. To obtain them, we are going to sort the countries by the column
population, and just fetch the first 10. To sort by a column in Ibis, we can use the
name continent population 0 Cocos [Keeling] Islands AS 628 1 British Indian Ocean Territory AS 4000 2 Brunei AS 395027 3 Maldives AS 395650 4 Macao AS 449198 5 Bhutan AS 699847 6 Bahrain AS 738004 7 Qatar AS 840926 8 Kuwait AS 2789132 9 Oman AS 2967717
This will return the least populated countries, since
.sort_by will by default order in ascending order (ascending order like in
1, 2, 3, 4). This behavior is consistent with SQL
To order in descending order we can use
name continent population 0 China AS 1330044000 1 India AS 1173108018 2 Indonesia AS 242968342 3 Pakistan AS 184404791 4 Bangladesh AS 156118464 5 Japan AS 127288000 6 Philippines AS 99900177 7 Vietnam AS 89571130 8 Turkey AS 77804122 9 Iran AS 76923300
This is the list of the 10 most populated countries based on the data from GeoNames.
To learn more about Ibis, continue to the next tutorial.