{
“cells”: [
{

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“# Introduction to Ibisn”, “n”, “Ibis is a Python framework to access data and perform analytical computations from different sources, in a standard way.n”, “n”, “In a way, you can think of Ibis as writing SQL in Python, with a focus on analytics, more than simply accessing data. And aside from SQL databases, you can use it with other backends, including big data systems.n”, “n”, “Why not simply use SQL instead? SQL is great and widely used. However, SQL has different flavors for different database engines, and SQL is very difficult to maintain when your queries are very complex. Ibis solves both problems by standardizing your code across backends and making it maintainable. Since Ibis is Python, you can structure your code in different files, functions, name variables, write tests, etc.n”, “n”, “This tutorial will guide you through Ibis features and provide practical examples. Some knowledge of Python is assumed and knowledge of SQL will be helpful but not required.n”, “n”, “Ibis is open source - if anything can be improved in this tutorial, or in Ibis itself, please open an issue in the [Ibis GitHub repository](https://github.com/ibis-project/ibis/) or open a pull request with the fix.”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“### Getting startedn”, “n”, “To start using Ibis, you need a Python environment with Ibis installed. If you don’t know how to create an environment, we recommend following the [setup instructions](https://ibis-project.org/getting_started.html) in the Ibis website.n”, “n”, “Once you have your environment ready, to start using Ibis simply import the ibis module:”

]

}, {

“cell_type”: “code”, “execution_count”: 1, “metadata”: {}, “outputs”: [], “source”: [

“import ibis”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“To make it things easier in this tutorial, we will be using _Ibis interactive mode_. For production code, that will rarely be the case. More details on Ibis non-interactive (aka lazy) mode are covered in the third tutorial, _Expressions, lazy mode and logging queries_.n”, “n”, “To set the interactive mode, use:”

]

}, {

“cell_type”: “code”, “execution_count”: 2, “metadata”: {}, “outputs”: [], “source”: [

“ibis.options.interactive = True”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“Next thing we need is to create a connection object. The connection defines where the data is stored and where the computations will be performed.n”, “n”, “For a comparison to pandas, this is not the same as where the data is imported from (e.g. pandas.read_sql). pandas loads data into memory and performs the computations itself. Ibis won’t load the data and perform any computation, but instead will leave the data in the backend defined in the connection, and will _ask_ the backend to perform the computations.n”, “n”, “In this tutorial we will be using a SQLite connection for its simplicity (no installation is needed). But Ibis can work with many different backends, including big data systems, or GPU-accelerated analytical databases. As well as most common relational databases (PostgreSQL, MySQL,…).n”, “n”, “To create a SQL connection to our example SQLite database data/geography.db, use:”

]

}, {

“cell_type”: “code”, “execution_count”: 3, “metadata”: {}, “outputs”: [], “source”: [

“import osn”, “n”, “database_file_path = os.path.join(‘data’, ‘geography.db’)n”, “n”, “connection = ibis.sqlite.connect(database_file_path)”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“Note that if you installed Ibis with pip instead of conda, you may need to install the SQLite backend separately with pip install ibis-framework[sqlite].n”, “n”, “### Exploring the datan”, “n”, “To list the tables in the connection object, we can use the .list_tables() method. If you are using Jupyter, you can see all the methods and attributes of the connection object by writing connection. and pressing the <TAB> key.”

]

}, {

“cell_type”: “code”, “execution_count”: 4, “metadata”: {}, “outputs”: [

{
“data”: {
“text/plain”: [

“[‘countries’, ‘gdp’]”

]

}, “execution_count”: 4, “metadata”: {}, “output_type”: “execute_result”

}

], “source”: [

“connection.list_tables()”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“These two tables include data about countries, and about GDP by country and year.n”, “n”, “The data from countries has been obtained from [GeoNames](https://www.geonames.org/countries/).n”, “The GDP table will be used in the next tutorial, and the data has been obtained from then”, “[World Bank website](https://data.worldbank.org/indicator/NY.GDP.MKTP.CD).n”, “n”, “Next, we want to access a specific table in the database. We can create a handler to the countries table with:”

]

}, {

“cell_type”: “code”, “execution_count”: 5, “metadata”: {}, “outputs”: [], “source”: [

“countries = connection.table(‘countries’)”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“To list the columns of the countries table, we can use the columns attribute.n”, “n”, “Again, Jupyter users can see all the methods and attributes of the countries object by typing countries. and pressing <TAB>.”

]

}, {

“cell_type”: “code”, “execution_count”: 6, “metadata”: {}, “outputs”: [

{
“data”: {
“text/plain”: [

“[‘iso_alpha2’,n”, ” ‘iso_alpha3’,n”, ” ‘iso_numeric’,n”, ” ‘fips’,n”, ” ‘name’,n”, ” ‘capital’,n”, ” ‘area_km2’,n”, ” ‘population’,n”, ” ‘continent’]”

]

}, “execution_count”: 6, “metadata”: {}, “output_type”: “execute_result”

}

], “source”: [

“countries.columns”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“We can now access a sample of the data. Let’s focus on the name, continent and population columns to start with. We can visualize the values of the columns with:”

]

}, {

“cell_type”: “code”, “execution_count”: 7, “metadata”: {}, “outputs”: [

{
“data”: {
“text/plain”: [

” name continent populationn”, “0 Andorra EU 84000n”, “1 United Arab Emirates AS 4975593n”, “2 Afghanistan AS 29121286n”, “3 Antigua and Barbuda NA 86754n”, “4 Anguilla NA 13254n”, “.. … … …n”, “247 Yemen AS 23495361n”, “248 Mayotte AF 159042n”, “249 South Africa AF 49000000n”, “250 Zambia AF 13460305n”, “251 Zimbabwe AF 13061000n”, “n”, “[252 rows x 3 columns]”

]

}, “execution_count”: 7, “metadata”: {}, “output_type”: “execute_result”

}

], “source”: [

“countries[‘name’, ‘continent’, ‘population’]”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“The table is too big for all the results to be displayed, and we probably don’t want to see all of them at once anyway. For this reason, just the beginning and the end of the results is displayed. Often, the number of rows will be so large that this operation could take a long time.n”, “n”, “To check how many rows a table has, we can use the .count() method:”

]

}, {

“cell_type”: “code”, “execution_count”: 8, “metadata”: {}, “outputs”: [

{
“data”: {
“text/plain”: [

“252”

]

}, “execution_count”: 8, “metadata”: {}, “output_type”: “execute_result”

}

], “source”: [

“countries.count()”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“To fetch just a subset of the rows, we can use the .limit(n) method, where n is the number of samples we want. In this case we will fetch the first 3 countries from the table:”

]

}, {

“cell_type”: “code”, “execution_count”: 9, “metadata”: {}, “outputs”: [

{
“data”: {
“text/plain”: [

” name continent populationn”, “0 Andorra EU 84000n”, “1 United Arab Emirates AS 4975593n”, “2 Afghanistan AS 29121286”

]

}, “execution_count”: 9, “metadata”: {}, “output_type”: “execute_result”

}

], “source”: [

“countries[‘name’, ‘continent’, ‘population’].limit(3)”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“### Filters and ordern”, “n”, “Now that we’ve got an intuition of the data available in the table countries, we will extract some information from it by applying filters and sorting the data.n”, “n”, “Let’s focus on a single continent. We can see a list of unique continents in the table using the .distinct() method:”

]

}, {

“cell_type”: “code”, “execution_count”: 10, “metadata”: {}, “outputs”: [

{
“data”: {
“text/plain”: [

“0 EUn”, “1 ASn”, “2 NAn”, “3 AFn”, “4 ANn”, “5 SAn”, “6 OCn”, “Name: continent, dtype: object”

]

}, “execution_count”: 10, “metadata”: {}, “output_type”: “execute_result”

}

], “source”: [

“countries[‘continent’].distinct()”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“We will focus on Asia (AS in the table). We can identify which rows belong to Asian countries using the standard Python == operator:”

]

}, {

“cell_type”: “code”, “execution_count”: 11, “metadata”: {}, “outputs”: [

{
“data”: {
“text/plain”: [

“0 Falsen”, “1 Truen”, “2 Truen”, “3 Falsen”, “4 Falsen”, ” … n”, “247 Truen”, “248 Falsen”, “249 Falsen”, “250 Falsen”, “251 Falsen”, “Name: tmp, Length: 252, dtype: bool”

]

}, “execution_count”: 11, “metadata”: {}, “output_type”: “execute_result”

}

], “source”: [

“countries[‘continent’] == ‘AS’”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“The result has a value True for rows where the condition is true, and the value False when it’s not.n”, “n”, “We can provide this expression to the method .filter(), and save the result in the variable asian_countries for future use.”

]

}, {

“cell_type”: “code”, “execution_count”: 12, “metadata”: {

“scrolled”: true

}, “outputs”: [

{
“data”: {
“text/plain”: [

” name continent populationn”, “0 United Arab Emirates AS 4975593n”, “1 Afghanistan AS 29121286n”, “2 Armenia AS 2968000n”, “3 Azerbaijan AS 8303512n”, “4 Bangladesh AS 156118464n”, “5 Bahrain AS 738004n”, “6 Brunei AS 395027n”, “7 Bhutan AS 699847n”, “8 Cocos [Keeling] Islands AS 628n”, “9 China AS 1330044000n”, “10 Georgia AS 4630000n”, “11 Hong Kong AS 6898686n”, “12 Indonesia AS 242968342n”, “13 Israel AS 7353985n”, “14 India AS 1173108018n”, “15 British Indian Ocean Territory AS 4000n”, “16 Iraq AS 29671605n”, “17 Iran AS 76923300n”, “18 Jordan AS 6407085n”, “19 Japan AS 127288000n”, “20 Kyrgyzstan AS 5776500n”, “21 Cambodia AS 14453680n”, “22 North Korea AS 22912177n”, “23 South Korea AS 48422644n”, “24 Kuwait AS 2789132n”, “25 Kazakhstan AS 15340000n”, “26 Laos AS 6368162n”, “27 Lebanon AS 4125247n”, “28 Sri Lanka AS 21513990n”, “29 Myanmar AS 53414374n”, “30 Mongolia AS 3086918n”, “31 Macao AS 449198n”, “32 Maldives AS 395650n”, “33 Malaysia AS 28274729n”, “34 Nepal AS 28951852n”, “35 Oman AS 2967717n”, “36 Philippines AS 99900177n”, “37 Pakistan AS 184404791n”, “38 Palestine AS 3800000n”, “39 Qatar AS 840926n”, “40 Saudi Arabia AS 25731776n”, “41 Singapore AS 4701069n”, “42 Syria AS 22198110n”, “43 Thailand AS 67089500n”, “44 Tajikistan AS 7487489n”, “45 Turkmenistan AS 4940916n”, “46 Turkey AS 77804122n”, “47 Taiwan AS 22894384n”, “48 Uzbekistan AS 27865738n”, “49 Vietnam AS 89571130n”, “50 Yemen AS 23495361”

]

}, “execution_count”: 12, “metadata”: {}, “output_type”: “execute_result”

}

], “source”: [

“asian_countries = countries[‘name’, ‘continent’, ‘population’].filter(countries[‘continent’] == ‘AS’)n”, “asian_countries”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“We can check how many countries exist in Asia (based on the information in the database) by using the .count() method we’ve already seen:”

]

}, {

“cell_type”: “code”, “execution_count”: 13, “metadata”: {}, “outputs”: [

{
“data”: {
“text/plain”: [

“51”

]

}, “execution_count”: 13, “metadata”: {}, “output_type”: “execute_result”

}

], “source”: [

“asian_countries.count()”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“Next, we want to find the most populated countries in Asia. To obtain them, we are going to sort the countries by the column population, and just fetch the first 10. To sort by a column in Ibis, we can use the .sort_by() method:”

]

}, {

“cell_type”: “code”, “execution_count”: 14, “metadata”: {}, “outputs”: [

{
“data”: {
“text/plain”: [

” name continent populationn”, “0 Cocos [Keeling] Islands AS 628n”, “1 British Indian Ocean Territory AS 4000n”, “2 Brunei AS 395027n”, “3 Maldives AS 395650n”, “4 Macao AS 449198n”, “5 Bhutan AS 699847n”, “6 Bahrain AS 738004n”, “7 Qatar AS 840926n”, “8 Kuwait AS 2789132n”, “9 Oman AS 2967717”

]

}, “execution_count”: 14, “metadata”: {}, “output_type”: “execute_result”

}

], “source”: [

“asian_countries.sort_by(‘population’).limit(10)”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“This will return the least populated countries, since .sort_by will by default order in ascending order (ascending order like in 1, 2, 3, 4). This behavior is consistent with SQL ORDER BY.n”, “n”, “To order in descending order we can use ibis.desc():”

]

}, {

“cell_type”: “code”, “execution_count”: 15, “metadata”: {}, “outputs”: [

{
“data”: {
“text/plain”: [

” name continent populationn”, “0 China AS 1330044000n”, “1 India AS 1173108018n”, “2 Indonesia AS 242968342n”, “3 Pakistan AS 184404791n”, “4 Bangladesh AS 156118464n”, “5 Japan AS 127288000n”, “6 Philippines AS 99900177n”, “7 Vietnam AS 89571130n”, “8 Turkey AS 77804122n”, “9 Iran AS 76923300”

]

}, “execution_count”: 15, “metadata”: {}, “output_type”: “execute_result”

}

], “source”: [

“asian_countries.sort_by(ibis.desc(‘population’)).limit(10)”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“This is the list of the 10 most populated countries based on the data from [GeoNames](https://www.geonames.org/).n”, “n”, “To learn more about Ibis, continue to the next tutorial.”

]

}

], “metadata”: {

“kernelspec”: {

“display_name”: “Python 3”, “language”: “python”, “name”: “python3”

}, “language_info”: {

“codemirror_mode”: {

“name”: “ipython”, “version”: 3

}, “file_extension”: “.py”, “mimetype”: “text/x-python”, “name”: “python”, “nbconvert_exporter”: “python”, “pygments_lexer”: “ipython3”, “version”: “3.7.6”

}

}, “nbformat”: 4, “nbformat_minor”: 4

}