import pandas as pd
import numpy as np
def create_observations(n: int, seed: int = 42) -> pd.DataFrame:
    ibis_species = ["Sacred Ibis", "Scarlet Ibis", "Glossy Ibis", "White Ibis"]
    locations = ["Wetland", "Grassland", "Coastal"]
    behaviors = ["Feeding", "Nesting", "Flying"]
    weather_conditions = ["Sunny", "Rainy"]

    # For reproducibility
    np.random.seed(seed)
    return pd.DataFrame(
        {
            "observation_date": np.full(n, np.datetime64("2024-01-01"))
            + np.random.randint(0, 365, size=n).astype("timedelta64[D]"),
            "species": np.random.choice(ibis_species, size=n),
            "location": np.random.choice(locations, size=n),
            "group_size": np.random.randint(1, 20, size=n),
            "behavior": np.random.choice(behaviors, size=n),
            "weather": np.random.choice(weather_conditions, size=n),
            "temperature_c": np.random.normal(25, 5, size=n),  # Mean 25°C, std 5°C
        }
    )

ibis_observations = create_observations(1000)
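Before loading anything into Athena, it can help to sanity-check the generated frame locally. Below is a sketch that restates the generator compactly (so the snippet is self-contained) and verifies the shape, value ranges, and that the seed makes the output reproducible:

```python
import numpy as np
import pandas as pd

def make_obs(n: int, seed: int = 42) -> pd.DataFrame:
    # Mirrors the generator above, restated here so this snippet runs on its own.
    np.random.seed(seed)
    return pd.DataFrame(
        {
            "observation_date": np.full(n, np.datetime64("2024-01-01"))
            + np.random.randint(0, 365, size=n).astype("timedelta64[D]"),
            "species": np.random.choice(
                ["Sacred Ibis", "Scarlet Ibis", "Glossy Ibis", "White Ibis"], size=n
            ),
            "location": np.random.choice(["Wetland", "Grassland", "Coastal"], size=n),
            "group_size": np.random.randint(1, 20, size=n),
            "behavior": np.random.choice(["Feeding", "Nesting", "Flying"], size=n),
            "weather": np.random.choice(["Sunny", "Rainy"], size=n),
            "temperature_c": np.random.normal(25, 5, size=n),
        }
    )

df = make_obs(1000)
assert df.shape == (1000, 7)
assert df["group_size"].between(1, 19).all()  # randint(1, 20) is half-open
assert df.equals(make_obs(1000))  # same seed -> identical frame
```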
Have you ever wanted to harness the power of AWS Athena, but found yourself tangled up in Presto SQL syntax? Good news! Ibis now supports Amazon Athena as its newest backend, bringing you the familiar comfort of DataFrame operations while tapping into AWS’s robust data lake architecture.
Why?
There’s even more to love about this integration. Athena’s pay-per-query pricing model means you pay based on the data each query processes. With Ibis’ query optimisation before execution, you can potentially reduce costs without needing to agonise over query efficiency. Plus, since Athena can query data directly from S3, this new backend lets you analyse your data lake contents with beloved Python libraries like PyArrow and pandas without the hassle of downloading or moving massive datasets.
Installation Prerequisites
Make sure you have an IAM account and that your AWS credentials are available in a standard location in your local environment (for example, ~/.aws/credentials).
Additionally, using the same account and region that you are using for Athena, you will need to create an S3 bucket where Athena can write query results. This bucket is passed as s3_staging_dir in the connection call to the Athena backend.
If you are not able to query Athena through the awscli, your queries will not work through Ibis either. Please note that AWS charges will apply for the Athena queries executed while following this tutorial.
Installation
Install Ibis with the dependencies needed to work with AWS Athena:
$ pip install 'ibis-framework[athena]'
Data
We are going to create some sample ecological data about ibis behaviour. The data will contain columns with information about species, location, group size, behaviour, weather, and temperature.
Demo
Let’s start by opening a connection to AWS Athena with Ibis, using the S3 bucket we created to store query results.
from ibis.interactive import *
con = ibis.athena.connect(
    s3_staging_dir="s3://aws-athena-query-results-ibis-testing",
    region_name="us-east-2",
)
Let’s create some data using our ibis_observations pandas DataFrame.
con.create_database("mydatabase", force=True)
con.drop_table("ibis_observations", force=True)
con.create_table("ibis_observations", obj=ibis_observations, database="mydatabase")
con.list_tables(database="mydatabase")
['ibis_observations']
And we can grab information about table schemas to help us out with our queries:
con.get_schema("ibis_observations", database="mydatabase")
ibis.Schema {
observation_date timestamp
species string
location string
group_size int64
behavior string
weather string
temperature_c float64
}
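The schema above falls out of the pandas dtypes of the source frame. The correspondence can be seen locally with a tiny example (an informal mapping based on the schema output shown here, not an exhaustive type table):

```python
import numpy as np
import pandas as pd

# A small frame mirroring ibis_observations' dtypes, to show how pandas
# dtypes line up with the Ibis schema printed above.
df = pd.DataFrame(
    {
        "observation_date": pd.to_datetime(["2024-01-01", "2024-06-15"]),  # -> timestamp
        "species": ["Sacred Ibis", "White Ibis"],                          # object -> string
        "group_size": np.array([3, 7], dtype="int64"),                     # -> int64
        "temperature_c": np.array([24.5, 26.1]),                           # -> float64
    }
)
print(df.dtypes)
```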
And now we are able to grab the table, and make some Ibis queries! Like what is the average group size by species?
t = con.table("ibis_observations", database="mydatabase")

# Average group size by species
t.group_by("species").aggregate(avg_group=t.group_size.mean())
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ species      ┃ avg_group ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ string       │ float64   │
├──────────────┼───────────┤
│ Glossy Ibis  │ 10.003906 │
│ Sacred Ibis  │ 10.030075 │
│ Scarlet Ibis │  9.673307 │
│ White Ibis   │ 10.259912 │
└──────────────┴───────────┘
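Because Ibis expressions mirror familiar DataFrame semantics, the same aggregation is easy to cross-check in plain pandas. A local sketch on regenerated data (the exact means differ slightly from the table above, since this compact version draws columns in a different order from the full generator):

```python
import numpy as np
import pandas as pd

np.random.seed(42)
n = 1000
df = pd.DataFrame(
    {
        "species": np.random.choice(
            ["Sacred Ibis", "Scarlet Ibis", "Glossy Ibis", "White Ibis"], size=n
        ),
        "group_size": np.random.randint(1, 20, size=n),
    }
)
# pandas equivalent of t.group_by("species").aggregate(avg_group=t.group_size.mean())
avg_group = df.groupby("species")["group_size"].mean()
print(avg_group)
```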
And Ibis does all the work of generating the Presto SQL that Athena can understand.
How about most common behaviour during rainy weather?
(
    t.filter(t.weather == "Rainy")
    .group_by("behavior")
    .aggregate(count=lambda t: t.count())
    .order_by(ibis.desc("count"))
)
┏━━━━━━━━━━┳━━━━━━━┓
┃ behavior ┃ count ┃
┡━━━━━━━━━━╇━━━━━━━┩
│ string   │ int64 │
├──────────┼───────┤
│ Feeding  │   172 │
│ Nesting  │   168 │
│ Flying   │   166 │
└──────────┴───────┘
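The filter-then-count chain has the same shape in plain pandas. Another local sketch on regenerated data (counts will differ from the table above for the same draw-order reason as before):

```python
import numpy as np
import pandas as pd

np.random.seed(42)
n = 1000
df = pd.DataFrame(
    {
        "behavior": np.random.choice(["Feeding", "Nesting", "Flying"], size=n),
        "weather": np.random.choice(["Sunny", "Rainy"], size=n),
    }
)
# pandas equivalent of the Ibis filter -> group_by -> aggregate -> order_by chain
rainy_counts = (
    df[df["weather"] == "Rainy"]
    .groupby("behavior")
    .size()
    .sort_values(ascending=False)
)
print(rainy_counts)
```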
Temperature effects on behaviour?
t.group_by("behavior").aggregate(avg_temp=t.temperature_c.mean()).order_by("avg_temp")
┏━━━━━━━━━━┳━━━━━━━━━━━┓
┃ behavior ┃ avg_temp  ┃
┡━━━━━━━━━━╇━━━━━━━━━━━┩
│ string   │ float64   │
├──────────┼───────────┤
│ Feeding  │ 25.006455 │
│ Flying   │ 25.116323 │
│ Nesting  │ 25.133050 │
└──────────┴───────────┘
Now that we’re nearing the end of this demo, I wanted to show you that you can also delete tables and databases using ibis:
con.drop_table("ibis_observations", database="mydatabase")
con.drop_database("mydatabase")
con.disconnect()
You wouldn’t need to fiddle with Athena’s SDK!
How does this all work?
Under the hood, AWS Athena runs on a version of Trino (formerly known as Presto SQL). Instead of writing a completely new SQL compiler for Athena, we were able to leverage Ibis’ existing Trino compiler with some careful adjustments.
This makes for significant gains in code efficiency: the Athena backend implementation required only about 40 lines of unique code.
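The reuse pattern is easy to picture: subclass the existing compiler and override only what differs. A toy illustration of the idea (the class and method names here are hypothetical, not Ibis' actual internals):

```python
# Toy sketch of backend reuse by inheritance; not the real Ibis source.
class TrinoCompiler:
    dialect = "trino"

    def quote_identifier(self, name: str) -> str:
        # Trino quotes identifiers with double quotes.
        return f'"{name}"'

    def compile_limit(self, n: int) -> str:
        return f"LIMIT {n}"


class AthenaCompiler(TrinoCompiler):
    # Athena runs an older Trino, so only a handful of behaviours need
    # overriding; everything else is inherited unchanged.
    dialect = "athena"


compiler = AthenaCompiler()
print(compiler.quote_identifier("species"))  # "species" (inherited)
print(compiler.compile_limit(10))            # LIMIT 10 (inherited)
```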
There are some nuances to note: since Athena runs on an older version of Trino, not all of Trino’s newest features are available. For a detailed comparison of supported features across different backends, please check out the Ibis backend support matrix.
If you’re new here, welcome. Here are some resources to learn more about Ibis:
Chat with us on Zulip: