Tutorial: getting started

This is a quick tour of some basic commands and usage patterns, just to get your flippers wet.

Tip

You can run this tutorial in a GitHub Codespace with everything setup for you:

Install Ibis

We recommend starting with the default (DuckDB) backend for a performant, fully-featured local experience. You can install Ibis with pip, conda, mamba, or pixi.

pip install 'ibis-framework[duckdb,examples]'
Warning

Note that the ibis-framework package is not the same as the ibis package in PyPI. These two libraries cannot coexist in the same Python environment, as they are both imported with the ibis module name.

conda install -c conda-forge ibis-duckdb
mamba install -c conda-forge ibis-duckdb
pixi add ibis-duckdb

Create a database file

Ibis can work with several file types, but at its core it connects to existing databases and interacts with the data there. We’ll analyze the Palmer penguins 1 dataset with DuckDB to get the hang of this.

import ibis

con = ibis.connect("duckdb://penguins.ddb")
con.create_table(
    "penguins", ibis.examples.penguins.fetch().to_pyarrow(), overwrite=True
)
DatabaseTable: penguins.main.penguins
  species           string
  island            string
  bill_length_mm    float64
  bill_depth_mm     float64
  flipper_length_mm int64
  body_mass_g       int64
  sex               string
  year              int64

You can now see the example dataset copied over to the database:

con.list_tables()
['penguins']

There’s one table, called penguins. We can ask Ibis to give us an object that we can interact with.

penguins = con.table("penguins")
penguins
DatabaseTable: penguins
  species           string
  island            string
  bill_length_mm    float64
  bill_depth_mm     float64
  flipper_length_mm int64
  body_mass_g       int64
  sex               string
  year              int64

Ibis is lazily evaluated, so instead of seeing the data, we see the schema of the table, instead. To peek at the data, we can call head and then to_pandas to get the first few rows of the table as a pandas DataFrame.

penguins.head().to_pandas()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
3 Adelie Torgersen NaN NaN NaN NaN None 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007

to_pandas takes the existing lazy table expression and evaluates it. If we leave it off, you’ll see the Ibis representation of the table expression that to_pandas will evaluate (when you’re ready!).

penguins.head()
r0 := DatabaseTable: penguins
  species           string
  island            string
  bill_length_mm    float64
  bill_depth_mm     float64
  flipper_length_mm int64
  body_mass_g       int64
  sex               string
  year              int64

Limit[r0, n=5]
Note

Ibis returns results as a pandas DataFrame using to_pandas, but isn’t using pandas to perform any of the computation. The query is executed by the backend (DuckDB in this case). Only when to_pandas is called does Ibis then pull back the results and convert them into a DataFrame.

Interactive mode

For the rest of this intro, we’ll turn on interactive mode, which partially executes queries to give users a preview of the results. There is a small difference in the way the output is formatted, but otherwise this is the same as calling to_pandas on the table expression with a limit of 10 result rows returned.

ibis.options.interactive = True
penguins.head()
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie Torgersen39.118.71813750male  2007 │
│ Adelie Torgersen39.517.41863800female2007 │
│ Adelie Torgersen40.318.01953250female2007 │
│ Adelie TorgersenNULLNULLNULLNULLNULL2007 │
│ Adelie Torgersen36.719.31933450female2007 │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

Common operations

Ibis has a collection of useful table methods to manipulate and query the data in a table (or tables).

filter

filter allows you to select rows based on a condition or set of conditions.

We can filter so we only have penguins of the species Adelie:

penguins.filter(penguins.species == "Adelie")
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie Torgersen39.118.71813750male  2007 │
│ Adelie Torgersen39.517.41863800female2007 │
│ Adelie Torgersen40.318.01953250female2007 │
│ Adelie TorgersenNULLNULLNULLNULLNULL2007 │
│ Adelie Torgersen36.719.31933450female2007 │
│ Adelie Torgersen39.320.61903650male  2007 │
│ Adelie Torgersen38.917.81813625female2007 │
│ Adelie Torgersen39.219.61954675male  2007 │
│ Adelie Torgersen34.118.11933475NULL2007 │
│ Adelie Torgersen42.020.21904250NULL2007 │
│  │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

Or filter for Adelie penguins that reside on the island of Torgersen:

penguins.filter((penguins.island == "Torgersen") & (penguins.species == "Adelie"))
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie Torgersen39.118.71813750male  2007 │
│ Adelie Torgersen39.517.41863800female2007 │
│ Adelie Torgersen40.318.01953250female2007 │
│ Adelie TorgersenNULLNULLNULLNULLNULL2007 │
│ Adelie Torgersen36.719.31933450female2007 │
│ Adelie Torgersen39.320.61903650male  2007 │
│ Adelie Torgersen38.917.81813625female2007 │
│ Adelie Torgersen39.219.61954675male  2007 │
│ Adelie Torgersen34.118.11933475NULL2007 │
│ Adelie Torgersen42.020.21904250NULL2007 │
│  │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

You can use any boolean comparison in a filter (although if you try to do something like use < on a string, Ibis will yell at you).

select

Your data analysis might not require all the columns present in a given table. select lets you pick out only those columns that you want to work with.

To select a column you can use the name of the column as a string:

penguins.select("species", "island", "year")
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓
┃ species  island     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩
│ stringstringint64 │
├─────────┼───────────┼───────┤
│ Adelie Torgersen2007 │
│ Adelie Torgersen2007 │
│ Adelie Torgersen2007 │
│ Adelie Torgersen2007 │
│ Adelie Torgersen2007 │
│ Adelie Torgersen2007 │
│ Adelie Torgersen2007 │
│ Adelie Torgersen2007 │
│ Adelie Torgersen2007 │
│ Adelie Torgersen2007 │
│  │
└─────────┴───────────┴───────┘

Or you can use column objects directly (this can be convenient when paired with tab-completion):

penguins.select(penguins.species, penguins.island, penguins.year)
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓
┃ species  island     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩
│ stringstringint64 │
├─────────┼───────────┼───────┤
│ Adelie Torgersen2007 │
│ Adelie Torgersen2007 │
│ Adelie Torgersen2007 │
│ Adelie Torgersen2007 │
│ Adelie Torgersen2007 │
│ Adelie Torgersen2007 │
│ Adelie Torgersen2007 │
│ Adelie Torgersen2007 │
│ Adelie Torgersen2007 │
│ Adelie Torgersen2007 │
│  │
└─────────┴───────────┴───────┘

Or you can mix-and-match:

penguins.select("species", "island", penguins.year)
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓
┃ species  island     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩
│ stringstringint64 │
├─────────┼───────────┼───────┤
│ Adelie Torgersen2007 │
│ Adelie Torgersen2007 │
│ Adelie Torgersen2007 │
│ Adelie Torgersen2007 │
│ Adelie Torgersen2007 │
│ Adelie Torgersen2007 │
│ Adelie Torgersen2007 │
│ Adelie Torgersen2007 │
│ Adelie Torgersen2007 │
│ Adelie Torgersen2007 │
│  │
└─────────┴───────────┴───────┘

mutate

mutate lets you add new columns to your table, derived from the values of existing columns.

penguins.mutate(bill_length_cm=penguins.bill_length_mm / 10)
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year   bill_length_cm ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64float64        │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┼────────────────┤
│ Adelie Torgersen39.118.71813750male  20073.91 │
│ Adelie Torgersen39.517.41863800female20073.95 │
│ Adelie Torgersen40.318.01953250female20074.03 │
│ Adelie TorgersenNULLNULLNULLNULLNULL2007NULL │
│ Adelie Torgersen36.719.31933450female20073.67 │
│ Adelie Torgersen39.320.61903650male  20073.93 │
│ Adelie Torgersen38.917.81813625female20073.89 │
│ Adelie Torgersen39.219.61954675male  20073.92 │
│ Adelie Torgersen34.118.11933475NULL20073.41 │
│ Adelie Torgersen42.020.21904250NULL20074.20 │
│  │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┴────────────────┘

Notice that the table is a little too wide to display all the columns now (depending on your screen-size). bill_length is now present in millimeters AND centimeters. Use a select to trim down the number of columns we’re looking at.

penguins.mutate(bill_length_cm=penguins.bill_length_mm / 10).select(
    "species",
    "island",
    "bill_depth_mm",
    "flipper_length_mm",
    "body_mass_g",
    "sex",
    "year",
    "bill_length_cm",
)
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ species  island     bill_depth_mm  flipper_length_mm  body_mass_g  sex     year   bill_length_cm ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ stringstringfloat64int64int64stringint64float64        │
├─────────┼───────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┼────────────────┤
│ Adelie Torgersen18.71813750male  20073.91 │
│ Adelie Torgersen17.41863800female20073.95 │
│ Adelie Torgersen18.01953250female20074.03 │
│ Adelie TorgersenNULLNULLNULLNULL2007NULL │
│ Adelie Torgersen19.31933450female20073.67 │
│ Adelie Torgersen20.61903650male  20073.93 │
│ Adelie Torgersen17.81813625female20073.89 │
│ Adelie Torgersen19.61954675male  20073.92 │
│ Adelie Torgersen18.11933475NULL20073.41 │
│ Adelie Torgersen20.21904250NULL20074.20 │
│  │
└─────────┴───────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┴────────────────┘

selectors

Typing out ALL of the column names except one is a little annoying. Instead of doing that again, we can use a selector to quickly select or deselect groups of columns.

import ibis.selectors as s

penguins.mutate(bill_length_cm=penguins.bill_length_mm / 10).select(
    ~s.matches("bill_length_mm")
    # match every column except `bill_length_mm`
)
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ species  island     bill_depth_mm  flipper_length_mm  body_mass_g  sex     year   bill_length_cm ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ stringstringfloat64int64int64stringint64float64        │
├─────────┼───────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┼────────────────┤
│ Adelie Torgersen18.71813750male  20073.91 │
│ Adelie Torgersen17.41863800female20073.95 │
│ Adelie Torgersen18.01953250female20074.03 │
│ Adelie TorgersenNULLNULLNULLNULL2007NULL │
│ Adelie Torgersen19.31933450female20073.67 │
│ Adelie Torgersen20.61903650male  20073.93 │
│ Adelie Torgersen17.81813625female20073.89 │
│ Adelie Torgersen19.61954675male  20073.92 │
│ Adelie Torgersen18.11933475NULL20073.41 │
│ Adelie Torgersen20.21904250NULL20074.20 │
│  │
└─────────┴───────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┴────────────────┘

You can also use a selector alongside a column name.

penguins.select("island", s.numeric())
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┓
┃ island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  year  ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━┩
│ stringfloat64float64int64int64int64 │
├───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼───────┤
│ Torgersen39.118.718137502007 │
│ Torgersen39.517.418638002007 │
│ Torgersen40.318.019532502007 │
│ TorgersenNULLNULLNULLNULL2007 │
│ Torgersen36.719.319334502007 │
│ Torgersen39.320.619036502007 │
│ Torgersen38.917.818136252007 │
│ Torgersen39.219.619546752007 │
│ Torgersen34.118.119334752007 │
│ Torgersen42.020.219042502007 │
│  │
└───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴───────┘

You can read more about selectors in the docs!

order_by

order_by arranges the values of one or more columns in ascending or descending order.

By default, ibis sorts in ascending order:

penguins.order_by(penguins.flipper_length_mm).select(
    "species", "island", "flipper_length_mm"
)
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ species    island     flipper_length_mm ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ stringstringint64             │
├───────────┼───────────┼───────────────────┤
│ Adelie   Biscoe   172 │
│ Adelie   Biscoe   174 │
│ Adelie   Torgersen176 │
│ Adelie   Dream    178 │
│ Adelie   Dream    178 │
│ Adelie   Dream    178 │
│ ChinstrapDream    178 │
│ Adelie   Dream    179 │
│ Adelie   Torgersen180 │
│ Adelie   Biscoe   180 │
│  │
└───────────┴───────────┴───────────────────┘

You can sort in descending order using the desc method of a column:

penguins.order_by(penguins.flipper_length_mm.desc()).select(
    "species", "island", "flipper_length_mm"
)
┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ species  island  flipper_length_mm ┃
┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ stringstringint64             │
├─────────┼────────┼───────────────────┤
│ Gentoo Biscoe231 │
│ Gentoo Biscoe230 │
│ Gentoo Biscoe230 │
│ Gentoo Biscoe230 │
│ Gentoo Biscoe230 │
│ Gentoo Biscoe230 │
│ Gentoo Biscoe230 │
│ Gentoo Biscoe230 │
│ Gentoo Biscoe229 │
│ Gentoo Biscoe229 │
│  │
└─────────┴────────┴───────────────────┘

Or you can use ibis.desc

penguins.order_by(ibis.desc("flipper_length_mm")).select(
    "species", "island", "flipper_length_mm"
)
┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ species  island  flipper_length_mm ┃
┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ stringstringint64             │
├─────────┼────────┼───────────────────┤
│ Gentoo Biscoe231 │
│ Gentoo Biscoe230 │
│ Gentoo Biscoe230 │
│ Gentoo Biscoe230 │
│ Gentoo Biscoe230 │
│ Gentoo Biscoe230 │
│ Gentoo Biscoe230 │
│ Gentoo Biscoe230 │
│ Gentoo Biscoe229 │
│ Gentoo Biscoe229 │
│  │
└─────────┴────────┴───────────────────┘

aggregates

Ibis has several aggregate functions available to help summarize data.

mean, max, min, count, sum (the list goes on).

To aggregate an entire column, call the corresponding method on that column.

penguins.flipper_length_mm.mean()

┌────────────┐
│ 200.915205 │
└────────────┘

You can compute multiple aggregates at once using the aggregate method:

penguins.aggregate([penguins.flipper_length_mm.mean(), penguins.bill_depth_mm.max()])
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Mean(flipper_length_mm)  Max(bill_depth_mm) ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ float64float64            │
├─────────────────────────┼────────────────────┤
│              200.91520521.5 │
└─────────────────────────┴────────────────────┘

But aggregate really shines when it’s paired with group_by.

group_by

group_by creates groupings of rows that have the same value for one or more columns.

But it doesn’t do much on its own – you can pair it with aggregate to get a result.

penguins.group_by("species").aggregate()
┏━━━━━━━━━━━┓
┃ species   ┃
┡━━━━━━━━━━━┩
│ string    │
├───────────┤
│ Chinstrap │
│ Gentoo    │
│ Adelie    │
└───────────┘

We grouped by the species column and handed it an “empty” aggregate command. The result of that is a column of the unique values in the species column.

If we add a second column to the group_by, we’ll get each unique pairing of the values in those columns.

penguins.group_by(["species", "island"]).aggregate()
┏━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ species    island    ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━┩
│ stringstring    │
├───────────┼───────────┤
│ Adelie   Torgersen │
│ Adelie   Dream     │
│ Gentoo   Biscoe    │
│ Adelie   Biscoe    │
│ ChinstrapDream     │
└───────────┴───────────┘

Now, if we add an aggregation function to that, we start to really open things up.

penguins.group_by(["species", "island"]).aggregate(penguins.bill_length_mm.mean())
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ species    island     Mean(bill_length_mm) ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│ stringstringfloat64              │
├───────────┼───────────┼──────────────────────┤
│ Gentoo   Biscoe   47.504878 │
│ Adelie   Biscoe   38.975000 │
│ ChinstrapDream    48.833824 │
│ Adelie   Torgersen38.950980 │
│ Adelie   Dream    38.501786 │
└───────────┴───────────┴──────────────────────┘

By adding that mean to the aggregate, we now have a concise way to calculate aggregates over each of the distinct groups in the group_by. And we can calculate as many aggregates as we need.

penguins.group_by(["species", "island"]).aggregate(
    [penguins.bill_length_mm.mean(), penguins.flipper_length_mm.max()]
)
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ species    island     Mean(bill_length_mm)  Max(flipper_length_mm) ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringstringfloat64int64                  │
├───────────┼───────────┼──────────────────────┼────────────────────────┤
│ Adelie   Torgersen38.950980210 │
│ Adelie   Dream    38.501786208 │
│ Adelie   Biscoe   38.975000203 │
│ ChinstrapDream    48.833824212 │
│ Gentoo   Biscoe   47.504878231 │
└───────────┴───────────┴──────────────────────┴────────────────────────┘

If we need more specific groups, we can add to the group_by.

penguins.group_by(["species", "island", "sex"]).aggregate(
    [penguins.bill_length_mm.mean(), penguins.flipper_length_mm.max()]
)
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ species    island     sex     Mean(bill_length_mm)  Max(flipper_length_mm) ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringstringstringfloat64int64                  │
├───────────┼───────────┼────────┼──────────────────────┼────────────────────────┤
│ Adelie   TorgersenNULL37.925000193 │
│ Adelie   Dream    NULL37.500000179 │
│ Gentoo   Biscoe   male  49.473770231 │
│ ChinstrapDream    male  51.094118212 │
│ Gentoo   Biscoe   female45.563793222 │
│ ChinstrapDream    female46.573529202 │
│ Adelie   Torgersenfemale37.554167196 │
│ Adelie   Biscoe   female37.359091199 │
│ Adelie   Dream    female36.911111202 │
│ Adelie   Torgersenmale  40.586957210 │
│  │
└───────────┴───────────┴────────┴──────────────────────┴────────────────────────┘

Chaining it all together

We’ve already chained some Ibis calls together. We used mutate to create a new column and then select to only view a subset of the new table. We were just chaining group_by with aggregate.

There’s nothing stopping us from putting all of these concepts together to ask questions of the data.

How about:

  • What was the largest female penguin (by body mass) on each island in the year 2008?
penguins.filter((penguins.sex == "female") & (penguins.year == 2008)).group_by(
    ["island"]
).aggregate(penguins.body_mass_g.max())
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ island     Max(body_mass_g) ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ stringint64            │
├───────────┼──────────────────┤
│ Torgersen3800 │
│ Dream    3900 │
│ Biscoe   5200 │
└───────────┴──────────────────┘
  • What about the largest male penguin (by body mass) on each island for each year of data collection?
penguins.filter(penguins.sex == "male").group_by(["island", "year"]).aggregate(
    penguins.body_mass_g.max().name("max_body_mass")
).order_by(["year", "max_body_mass"])
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ island     year   max_body_mass ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
│ stringint64int64         │
├───────────┼───────┼───────────────┤
│ Dream    20074650 │
│ Torgersen20074675 │
│ Biscoe   20076300 │
│ Torgersen20084700 │
│ Dream    20084800 │
│ Biscoe   20086000 │
│ Torgersen20094300 │
│ Dream    20094475 │
│ Biscoe   20096000 │
└───────────┴───────┴───────────────┘

Learn more

That’s all for this quick-start guide. If you want to learn more, check out the examples repository.

Back to top

Footnotes

  1. Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.↩︎