.. currentmodule:: ibis
.. _sql:
************************
Ibis for SQL Programmers
************************
Among other things, Ibis provides a full-featured replacement for SQL
``SELECT`` queries, but expressed with Python code that is:
* Type-checked and validated as you go. No more debugging cryptic database
errors; Ibis catches your mistakes right away.
* Easier to write. Pythonic function calls with tab completion in IPython.
* More composable. Break complex queries down into easier-to-digest pieces
* Easier to reuse. Mix and match Ibis snippets to create expressions tailored
for your analysis.
We intend for all ``SELECT`` queries to be fully portable to Ibis. Coverage of
other DDL statements (e.g. ``CREATE TABLE`` or ``INSERT``) may vary from
engine to engine.
This document will use the Impala SQL compiler (i.e. ``ibis.impala.compile``)
for convenience, but the code here is portable to whichever system you are
using Ibis with.
**Note**: If you find any SQL idioms or use cases in your work that are not
represented here, please reach out so we can add more to this guide!
.. ipython:: python
:suppress:
import ibis
ibis.options.sql.default_limit = None
Projections: select/add/remove columns
--------------------------------------
All tables in Ibis are immutable. To select a subset of a table's columns, or
to add new columns, you must produce a new table by means of a *projection*.
.. ipython:: python
t = ibis.table([('one', 'string'),
('two', 'double'),
('three', 'int32')], 'my_data')
t
In SQL, you might write something like:
.. code-block:: sql
SELECT two, one
FROM my_data
In Ibis, this is
.. ipython:: python
proj = t['two', 'one']
or
.. ipython:: python
proj = t.projection(['two', 'one'])
This generates the expected SQL:
.. ipython:: python
print(ibis.impala.compile(proj))
What about adding new columns? To form a valid projection, all column
expressions must be **named**. Let's look at the SQL:
.. code-block:: sql
SELECT two, one, three * 2 AS new_col
FROM my_data
The last expression is written:
.. ipython:: python
new_col = (t.three * 2).name('new_col')
Now, we have:
.. ipython:: python
proj = t['two', 'one', new_col]
print(ibis.impala.compile(proj))
``mutate``: Add or modify columns easily
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Since adding new columns or modifying existing columns is so common, there is a
convenience method ``mutate``:
.. ipython:: python
mutated = t.mutate(new_col=t.three * 2)
Notice that using the ``name`` was not necessary here because we're using
Python keywords to provide the name. Indeed:
.. ipython:: python
print(ibis.impala.compile(mutated))
If you modify an existing column with ``mutate`` it will list out all the other
columns:
.. ipython:: python
mutated = t.mutate(two=t.two * 2)
print(ibis.impala.compile(mutated))
``SELECT *`` equivalent
~~~~~~~~~~~~~~~~~~~~~~~
Especially in combination with relational joins, it's convenient to be able to
select all columns in a table using the ``SELECT *`` construct. To do this, use
the table expression itself in a projection:
.. ipython:: python
proj = t[t]
print(ibis.impala.compile(proj))
This is how ``mutate`` is implemented. The example above
``t.mutate(new_col=t.three * 2)`` can be written as a normal projection:
.. ipython:: python
proj = t[t, new_col]
print(ibis.impala.compile(proj))
Let's consider a table we might wish to join with ``t``:
.. ipython:: python
t2 = ibis.table([('key', 'string'),
('value', 'double')], 'dim_table')
Now let's take the SQL:
.. code-block:: sql
SELECT t0.*, t0.two - t1.value AS diff
FROM my_data t0
INNER JOIN dim_table t1
ON t0.one = t1.key
To write this with Ibis, it is:
.. ipython:: python
diff = (t.two - t2.value).name('diff')
joined = t.join(t2, t.one == t2.key)[t, diff]
And verify the generated SQL:
.. ipython:: python
print(ibis.impala.compile(joined))
Using functions in projections
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you pass a function instead of a string or Ibis expression in any projection
context, it will be invoked with the "parent" table as its argument. This can
help significantly when `composing complex operations
`_. Consider this SQL:
.. code-block:: sql
SELECT one, avg(abs(the_sum)) AS mad
FROM (
SELECT one, three, sum(two) AS the_sum
FROM my_data
GROUP BY 1, 2
) t0
GROUP BY 1
This can be written as one chained expression:
.. ipython:: python
expr = (t.group_by(['one', 'three'])
.aggregate(the_sum=t.two.sum())
.group_by('one')
.aggregate(mad=lambda x: x.the_sum.abs().mean()))
Indeed:
.. ipython:: python
print(ibis.impala.compile(expr))
Filtering / ``WHERE``
---------------------
You can add filter clauses to a table expression either by indexing with ``[]``
(like pandas) or use the ``filter`` method:
.. ipython:: python
filtered = t[t.two > 0]
print(ibis.impala.compile(filtered))
``filter`` can take a list of expressions, which must all be satisfied for a
row to be included in the result:
.. ipython:: python
filtered = t.filter([t.two > 0,
t.one.isin(['A', 'B'])])
print(ibis.impala.compile(filtered))
To compose boolean expressions with ``AND`` or ``OR``, use the respective ``&``
and ``|`` operators:
.. ipython:: python
cond = (t.two < 0) | ((t.two > 0) | t.one.isin(['A', 'B']))
filtered = t[cond]
print(ibis.impala.compile(filtered))
Aggregation / ``GROUP BY``
--------------------------
To aggregate a table, you need:
* Zero or more grouping expressions (these can be column names)
* One or more aggregation expressions
Let's look at the ``aggregate`` method on tables:
.. ipython:: python
stats = [t.two.sum().name('total_two'),
t.three.mean().name('avg_three')]
agged = t.aggregate(stats)
If you don't use any group expressions, the result will have a single row with
your statistics of interest:
.. ipython:: python
agged.schema()
print(ibis.impala.compile(agged))
To add groupings, use either the ``by`` argument of ``aggregate`` or use the
``group_by`` construct:
.. ipython:: python
agged2 = t.aggregate(stats, by='one')
agged3 = t.group_by('one').aggregate(stats)
print(ibis.impala.compile(agged3))
Non-trivial grouping keys
~~~~~~~~~~~~~~~~~~~~~~~~~
You can use any expression (or function, like in projections) deriving from the
table you are aggregating. The only constraint is that the expressions must be
named. Let's look at an example:
.. ipython:: python
events = ibis.table([('ts', 'timestamp'),
('event_type', 'int32'),
('session_id', 'int64')],
'web_events')
Suppose we wanted to total up event types by year and month:
.. ipython:: python
keys = [events.ts.year().name('year'),
events.ts.month().name('month')]
sessions = events.session_id.nunique()
stats = (events.group_by(keys)
.aggregate(total=events.count(),
sessions=sessions))
Now we have:
.. ipython:: python
print(ibis.impala.compile(stats))
.. _sql.aggregate.subsets:
Aggregates considering table subsets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In analytics is it common to compare statistics from different subsets of a
table. Let's consider a dataset containing people's name, age, gender, and
nationality:
.. ipython:: python
pop = ibis.table([('name', 'string'),
('country', 'string'),
('gender', 'string'),
('age', 'int16')], 'population')
Now, suppose you wanted to know for each country:
* Average overall age
* Average male age
* Average female age
* Total number of persons
In SQL, you may write:
.. code-block:: sql
SELECT country,
count(*) AS num_persons,
AVG(age) AS avg_age
AVG(CASE WHEN gender = 'M'
THEN age
ELSE NULL
END) AS avg_male,
AVG(CASE WHEN gender = 'F'
THEN age
ELSE NULL
END) AS avg_female,
FROM population
GROUP BY 1
Ibis makes this much simpler by giving you ``where`` option in aggregation
functions:
.. ipython:: python
expr = pop.group_by('country').aggregate(
num_persons=pop.count(),
avg_age=pop.age.mean(),
avg_male=pop.age.mean(where=pop.gender == 'M'),
avg_female=pop.age.mean(where=pop.gender == 'F')
)
This indeed generates the correct SQL. Note that SQL engines handle ``NULL``
values differently in aggregation functions, but Ibis will write the SQL
expression that is correct for your query engine.
.. ipython:: python
print(ibis.impala.compile(expr))
``count(*)`` convenience: ``size()``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Computing group frequencies is so common that, like pandas, we have a method
``size`` that is a shortcut for the ``count(*)`` idiom:
.. ipython:: python
freqs = events.group_by(keys).size()
print(ibis.impala.compile(freqs))
Frequency table convenience: ``value_counts``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Consider the SQL idiom:
.. code-block:: sql
SELECT some_column_expression, count(*)
FROM table
GROUP BY 1
This is so common that, like pandas, there is a generic array method
``value_counts`` which does this for us:
.. ipython:: python
expr = events.ts.year().value_counts()
print(ibis.impala.compile(expr))
``HAVING`` clause
~~~~~~~~~~~~~~~~~
The SQL ``HAVING`` clause enables you to filter the results of an aggregation
based on some group-wise condition holding true. For example, suppose we wanted
to limit our analysis to groups containing at least 1000 observations:
.. code-block:: sql
SELECT one, sum(two) AS total
FROM my_data
GROUP BY 1
HAVING count(*) >= 1000
With Ibis, you can do:
.. ipython:: python
expr = (t.group_by('one')
.having(t.count() >= 1000)
.aggregate(t.two.sum().name('total')))
print(ibis.impala.compile(expr))
Sorting / ``ORDER BY``
----------------------
To sort a table, use the ``sort_by`` method along with either column names or
expressions that indicate the sorting keys:
.. ipython:: python
sorted = events.sort_by([events.ts.year(),
events.ts.month()])
print(ibis.impala.compile(sorted))
The default for sorting is in ascending order. To reverse the sort direction of
any key, either wrap it in ``ibis.desc`` or pass a tuple with ``False`` as the
second value:
.. ipython:: python
sorted = (events.sort_by([ibis.desc('event_type'),
(events.ts.month(), False)])
.limit(100))
print(ibis.impala.compile(sorted))
``LIMIT`` and ``OFFSET``
------------------------
This one is easy. The table ``limit`` function truncates a table to the
indicates number of rows. So if you only want the first 1000 rows (which may
not be deterministic depending on the SQL engine), you can do:
.. ipython:: python
limited = t.limit(1000)
print(ibis.impala.compile(limited))
The ``offset`` option in ``limit`` skips rows. So if you wanted rows 11 through
20, you could do:
.. ipython:: python
limited = t.limit(10, offset=10)
print(ibis.impala.compile(limited))
Common column expressions
-------------------------
See the full :ref:`API documentation ` for all of the available value
methods and tools for creating value expressions. We mention a few common ones
here as they relate to common SQL queries.
Type casts
~~~~~~~~~~
Ibis's type system is independent of any SQL system. You cast Ibis expressions
from one Ibis type to another. For example:
.. ipython:: python
expr = t.mutate(date=t.one.cast('timestamp'),
four=t.three.cast('double'))
print(ibis.impala.compile(expr))
``CASE`` statements
~~~~~~~~~~~~~~~~~~~
SQL dialects typically support one or more kind of ``CASE`` statements. The
first is the *simple case* that compares against exact values of an expression.
.. code-block:: sql
CASE expr
WHEN value_1 THEN result_1
WHEN value_2 THEN result_2
ELSE default
END
Value expressions in Ibis have a ``case`` method that allows us to emulate
these semantics:
.. ipython:: python
case = (t.one.cast('timestamp')
.year()
.case()
.when(2015, 'This year')
.when(2014, 'Last year')
.else_('Earlier')
.end())
expr = t.mutate(year_group=case)
print(ibis.impala.compile(expr))
The more general case is that of an arbitrary list of boolean expressions and
result values:
.. code-block:: sql
CASE
WHEN boolean_expr1 THEN result_1
WHEN boolean_expr2 THEN result_2
WHEN boolean_expr3 THEN result_3
ELSE default
END
To do this, use ``ibis.case``:
.. ipython:: python
case = (ibis.case()
.when(t.two < 0, t.three * 2)
.when(t.two > 1, t.three)
.else_(t.two)
.end())
expr = t.mutate(cond_value=case)
print(ibis.impala.compile(expr))
There are several places where Ibis builds cases for you in a simplified
way. One example is the ``ifelse`` function:
.. ipython:: python
switch = (t.two < 0).ifelse('Negative', 'Non-Negative')
expr = t.mutate(group=switch)
print(ibis.impala.compile(expr))
Using ``NULL`` in expressions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To use ``NULL`` in an expression, either use the special ``ibis.NA`` value or
``ibis.null()``:
.. ipython:: python
pos_two = (t.two > 0).ifelse(t.two, ibis.NA)
expr = t.mutate(two_positive=pos_two)
print(ibis.impala.compile(expr))
Set membership: ``IN`` / ``NOT IN``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Let's look again at the population dataset. Suppose you wanted to combine the
United States and Canada data into a "North America" category. Here would be
some SQL to do it:
.. code-block:: sql
CASE
WHEN upper(country) IN ('UNITED STATES', 'CANADA')
THEN 'North America'
ELSE country
END AS refined_group
The Ibis equivalent of ``IN`` is the ``isin`` method. So we can write:
.. ipython:: python
refined = (pop.country.upper()
.isin(['UNITED STATES', 'CANADA'])
.ifelse('North America', pop.country))
expr = pop.mutate(refined_group=refined)
print(ibis.impala.compile(expr))
The opposite of ``isin`` is ``notin``.
Constant and literal expressions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Consider a SQL expression like:
.. code-block:: sql
'foo' IN (column1, column2)
which is equivalent to
.. code-block:: sql
column1 = 'foo' OR column2 = 'foo'
To build expressions off constant values, you must first convert the value
(whether a Python string or number) to an Ibis expression using
``ibis.literal``:
.. ipython:: python
t3 = ibis.table([('column1', 'string'),
('column2', 'string'),
('column3', 'double')], 'data')
value = ibis.literal('foo')
Once you've done this, you can use the literal expression like any other array
or scalar expression:
.. ipython:: python
has_foo = value.isin([t3.column1, t3.column2])
expr = t3.mutate(has_foo=has_foo)
print(ibis.impala.compile(expr))
In many other situations, you can use constants without having to use
``ibis.literal``. For example, we could add a column containing nothing but the
number 5 like so:
.. ipython:: python
expr = t3.mutate(number5=5)
print(ibis.impala.compile(expr))
``IS NULL`` and ``IS NOT NULL``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
These are simple: use the ``isnull`` and ``notnull`` functions respectively,
which yield boolean arrays:
.. ipython:: python
indic = t.two.isnull().ifelse('valid', 'invalid')
expr = t.mutate(is_valid=indic)
print(ibis.impala.compile(expr))
agged = (expr
[expr.one.notnull()]
.group_by('is_valid')
.aggregate(three_count=lambda t: t.three.notnull().sum()))
print(ibis.impala.compile(agged))
``BETWEEN``
~~~~~~~~~~~
The ``between`` method on arrays and scalars compiles to the SQL ``BETWEEN``
keyword. The result of ``between`` is boolean and can be used with any other
boolean expression:
.. ipython:: python
expr = t[t.two.between(10, 50) & t.one.notnull()]
print(ibis.impala.compile(expr))
Joins
-----
Ibis supports several kinds of joins between table expressions:
* ``inner_join``: maps to SQL ``INNER JOIN``
* ``cross_join``: a cartesian product join with no keys. Equivalent to
``inner_join`` with no join predicates
* ``left_join``: maps to SQL ``LEFT OUTER JOIN``
* ``outer_join``: maps to SQL ``FULL OUTER JOIN``
* ``semi_join``: maps to SQL ``LEFT SEMI JOIN``. May or may not be an explicit
join type in your query engine.
* ``anti_join``: maps to SQL ``LEFT ANTI JOIN``. May or may not be an explicit
join type in your query engine.
The ``join`` table method is by default the same as ``inner_join``.
Let's look at a couple example tables to see how joins work in Ibis:
.. ipython:: python
t1 = ibis.table([('value1', 'double'),
('key1', 'string'),
('key2', 'string')], 'table1')
t2 = ibis.table([('value2', 'double'),
('key3', 'string'),
('key4', 'string')], 'table2')
Let's join on one key:
.. ipython:: python
joined = t1.left_join(t2, t1.key1 == t2.key3)
The immediate result of a join does not yet have a set schema. That is
determined by the next action that you take. There's several ways forward from
here that address the spectrum of SQL use cases.
Join + projection
~~~~~~~~~~~~~~~~~
Consider the SQL:
.. code-block:: sql
SELECT t0.*, t1.value2
FROM table1 t0
LEFT OUTER JOIN table2 t1
ON t0.key1 = t1.key3
After one or more joins, you can reference any of the joined tables in a
projection immediately after:
.. ipython:: python
expr = joined[t1, t2.value2]
print(ibis.impala.compile(expr))
If you need to compute an expression that involves both tables, you can do that
also:
.. ipython:: python
expr = joined[t1.key1, (t1.value1 - t2.value2).name('diff')]
print(ibis.impala.compile(expr))
Join + aggregation
~~~~~~~~~~~~~~~~~~
You can directly aggregate a join without need for projection, which also
allows you to form statistics that reference any of the joined tables.
Consider this SQL:
.. code-block:: sql
SELECT t0.key1, avg(t0.value1 - t1.value2) AS avg_diff
FROM table1 t0
LEFT OUTER JOIN table2 t1
ON t0.key1 = t1.key3
GROUP BY 1
As you would hope, the code is as follows:
.. ipython:: python
avg_diff = (t1.value1 - t2.value2).mean()
expr = (t1.left_join(t2, t1.key1 == t2.key3)
.group_by(t1.key1)
.aggregate(avg_diff=avg_diff))
print(ibis.impala.compile(expr))
Join with ``SELECT *``
~~~~~~~~~~~~~~~~~~~~~~
If you try to compile or execute a join that has not been projected or
aggregated, it will be *fully materialized*:
.. ipython:: python
joined = t1.left_join(t2, t1.key1 == t2.key3)
print(ibis.impala.compile(joined))
You can do this explicitly using the ``materialize`` method:
.. ipython:: python
expr = joined.materialize()
print(ibis.impala.compile(expr))
Multiple joins
~~~~~~~~~~~~~~
You can join multiple tables together in succession without needing to address
any of the above concerns.
.. ipython:: python
t3 = ibis.table([('value3', 'double'),
('key5', 'string')], 'table3')
total = (t1.value1 + t2.value2 + t3.value3).sum()
expr = (t1.join(t2, [t1.key1 == t2.key3,
t1.key2 == t2.key4])
.join(t3, t1.key1 == t3.key5)
.group_by([t2.key4, t3.key5])
.aggregate(total=total))
print(ibis.impala.compile(expr))
Self joins
~~~~~~~~~~
What about when you need to join a table on itself? For example:
.. code-block:: sql
SELECT t0.one, avg(t0.two - t1.three) AS metric
FROM my_data t0
INNER JOIN my_data t1
ON t0.one = t1.one
GROUP BY 1
The table ``view`` method enables you to form a *self-reference* that is
referentially distinct in expressions. Now you can proceed normally:
.. ipython:: python
t_view = t.view()
stat = (t.two - t_view.three).mean()
expr = (t.join(t_view, t.three.cast('string') == t_view.one)
.group_by(t.one)
.aggregate(metric=stat))
print(ibis.impala.compile(expr))
Overlapping join keys
~~~~~~~~~~~~~~~~~~~~~
In many cases the columns being joined between two tables or table expressions
have the same name. Consider this example:
.. ipython:: python
t4 = ibis.table([('key1', 'string'),
('key2', 'string'),
('key3', 'string'),
('value1', 'double')], 'table4')
t5 = ibis.table([('key1', 'string'),
('key2', 'string'),
('key3', 'string'),
('value2', 'double')], 'table5')
In these case, we can specify a list of common join keys:
.. ipython:: python
joined = t4.join(t5, ['key1', 'key2', 'key3'])
expr = joined[t4, t5.value2]
print(ibis.impala.compile(expr))
You can mix the overlapping key names with other expressions:
.. ipython:: python
joined = t4.join(t5, ['key1', 'key2',
t4.key3.left(4) == t4.key3.left(4)])
expr = joined[t4, t5.value2]
print(ibis.impala.compile(expr))
Non-equality join predicates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can join tables with boolean clauses that are not equality. Some query
engines support these efficiently, some inefficiently, or some not at all. In
the latter case, these conditions get moved by Ibis into the ``WHERE`` part of
the ``SELECT`` query.
.. ipython:: python
expr = (t1.join(t2, t1.value1 < t2.value2)
.group_by(t1.key1)
.size())
print(ibis.impala.compile(expr))
Other ways to specify join keys
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can also pass a list of column names instead of forming boolean expressions:
.. ipython:: python
joined = t1.join(t2, [('key1', 'key3'),
('key2', 'key4')])
Subqueries
----------
Ibis creates inline views and nested subqueries automatically. This section
concerns more complex subqueries involving foreign references and other
advanced relational algebra.
Correlated ``EXISTS`` / ``NOT EXISTS`` filters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The SQL ``EXISTS`` and ``NOT EXISTS`` constructs are typically used for
efficient filtering in large many-to-many relationships.
Let's consider a web dataset involving website session / usage data and
purchases:
.. ipython:: python
events = ibis.table([('session_id', 'int64'),
('user_id', 'int64'),
('event_type', 'int32'),
('ts', 'timestamp')], 'events')
purchases = ibis.table([('item_id', 'int64'),
('user_id', 'int64'),
('price', 'double'),
('ts', 'timestamp')], 'purchases')
Now, the key ``user_id`` appears with high frequency in both tables. But let's
say you want to limit your analysis of the ``events`` table to only sessions by
users who have made a purchase.
In SQL, you can do this using the somewhat esoteric ``EXISTS`` construct:
.. code-block:: sql
SELECT t0.*
FROM events t0
WHERE EXISTS (
SELECT 1
FROM purchases t1
WHERE t0.user_id = t1.user_id
)
To describe this operation in Ibis, you compare the ``user_id`` columns and use
the ``any`` reduction:
.. ipython:: python
cond = (events.user_id == purchases.user_id).any()
This can now be used to filter ``events``:
.. ipython:: python
expr = events[cond]
print(ibis.impala.compile(expr))
If you negate the condition, it will instead give you only event data from user
*that have not made a purchase*:
.. ipython:: python
expr = events[-cond]
print(ibis.impala.compile(expr))
Subqueries with ``IN`` / ``NOT IN``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Subquery filters with ``IN`` (and ``NOT IN``) are functionally similar to
``EXISTS`` subqueries. Let's look at some SQL:
.. code-block:: sql
SELECT *
FROM events
WHERE user_id IN (
SELECT user_id
FROM purchases
)
This is almost semantically the same as the ``EXISTS`` example. Indeed, you can
write with Ibis:
.. ipython:: python
cond = events.user_id.isin(purchases.user_id)
expr = events[cond]
print(ibis.impala.compile(expr))
Depending on the query engine, the query planner/optimizer will often rewrite
``IN`` or ``EXISTS`` subqueries into the same set of relational algebra
operations.
Comparison with scalar aggregates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sometime you want to compare a value with an unconditional aggregate value from
a different table. Take the SQL:
.. code-block:: sql
SELECT *
FROM table1
WHERE value1 > (
SELECT max(value2)
FROM table2
)
With Ibis, the code is simpler and more pandas-like:
.. ipython:: python
expr = t1[t1.value1 > t2.value2.max()]
print(ibis.impala.compile(expr))
Conditional aggregates
~~~~~~~~~~~~~~~~~~~~~~
Suppose you want to compare a value with the aggregate value for some common
group values between two tables. Here's some SQL:
.. code-block:: sql
SELECT *
FROM table1 t0
WHERE value1 > (
SELECT avg(value2)
FROM table2 t1
WHERE t0.key1 = t1.key3
)
This query computes the average for each distinct value of ``key3`` and uses
the corresponding average for the comparison, rather than the whole-table
average as above.
With Ibis, the code is similar, but you add the correlated filter to the
average statistic:
.. ipython:: python
stat = t2[t1.key1 == t2.key3].value2.mean()
expr = t1[t1.value1 > stat]
print(ibis.impala.compile(expr))
``DISTINCT`` expressions
------------------------
In SQL, the ``DISTINCT`` keyword is used in a couple of ways:
* Deduplicating identical rows in some ``SELECT`` statement
* Aggregating on the distinct values of some column expression
Ibis supports both use cases. So let's have a look. The first case is the
simplest: call ``distinct`` on a table expression. First, here's the SQL:
.. code-block:: sql
SELECT DISTINCT *
FROM table1
And the Ibis Python code:
.. ipython:: python
expr = t1.distinct()
print(ibis.impala.compile(expr))
For distinct aggregates, the most common case is ``COUNT(DISTINCT ...)``, which
computes the number of unique values in an expression. So if we're looking at
the ``events`` table, let's compute the number of distinct ``event_type``
values for each ``user_id``. First, the SQL:
.. code-block:: sql
SELECT user_id, COUNT(DISTINCT event_type) AS unique_events
FROM events
GROUP BY 1
In Ibis this is:
.. ipython:: python
metric = events.event_type.nunique()
expr = (events.group_by('user_id')
.aggregate(unique_events=metric))
print(ibis.impala.compile(expr))
Window functions
----------------
Window functions in SQL allow you to write expressions that involve
possibly-ordered groups of a table. Each window function involves one of the
following:
* An analytic function. Most aggregate functions are valid analytic functions,
and there are additional ones such as ``LEAD``, ``LAG``, ``NTILE``, and
others.
* A ``PARTITION BY`` clause. This may be omitted.
* An ``ORDER BY`` clause. This may be omitted for many functions.
* A window frame clause. The default is to use the entire partition.
So you may see SQL like:
.. code-block:: sql
AVG(value) OVER (PARTITION BY key1)
Or simply
.. code-block:: sql
AVG(value) OVER ()
Ibis will automatically write window clauses when you use aggregate functions
in a non-aggregate context. Suppose you wanted to subtract the mean of a column
from itself:
.. ipython:: python
expr = t.mutate(two_demean=t.two - t.two.mean())
print(ibis.impala.compile(expr))
If you use ``mutate`` in conjunction with ``group_by``, it will add a
``PARTITION BY`` to the ``OVER`` specification:
.. ipython:: python
expr = (t.group_by('one')
.mutate(two_demean=t.two - t.two.mean()))
print(ibis.impala.compile(expr))
For functions like ``LAG`` that require an ordering, we can add an ``order_by``
call:
.. ipython:: python
expr = (t.group_by('one')
.order_by(t.two)
.mutate(two_first_diff=t.two - t.two.lag()))
print(ibis.impala.compile(expr))
For more precision, you can create a ``Window`` object that also includes a
window frame clause:
.. ipython:: python
w = ibis.window(group_by='one', preceding=5, following=5)
expr = t.mutate(group_demeaned=t.two - t.two.mean().over(w))
print(ibis.impala.compile(expr))
Top-K operations
----------------
A common SQL idiom is the "top-K" or "top-N" operation: subsetting a dimension
by aggregate statistics:
.. code-block:: sql
SELECT key1, count(*) AS `count`
FROM table1
GROUP BY 1
ORDER BY `count` DESC
LIMIT 10
Ibis has a special analytic expression ``topk``:
.. ipython:: python
expr = t1.key1.topk(10)
This can be evaluated directly, yielding the above query:
.. ipython:: python
print(ibis.impala.compile(expr))
You can also use ``expr`` as a filter:
.. ipython:: python
expr2 = t1[expr]
print(ibis.impala.compile(expr2))
Date / time data
----------------
See :ref:`Timestamp methods ` for a table of available date/time
methods.
For example, we can do:
.. ipython:: python
expr = events.mutate(year=events.ts.year(),
month=events.ts.month())
print(ibis.impala.compile(expr))
Casting to date / time types
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In many cases, you can convert string values to datetime / timestamp with
``strings.cast('timestamp')``, but you may have to do some more reconnaissance
into the data if this does not work.
Timedeltas
~~~~~~~~~~
Ibis has a set of *timedelta* object that allow you to do date/time
arithmetic. For example:
.. ipython:: python
expr = events[events.ts > (ibis.now() - ibis.interval(years=1))]
print(ibis.impala.compile(expr))
The implementation of each timedelta offset will depend on the query engine.
Buckets and histograms
----------------------
To appear.
Unions
------
SQL dialects often support two kinds of ``UNION`` operations:
* ``UNION``: the combination of *distinct* rows from each table.
* ``UNION ALL``: the combination of all rows from each table, whether or not
they are distinct.
The Ibis ``union`` function by distinct is a ``UNION ALL``, and you can set
``distinct=True`` to get the normal ``UNION`` behavior:
.. ipython:: python
expr1 = t1.limit(10)
expr2 = t1.limit(10, offset=10)
expr = expr1.union(expr2)
print(ibis.impala.compile(expr))
Esoterica
---------
This area will be the spillover for miscellaneous SQL concepts and how queries
featuring them can be ported to Ibis.
Common table expressions (CTEs)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The simplest SQL CTE is a SQL statement that is used multiple times in a
``SELECT`` query, which can be "factored" out using the ``WITH`` keyword:
.. code-block:: sql
WITH t0 AS (
SELECT region, kind, sum(amount) AS total
FROM purchases
GROUP BY 1, 2
)
SELECT t0.region, t0.total - t1.total
FROM t0
INNER JOIN t0 t1
ON t0.region = t1.region
WHERE t0.kind = 'foo' AND t1.kind = 'bar'
Explicit CTEs are not necessary with Ibis. Let's look at an example involving
joining an aggregated table on itself after filtering:
.. ipython:: python
purchases = ibis.table([('region', 'string'),
('kind', 'string'),
('user', 'int64'),
('amount', 'double')], 'purchases')
metric = purchases.amount.sum().name('total')
agged = (purchases.group_by(['region', 'kind'])
.aggregate(metric))
left = agged[agged.kind == 'foo']
right = agged[agged.kind == 'bar']
result = (left.join(right, left.region == right.region)
[left.region,
(left.total - right.total).name('diff')])
Ibis automatically creates a CTE for ``agged``:
.. ipython:: python
print(ibis.impala.compile(result))