{
“cells”: [
{

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“# Advanced Topics: Additional Filteringn”, “n”, “The filtering examples we’ve shown to this point have been pretty simple, either comparisons between columns or fixed values, or set filter functions like isin and notin. n”, “n”, “Ibis supports a number of richer analytical filters that can involve one or more of:n”, “n”, “- Aggregates computed from the same or other tablesn”, “- Conditional aggregates (in SQL-speak these are similar to "correlated subqueries")n”, “- "Existence" set filters (equivalent to the SQL EXISTS and NOT EXISTS keywords)”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“## Setup”

]

}, {

“cell_type”: “code”, “execution_count”: null, “metadata”: {}, “outputs”: [], “source”: [

“import ibisn”, “import osn”, “hdfs_port = os.environ.get(‘IBIS_WEBHDFS_PORT’, 50070)n”, “hdfs = ibis.impala.hdfs_connect(host=’impala’, port=hdfs_port)n”, “con = ibis.impala.connect(host=’impala’, database=’ibis_testing’,n”, ” hdfs_client=hdfs)n”, “ibis.options.interactive = True”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“## Using scalar aggregates in filters”

]

}, {

“cell_type”: “code”, “execution_count”: null, “metadata”: {}, “outputs”: [], “source”: [

“table = con.table(‘functional_alltypes’)n”, “table.limit(5)”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“We could always compute some aggregate value from the table and use that in another expression, or we can use a data-derived aggregate in the filter. Take the average of a column for example:”

]

}, {

“cell_type”: “code”, “execution_count”: null, “metadata”: {}, “outputs”: [], “source”: [

“table.double_col.mean()”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“You can use this expression as a substitute for a scalar value in a filter, and the execution engine will combine everything into a single query rather than having to access Impala multiple times:”

]

}, {

“cell_type”: “code”, “execution_count”: null, “metadata”: {}, “outputs”: [], “source”: [

“cond = table.bigint_col > table.double_col.mean()n”, “expr = table[cond & table.bool_col].limit(5)n”, “expr”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“## Conditional aggregatesn”, “n”, “n”, “Suppose that we wish to filter using an aggregate computed conditional on some other expressions holding true. Using the TPC-H datasets, suppose that we want to filter customers based on the following criteria: Orders such that their amount exceeds the average amount for their sales region over the whole dataset. This can be computed any numbers of ways (such as joining auxiliary tables and filtering post-join)n”, “n”, “Again, from prior examples, here are the joined up tables with all the customer data:”

]

}, {

“cell_type”: “code”, “execution_count”: null, “metadata”: {}, “outputs”: [], “source”: [

“region = con.table(‘tpch_region’)n”, “nation = con.table(‘tpch_nation’)n”, “customer = con.table(‘tpch_customer’)n”, “orders = con.table(‘tpch_orders’)n”, “n”, “fields_of_interest = [customer,n”, ” region.r_name.name(‘region’), n”, ” orders.o_totalprice,n”, ” orders.o_orderdate.cast(‘timestamp’).name(‘odate’)]n”, “n”, “tpch = (region.join(nation, region.r_regionkey == nation.n_regionkey)n”, ” .join(customer, customer.c_nationkey == nation.n_nationkey)n”, ” .join(orders, orders.o_custkey == customer.c_custkey)n”, ” [fields_of_interest])n”, “n”, “tpch.limit(5)”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“In this particular case, filtering based on the conditional average o_totalprice by region requires creating a table view (similar to the self-join examples from earlier) that can be treated as a distinct table entity in the expression. This would not be required if we were computing a conditional statistic from some other table. So this is a little more complicated than some other cases would be:”

]

}, {

“cell_type”: “code”, “execution_count”: null, “metadata”: {}, “outputs”: [], “source”: [

“t2 = tpch.view()n”, “conditional_avg = t2[(t2.region == tpch.region)].o_totalprice.mean()”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“Once you’ve done this, you can use the conditional average in a filter expression”

]

}, {

“cell_type”: “code”, “execution_count”: null, “metadata”: {

“scrolled”: true

}, “outputs”: [], “source”: [

“amount_filter = tpch.o_totalprice > conditional_avgn”, “tpch[amount_filter].limit(10)”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“By looking at the table sizes before and after applying the filter you can see the relative size of the subset taken. “

]

}, {

“cell_type”: “code”, “execution_count”: null, “metadata”: {}, “outputs”: [], “source”: [

“tpch.count()”

]

}, {

“cell_type”: “code”, “execution_count”: null, “metadata”: {}, “outputs”: [], “source”: [

“tpch[amount_filter].count()”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“Or even group by year and compare before and after:”

]

}, {

“cell_type”: “code”, “execution_count”: null, “metadata”: {}, “outputs”: [], “source”: [

“tpch.schema()”

]

}, {

“cell_type”: “code”, “execution_count”: null, “metadata”: {}, “outputs”: [], “source”: [

“year = tpch.odate.year().name(‘year’)n”, “n”, “pre_sizes = tpch.group_by(year).size()n”, “post_sizes = tpch[amount_filter].group_by(year).size().view()n”, “n”, “percent = ((post_sizes[‘count’] / pre_sizes[‘count’].cast(‘double’))n”, ” .name(‘fraction’))n”, “n”, “expr = (pre_sizes.join(post_sizes, pre_sizes.year == post_sizes.year)n”, ” [pre_sizes.year, n”, ” pre_sizes[‘count’].name(‘pre_count’),n”, ” post_sizes[‘count’].name(‘post_count’),n”, ” percent])n”, “expr”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“## "Existence" filtersn”, “n”, “n”, “Some filtering involves checking for the existence of a particular value in a column of another table, or amount the results of some value expression. This is common in many-to-many relationships, and can be performed in numerous different ways, but it’s nice to be able to express it with a single concise statement and let Ibis compute it optimally.n”, “n”, “Here’s some examples:n”, “n”, “- Filter down to customers having at least one open ordern”, “- Find customers having no open orders with 1-URGENT statusn”, “- Find stores (in the stores table) having the same name as a vendor (in the vendors table).n”, “n”, “We’ll go ahead and solve the first couple of these problems using the TPC-H tables to illustrate the API:”

]

}, {

“cell_type”: “code”, “execution_count”: null, “metadata”: {}, “outputs”: [], “source”: [

“customer = con.table(‘tpch_customer’)n”, “orders = con.table(‘tpch_orders’)”

]

}, {

“cell_type”: “code”, “execution_count”: null, “metadata”: {}, “outputs”: [], “source”: [

“orders.limit(5)”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“We introduce the any reduction:”

]

}, {

“cell_type”: “code”, “execution_count”: null, “metadata”: {}, “outputs”: [], “source”: [

“has_open_orders = ((orders.o_orderstatus == ‘O’) & n”, ” (customer.c_custkey == orders.o_custkey)).any()”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“This is now a valid filter:”

]

}, {

“cell_type”: “code”, “execution_count”: null, “metadata”: {}, “outputs”: [], “source”: [

“customer[has_open_orders].limit(10)”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“For the second example, in which we want to find customers not having any open urgent orders, we write down the condition that they _do_ have some first:”

]

}, {

“cell_type”: “code”, “execution_count”: null, “metadata”: {}, “outputs”: [], “source”: [

“has_open_urgent_orders = ((orders.o_orderstatus == ‘O’) & n”, ” (orders.o_orderpriority == ‘1-URGENT’) & n”, ” (customer.c_custkey == orders.o_custkey)).any()”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“Now, we can negate this condition and use it as a filter:”

]

}, {

“cell_type”: “code”, “execution_count”: null, “metadata”: {}, “outputs”: [], “source”: [

“customer[-has_open_urgent_orders].count()”

]

}, {

“cell_type”: “markdown”, “metadata”: {}, “source”: [

“In this case, it is true that customer.c_custkey has no duplicate values, but that need not be the case. There could be multiple copies of any given value in either table column being compared, and the behavior will be the same (existence or non-existence is verified).”

]

}

], “metadata”: {

“kernelspec”: {

“display_name”: “Python 3”, “language”: “python”, “name”: “python3”

}, “language_info”: {

“codemirror_mode”: {

“name”: “ipython”, “version”: 3

}, “file_extension”: “.py”, “mimetype”: “text/x-python”, “name”: “python”, “nbconvert_exporter”: “python”, “pygments_lexer”: “ipython3”, “version”: “3.6.3”

}

}, “nbformat”: 4, “nbformat_minor”: 1

}