Advanced Topics: Additional Filtering

The filtering examples we’ve shown to this point have been pretty simple, either comparisons between columns or fixed values, or set filter functions like isin and notin.

Ibis supports a number of richer analytical filters that can involve one or more of:

  • Aggregates computed from the same or other tables

  • Conditional aggregates (in SQL-speak these are similar to “correlated subqueries”)

  • “Existence” set filters (equivalent to the SQL EXISTS and NOT EXISTS keywords)

Setup

[1]:
import ibis
import os
hdfs_port = os.environ.get('IBIS_WEBHDFS_PORT', 50070)
hdfs = ibis.hdfs_connect(host='impala', port=hdfs_port)
con = ibis.impala.connect(host='impala', database='ibis_testing',
                          hdfs_client=hdfs)
ibis.options.interactive = True

Using scalar aggregates in filters

[2]:
table = con.table('functional_alltypes')
table.limit(5)
[2]:
     id  bool_col  tinyint_col  smallint_col  int_col  bigint_col  float_col  \
0  5770      True            0             0        0           0        0.0
1  5771     False            1             1        1          10        1.1
2  5772      True            2             2        2          20        2.2
3  5773     False            3             3        3          30        3.3
4  5774      True            4             4        4          40        4.4

   double_col date_string_col string_col           timestamp_col  year  month
0         0.0        08/01/10          0 2010-08-01 00:00:00.000  2010      8
1        10.1        08/01/10          1 2010-08-01 00:01:00.000  2010      8
2        20.2        08/01/10          2 2010-08-01 00:02:00.100  2010      8
3        30.3        08/01/10          3 2010-08-01 00:03:00.300  2010      8
4        40.4        08/01/10          4 2010-08-01 00:04:00.600  2010      8

We could always compute some aggregate value from the table and use that in another expression, or we can use a data-derived aggregate in the filter. Take the average of a column for example:

[3]:
table.double_col.mean()
[3]:
45.45000000000001

You can use this expression as a substitute for a scalar value in a filter, and the execution engine will combine everything into a single query rather than having to access Impala multiple times:

[4]:
cond = table.bigint_col > table.double_col.mean()
expr = table[cond & table.bool_col].limit(5)
expr
[4]:
     id  bool_col  tinyint_col  smallint_col  int_col  bigint_col  float_col  \
0  5776      True            6             6        6          60        6.6
1  5778      True            8             8        8          80        8.8
2  5786      True            6             6        6          60        6.6
3  5788      True            8             8        8          80        8.8
4  5796      True            6             6        6          60        6.6

   double_col date_string_col string_col           timestamp_col  year  month
0        60.6        08/01/10          6 2010-08-01 00:06:00.150  2010      8
1        80.8        08/01/10          8 2010-08-01 00:08:00.280  2010      8
2        60.6        08/02/10          6 2010-08-02 00:16:00.600  2010      8
3        80.8        08/02/10          8 2010-08-02 00:18:00.730  2010      8
4        60.6        08/03/10          6 2010-08-03 00:26:01.500  2010      8

Conditional aggregates

Suppose that we wish to filter using an aggregate computed conditional on some other expressions holding true. Using the TPC-H datasets, suppose that we want to filter customers based on the following criteria: Orders such that their amount exceeds the average amount for their sales region over the whole dataset. This can be computed any numbers of ways (such as joining auxiliary tables and filtering post-join)

Again, from prior examples, here are the joined up tables with all the customer data:

[5]:
region = con.table('tpch_region')
nation = con.table('tpch_nation')
customer = con.table('tpch_customer')
orders = con.table('tpch_orders')

fields_of_interest = [customer,
                      region.r_name.name('region'),
                      orders.o_totalprice,
                      orders.o_orderdate.cast('timestamp').name('odate')]

tpch = (region.join(nation, region.r_regionkey == nation.n_regionkey)
        .join(customer, customer.c_nationkey == nation.n_nationkey)
        .join(orders, orders.o_custkey == customer.c_custkey)
        [fields_of_interest])

tpch.limit(5)
[5]:
   c_custkey              c_name                                 c_address  \
0      36901  Customer#000036901                      TBb1yDZcf 8Zepk7apFJ
1      78002  Customer#000078002                                v7Jkg5XIqM
2     123314  Customer#000123314      nKPmaZi,OKhObOYSL3wc egXR4Vt99CXRclF
3     136777  Customer#000136777  Qy6YjXMy1jjCBkVDvnDThNMMLQG49wXEgIJ6DPLK
4      44485  Customer#000044485                           hlyA8oJVqUQfBOb

   c_nationkey          c_phone c_acctbal c_mktsegment  \
0           13  23-644-998-4944   4809.84   AUTOMOBILE
1           10  20-715-308-7926   4128.41   AUTOMOBILE
2           15  25-884-345-1592   -686.40    MACHINERY
3           10  20-500-807-1549    -65.46    HOUSEHOLD
4           20  30-452-969-2072   4543.02    FURNITURE

                                           c_comment       region  \
0  nstructions sleep final, regular deposits. qui...  MIDDLE EAST
1    ly after the special deposits. careful packages  MIDDLE EAST
2             ounts serve furiously. carefully expre       AFRICA
3  y regular foxes nag blithely among the careful...  MIDDLE EAST
4                ng excuses cajole along the silent,  MIDDLE EAST

  o_totalprice      odate
0    173665.47 1996-01-02
1     46929.18 1996-12-01
2    193846.25 1993-10-14
3     32151.78 1995-10-11
4    144659.20 1994-07-30

In this particular case, filtering based on the conditional average o_totalprice by region requires creating a table view (similar to the self-join examples from earlier) that can be treated as a distinct table entity in the expression. This would not be required if we were computing a conditional statistic from some other table. So this is a little more complicated than some other cases would be:

[6]:
t2 = tpch.view()
conditional_avg = t2[(t2.region == tpch.region)].o_totalprice.mean()

Once you’ve done this, you can use the conditional average in a filter expression

[7]:
amount_filter = tpch.o_totalprice > conditional_avg
tpch[amount_filter].limit(10)
[7]:
   c_custkey              c_name                              c_address  \
0      36901  Customer#000036901                   TBb1yDZcf 8Zepk7apFJ
1     123314  Customer#000123314   nKPmaZi,OKhObOYSL3wc egXR4Vt99CXRclF
2      39136  Customer#000039136        afZJC1mWpwvsfKT0211ZD6NQXVGETfl
3     130057  Customer#000130057            jQDBlCU2IlHmzkDfcqgIHg2eLsN
4      66958  Customer#000066958            h5jsmOq8nxf8Pz1Knqe GZdK4lh
5     127588  Customer#000127588      DbnvQxsG0,Nobhbj6n5cMUNPjfouzdFzH
6      86116  Customer#000086116  63BSp8bODm1dImPJEPTRmsSa4GqNA1SeRqFgx
7      81763  Customer#000081763     mZtn4M5r0KIw4aooP BXF3ReR RUlPJcAb
8      56614  Customer#000056614                    sRgxfCf6gDLmFnrO8UO
9      28547  Customer#000028547            AeWmD3BLrsSkmRY7O,wbB75i6Ll

   c_nationkey          c_phone c_acctbal c_mktsegment  \
0           13  23-644-998-4944   4809.84   AUTOMOBILE
1           15  25-884-345-1592   -686.40    MACHINERY
2            5  15-400-347-1643   5555.41    FURNITURE
3            9  19-938-862-4157   5009.55    FURNITURE
4           18  28-393-112-1873   9160.79    MACHINERY
5           14  24-409-883-5840    358.38     BUILDING
6            0  10-356-493-3518   3205.60   AUTOMOBILE
7            8  18-425-613-5972   8368.23    MACHINERY
8           11  21-657-845-6087   7623.48     BUILDING
9            1  11-711-951-5798   2095.42    MACHINERY

                                           c_comment       region  \
0  nstructions sleep final, regular deposits. qui...  MIDDLE EAST
1             ounts serve furiously. carefully expre       AFRICA
2  y? express theodolites haggle against the bold...       AFRICA
3   blithely regular packages. carefully bold acc...         ASIA
4  ggle quickly after the carefully stealthy depo...         ASIA
5                  accounts wake slyly along the bli       AFRICA
6    ironic ideas. quickly pending ideas sleep blith       AFRICA
7  ronic frays. slyly pending pinto beans are fur...         ASIA
8  sts. slyly ironic sheaves cajole dogged packag...  MIDDLE EAST
9  y regular foxes nag quickly after the express,...      AMERICA

  o_totalprice      odate
0    173665.47 1996-01-02
1    193846.25 1993-10-14
2    252004.18 1996-01-10
3    208660.75 1995-07-16
4    163243.98 1993-10-27
5    253724.56 1995-10-23
6    206680.66 1992-06-03
7    341734.47 1996-09-20
8    169405.01 1996-12-19
9    330793.52 1998-04-18

By looking at the table sizes before and after applying the filter you can see the relative size of the subset taken.

[8]:
tpch.count()
[8]:
1500000
[9]:
tpch[amount_filter].count()
[9]:
711969

Or even group by year and compare before and after:

[10]:
tpch.schema()
[10]:
ibis.Schema {
  c_custkey     int64
  c_name        string
  c_address     string
  c_nationkey   int32
  c_phone       string
  c_acctbal     decimal(12, 2)
  c_mktsegment  string
  c_comment     string
  region        string
  o_totalprice  decimal(12, 2)
  odate         timestamp
}
[11]:
year = tpch.odate.year().name('year')

pre_sizes = tpch.group_by(year).size()
post_sizes = tpch[amount_filter].group_by(year).size().view()

percent = ((post_sizes['count'] / pre_sizes['count'].cast('double'))
           .name('fraction'))

expr = (pre_sizes.join(post_sizes, pre_sizes.year == post_sizes.year)
        [pre_sizes.year,
         pre_sizes['count'].name('pre_count'),
         post_sizes['count'].name('post_count'),
         percent])
expr
[11]:
   year  pre_count  post_count  fraction
0  1994     227597      108087  0.474905
1  1996     228626      108757  0.475698
2  1992     227089      107815  0.474770
3  1998     133623       63551  0.475599
4  1993     226645      107703  0.475206
5  1995     228637      108315  0.473742
6  1997     227783      107741  0.472998

“Existence” filters

Some filtering involves checking for the existence of a particular value in a column of another table, or amount the results of some value expression. This is common in many-to-many relationships, and can be performed in numerous different ways, but it’s nice to be able to express it with a single concise statement and let Ibis compute it optimally.

Here’s some examples:

  • Filter down to customers having at least one open order

  • Find customers having no open orders with 1-URGENT status

  • Find stores (in the stores table) having the same name as a vendor (in the vendors table).

We’ll go ahead and solve the first couple of these problems using the TPC-H tables to illustrate the API:

[12]:
customer = con.table('tpch_customer')
orders = con.table('tpch_orders')
[13]:
orders.limit(5)
[13]:
   o_orderkey  o_custkey o_orderstatus o_totalprice o_orderdate  \
0           1      36901             O    173665.47  1996-01-02
1           2      78002             O     46929.18  1996-12-01
2           3     123314             F    193846.25  1993-10-14
3           4     136777             O     32151.78  1995-10-11
4           5      44485             F    144659.20  1994-07-30

  o_orderpriority          o_clerk  o_shippriority  \
0           5-LOW  Clerk#000000951               0
1        1-URGENT  Clerk#000000880               0
2           5-LOW  Clerk#000000955               0
3           5-LOW  Clerk#000000124               0
4           5-LOW  Clerk#000000925               0

                                           o_comment
0                 nstructions sleep furiously among
1   foxes. pending accounts at the pending, silen...
2  sly final accounts boost. carefully regular id...
3  sits. slyly regular warthogs cajole. regular, ...
4  quickly. bold deposits sleep slyly. packages u...

We introduce the any reduction:

[14]:
has_open_orders = ((orders.o_orderstatus == 'O') &
                   (customer.c_custkey == orders.o_custkey)).any()

This is now a valid filter:

[15]:
customer[has_open_orders].limit(10)
[15]:
   c_custkey              c_name                              c_address  \
0          1  Customer#000000001                      IVhzIApeRb ot,c,E
1          2  Customer#000000002         XSTf4,NCwDVaWNe6tEgvwfmRchLXak
2          4  Customer#000000004                            XxVSJsLAGtn
3          5  Customer#000000005           KvpyuHCplrB84WgAiGV6sYpZq7Tj
4          7  Customer#000000007         TcGe5gaZNgVePxU5kRrvXBfkasDTea
5          8  Customer#000000008  I0B10bB0AymmC, 0PrRYBCP1yGJ8xcBPmWhl5
6         10  Customer#000000010     6LrEaV6KR6PLVcgl2ArL Q3rqzLzcT1 v2
7         11  Customer#000000011                PkWS 3HlXqwTuzrKg633BEi
8         13  Customer#000000013                nsXQu0oVjD7PM659uC3SRSp
9         14  Customer#000000014                        KXkletMlL2JQEA

   c_nationkey          c_phone c_acctbal c_mktsegment  \
0           15  25-989-741-2988    711.56     BUILDING
1           13  23-768-687-3665    121.65   AUTOMOBILE
2            4  14-128-190-5944   2866.83    MACHINERY
3            3  13-750-942-6364    794.47    HOUSEHOLD
4           18  28-190-982-9759   9561.95   AUTOMOBILE
5           17  27-147-574-9335   6819.74     BUILDING
6            5  15-741-346-9870   2753.54    HOUSEHOLD
7           23  33-464-151-3439   -272.60     BUILDING
8            3  13-761-547-5974   3857.34     BUILDING
9            1  11-845-129-3851   5266.30    FURNITURE

                                           c_comment
0  to the even, regular platelets. regular, ironi...
1  l accounts. blithely ironic theodolites integr...
2   requests. final, regular ideas sleep final accou
3  n accounts will have to unwind. foxes cajole a...
4  ainst the ironic, express theodolites. express...
5  among the slyly regular theodolites kindle bli...
6                    es regular deposits haggle. fur
7  ckages. requests sleep slyly. quickly even pin...
8  ounts sleep carefully after the close frays. c...
9                  , ironic packages across the unus

For the second example, in which we want to find customers not having any open urgent orders, we write down the condition that they do have some first:

[16]:
has_open_urgent_orders = ((orders.o_orderstatus == 'O') &
                          (orders.o_orderpriority == '1-URGENT') &
                          (customer.c_custkey == orders.o_custkey)).any()

Now, we can negate this condition and use it as a filter:

[17]:
customer[-has_open_urgent_orders].count()
[17]:
75801

In this case, it is true that customer.c_custkey has no duplicate values, but that need not be the case. There could be multiple copies of any given value in either table column being compared, and the behavior will be the same (existence or non-existence is verified).