More Value Expressions

Setup

[1]:
import os
import ibis

ibis.options.interactive = True

connection = ibis.sqlite.connect(os.path.join('data', 'geography.db'))

Type casting

The Ibis type system is pretty basic and will get better (and more documented over time). It maps directly onto the current Impala type system

  • int8

  • int16

  • int32

  • int64

  • boolean

  • float

  • double

  • string

  • timestamp

  • decimal($precision, $scale)

These type names can be used to cast from one type to another

[2]:
countries = connection.table('countries')
countries
connection.table('gdp')
[2]:
     country_code  year         value
0             ABW  1986  4.054634e+08
1             ABW  1987  4.876025e+08
2             ABW  1988  5.964236e+08
3             ABW  1989  6.953044e+08
4             ABW  1990  7.648871e+08
...           ...   ...           ...
9995          SVK  2002  3.513034e+10
9996          SVK  2003  4.681659e+10
9997          SVK  2004  5.733202e+10
9998          SVK  2005  6.278531e+10
9999          SVK  2006  7.070810e+10

[10000 rows x 3 columns]
[3]:
countries = connection.table('countries')
countries.population.cast('float').sum()
[3]:
6878963738.0
[4]:
countries.area_km2.cast('int32').sum()
[4]:
150012536

Case / if-then-else expressions

We support a number of variants of the SQL-equivalent CASE expression, and will add more API functions over time to meet different use cases and enhance the expressiveness of any branching-based value logic.

[5]:
expr = (countries.continent
        .case()
        .when('AF', 'Africa')
        .when('AN', 'Antarctica')
        .when('AS', 'Asia')
        .when('EU', 'Europe')
        .when('NA', 'North America')
        .when('OC', 'Oceania')
        .when('SA', 'South America')
        .else_(countries.continent)
        .end()
        .name('continent_name'))

expr.value_counts()
[5]:
  continent_name  count
0         Africa     58
1     Antarctica      5
2           Asia     51
3         Europe     54
4  North America     42
5        Oceania     28
6  South America     14

If the else_ default condition is not provided, any values not matching one of the conditions will be NULL.

[6]:
expr = (countries.continent
        .case()
        .when('AF', 'Africa')
        .when('AS', 'Asia')
        .when('EU', 'Europe')
        .when('NA', 'North America')
        .when('OC', 'Oceania')
        .when('SA', 'South America')
        .end()
        .name('continent_name_with_nulls'))

expr.value_counts()
[6]:
  continent_name_with_nulls  count
0                      None      5
1                    Africa     58
2                      Asia     51
3                    Europe     54
4             North America     42
5                   Oceania     28
6             South America     14

To test for an arbitrary series of boolean conditions, use the case API method and pass any boolean expressions potentially involving columns of the table:

[7]:
expr = (ibis.case()
        .when(countries.population > 25_000_000, 'big')
        .when(countries.population < 5_000_000, 'small')
        .else_('medium')
        .end()
        .name('size'))

countries['name', 'population', expr].limit(10)
[7]:
                   name  population    size
0               Andorra       84000   small
1  United Arab Emirates     4975593   small
2           Afghanistan    29121286     big
3   Antigua and Barbuda       86754   small
4              Anguilla       13254   small
5               Albania     2986952   small
6               Armenia     2968000   small
7  Netherlands Antilles      300000   small
8                Angola    13068161  medium
9            Antarctica           0   small

Simple ternary-cases (like the Python X if COND else Y) can be written using the ifelse function:

[8]:
expr = ((countries.continent == 'AS')
        .ifelse('Asia', 'Not Asia')
        .name('is_asia'))

countries['name', 'continent', expr].limit(10)
[8]:
                   name continent   is_asia
0               Andorra        EU  Not Asia
1  United Arab Emirates        AS      Asia
2           Afghanistan        AS      Asia
3   Antigua and Barbuda        NA  Not Asia
4              Anguilla        NA  Not Asia
5               Albania        EU  Not Asia
6               Armenia        AS      Asia
7  Netherlands Antilles        NA  Not Asia
8                Angola        AF  Not Asia
9            Antarctica        AN  Not Asia

Set membership

The isin and notin functions are like their pandas counterparts. These can take:

  • A list of value expressions, either literal values or other column expressions

  • An array/column expression of some kind

[9]:
is_america = countries.continent.isin(['NA', 'SA'])
countries[is_america].continent.value_counts()
[9]:
  continent  count
0        NA     42
1        SA     14

You can also check for membership in an array. Here is an example of filtering based on the top 3 (ignoring ties) most frequently-occurring values in the string_col column of alltypes:

[10]:
top_continents = countries.continent.value_counts().limit(3).continent
top_continents_filter = countries.continent.isin(top_continents)
expr = countries[top_continents_filter]

expr.count()
[10]:
114

This is a common enough operation that we provide a special analytical filter function topk:

[11]:
countries.continent.topk(3)
[11]:
  continent  count
0        AF     58
1        EU     54
2        AS     51

Cool, huh? More on topk later.

Null-ness

Like their pandas equivalents, the isnull and notnull functions return True values if the values are null, or non-null, respectively. For example:

[12]:
expr = (countries.continent
        .case()
        .when('AF', 'Africa')
        .when('EU', 'Europe')
        .when('AS', 'Asia')
        .end()
        .name('top_continent_name'))

expr.isnull().value_counts()
[12]:
   unnamed  count
0    False    163
1     True     89

Functions like isnull can be combined with case expressions or functions like ifelse to replace null values with some other value. ifelse here will use the first value supplied for any True value and the second value for any False value. Either value can be a scalar or array.

[13]:
expr2 = expr.isnull().ifelse('Other continent', expr).name('continent')
expr2.value_counts()
[13]:
         continent  count
0           Africa     58
1             Asia     51
2           Europe     54
3  Other continent     89

Distinct-based operations

Ibis supports using distinct to remove duplicate rows or values on tables or arrays. For example:

[14]:
countries['continent'].distinct()
[14]:
0    EU
1    AS
2    NA
3    AF
4    AN
5    SA
6    OC
Name: continent, dtype: object
[15]:
countries.continent.distinct()
[15]:
0    EU
1    AS
2    NA
3    AF
4    AN
5    SA
6    OC
Name: continent, dtype: object

This can be combined with count to form a reduction metric:

[16]:
metric = (countries.continent
          .distinct().count()
          .name('num_continents'))
metric
[16]:
7

This is common enough to have a shortcut nunique:

[17]:
countries.continent.nunique()
[17]:
7

String operations

What’s supported is pretty basic right now. We intend to support the full gamut of regular expression munging with a nice API, though in some cases some work will be required on SQLite’s backend to support everything.

[18]:
countries[['name']].limit(5)
[18]:
                   name
0               Andorra
1  United Arab Emirates
2           Afghanistan
3   Antigua and Barbuda
4              Anguilla

At the moment, basic substring operations (substr, with conveniences left and right) and Python-like APIs such as lower and upper (for case normalization) are supported. So you could count first letter occurrences in a string column like so:

[19]:
expr = countries.name.lower().left(1).name('first_letter')
expr.value_counts().sort_by(('count', False)).limit(10)
[19]:
  first_letter  count
0            s     34
1            b     22
2            m     22
3            c     21
4            g     16
5            n     16
6            a     15
7            t     14
8            p     12
9            i     10

For fuzzy and regex filtering/searching, you can use one of the following

  • like, works as the SQL LIKE keyword

  • rlike, like re.search or SQL RLIKE

  • contains, like x in str_value in Python

[20]:
countries[countries.name.like('%GE%')].name
[20]:
0                                   Argentina
1                                     Germany
2                                     Algeria
3                                     Georgia
4    South Georgia and South Sandwich Islands
5                                       Niger
6                                     Nigeria
Name: name, dtype: object
[21]:
countries[countries.name.lower().rlike('.*ge.*')].name
[21]:
0                                   Argentina
1                                     Germany
2                                     Algeria
3                                     Georgia
4    South Georgia and South Sandwich Islands
5                                       Niger
6                                     Nigeria
Name: name, dtype: object
[22]:
countries[countries.name.lower().contains('ge')].name
[22]:
0                                   Argentina
1                                     Germany
2                                     Algeria
3                                     Georgia
4    South Georgia and South Sandwich Islands
5                                       Niger
6                                     Nigeria
Name: name, dtype: object

Timestamp operations

Date and time functionality is relatively limited at present compared with pandas, but we’ll get there. The main things we have right now are

  • Field access (year, month, day, …)

  • Timedeltas

  • Comparisons with fixed timestamps

[23]:
independence = connection.table('independence')

independence[independence.independence_date, independence.independence_date.month().name('month')].limit(10)
[23]:
  independence_date  month
0        1919-08-19      8
1        1912-11-28     11
2        1962-07-05      7
3        1975-11-11     11
4        1981-11-01     11
5        1816-07-09      7
6        1918-05-28      5
7        1991-09-21      9
8        1955-10-26     10
9        1918-05-28      5

Somewhat more comprehensively

[24]:
def get_field(f):
    return getattr(independence.independence_date, f)().name(f)

fields = ['year', 'month', 'day']  # datetime fields can also use: 'hour', 'minute', 'second', 'millisecond'
projection = [independence.independence_date] + [get_field(x) for x in fields]
independence[projection].limit(10)
[24]:
  independence_date  year  month  day
0        1919-08-19  1919      8   19
1        1912-11-28  1912     11   28
2        1962-07-05  1962      7    5
3        1975-11-11  1975     11   11
4        1981-11-01  1981     11    1
5        1816-07-09  1816      7    9
6        1918-05-28  1918      5   28
7        1991-09-21  1991      9   21
8        1955-10-26  1955     10   26
9        1918-05-28  1918      5   28

For timestamp arithmetic and comparisons, check out functions in the top level ibis namespace. This include things like day and second, but also the ibis.timestamp function:

[25]:
independence[independence.independence_date.min(),
             independence.independence_date.max(),
             independence.count().name('nrows')].distinct()
[25]:
          min        max  nrows
0  1291-08-01 2011-07-09    186
[26]:
independence[independence.independence_date > '2000-01-01'].count()
[26]:
3

Some backends support adding offsets, for example independence.independence_date + ibis.interval(days=1) or ibis.now() - independence.independence_date.