ClickHouse

from ibis.interactive import *
Ibis supports reading and querying data using ClickHouse as a backend.
In this example we’ll demonstrate using Ibis to connect to a ClickHouse server, and to execute a few queries.
Creating a Connection
First we need to connect Ibis to a running ClickHouse server.
In this example we’ll run queries against the publicly available ClickHouse playground server.
To run against your own ClickHouse server you’d only need to change the connection details.
con = ibis.connect("clickhouse://play:clickhouse@play.clickhouse.com:443?secure=True")
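As a hypothetical sketch of what "changing the connection details" looks like, the same URL scheme applies to any server; the host, user, and password below are placeholders:

```python
# Build a ClickHouse connection URL from hypothetical credentials
host, port = "my-clickhouse.example.com", 443
user, password = "analyst", "s3cret"
url = f"clickhouse://{user}:{password}@{host}:{port}?secure=True"
# con = ibis.connect(url)  # uncomment once the credentials are real
print(url)
```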
Listing available tables
The ClickHouse playground server has a number of interesting datasets available. To see them, we can examine the tables via the .tables attribute.
This shows a list of all tables available:
con.tables
Tables
------
- actors
- all_replicas_metric_log
- benchmark_results
- benchmark_runs
- cell_towers
- checks
- cisco_umbrella
- covid
- dish
- dns
- dns2
- food_facts
- github_events
- hackernews
- hackernews_changes_items
- hackernews_changes_profiles
- hackernews_history
- hackernews_top
- hits
- lineorder
- loc_stats
- menu
- menu_item
- menu_item_denorm
- menu_page
- minicrawl
- newswire
- ontime
- opensky
- pypi
- query_metrics_v2
- rdns
- recipes
- repos
- repos_raw
- run_attributes_v1
- search_clickhouse_stackoverflow
- search_stackoverflow
- stackoverflow
- stock
- tranco
- trips
- uk_price_paid
- uk_price_paid_updater
- wikistat
- workflow_jobs
Inspecting a Table
Let's take a look at the hackernews table. This table contains all posts and comments on Hacker News.
We can access the table by attribute as con.tables.hackernews.
t = con.tables.hackernews
We can then take a peek at the first few rows using the .head() method.
t.head()
┃ id ┃ deleted ┃ type ┃ by ┃ time ┃ text ┃ dead ┃ parent ┃ poll ┃ kids ┃ url ┃ score ┃ title ┃ parts ┃ descendants ┃
│ !uint32 │ !uint8 │ !string │ !string │ !timestamp(0) │ !string │ !uint8 │ !uint32 │ !uint32 │ !array<!uint32> │ !string │ !int32 │ !string │ !array<!uint32> │ !int32 │
│ 1 │ 0 │ story │ pg │ 2006-10-09 18:21:51 │ ~ │ 0 │ 0 │ 0 │ [15, 234509, ... +2] │ http://ycombinator.com │ 57 │ Y Combinator │ [] │ 15 │
│ 2 │ 0 │ story │ phyllis │ 2006-10-09 18:30:28 │ ~ │ 0 │ 0 │ 0 │ [] │ http://www.paulgraham.com/mit.html │ 16 │ A Student's Guide to Startups │ [] │ 0 │
│ 3 │ 0 │ story │ phyllis │ 2006-10-09 18:40:33 │ ~ │ 0 │ 0 │ 0 │ [531602] │ http://www.foundersatwork.com/stevewozniak.html │ 7 │ Woz Interview: the early days of Apple │ [] │ 0 │
│ 4 │ 0 │ story │ onebeerdave │ 2006-10-09 18:47:42 │ ~ │ 0 │ 0 │ 0 │ [] │ http://avc.blogs.com/a_vc/2006/10/the_nyc_develop.html │ 5 │ NYC Developer Dilemma │ [] │ 0 │
│ 5 │ 0 │ story │ perler │ 2006-10-09 18:51:04 │ ~ │ 0 │ 0 │ 0 │ [] │ http://www.techcrunch.com/2006/10/09/google-youtube-sign-more-separate-deals/ │ 7 │ Google, YouTube acquisition announcement could come tonight │ [] │ 0 │
Finding the highest scoring posts
Here we find the top 5 posts by score.
Only posts (not comments) have a title, so we:
- filter out rows that lack a title
- select only the columns we're interested in
- order_by score, descending
- limit to the top 5 rows
top_posts_by_score = (
    t.filter(_.title != "")
    .select("title", "score")
    .order_by(ibis.desc("score"))
    .limit(5)
)

top_posts_by_score
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ title                       ┃ score  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ !string                     │ !int32 │
├─────────────────────────────┼────────┤
│ Stephen Hawking has died    │   6015 │
│ A Message to Our Customers  │   5771 │
│ Steve Jobs has passed away. │   4338 │
│ Bram Moolenaar has died     │   4310 │
│ Mechanical Watch            │   4298 │
└─────────────────────────────┴────────┘
Finding the most prolific commenters
Here we find the top 5 commenters by number of comments made.
To do this we:
- filter out rows with no author
- group_by author
- count all the rows in each group
- order_by the counts, descending
- limit to the top 5 rows
top_commenters = (
    t.filter(_.by != "")
    .group_by("by")
    .agg(count=_.count())
    .order_by(ibis.desc("count"))
    .limit(5)
)
top_commenters
┏━━━━━━━━━━━━━━┳━━━━━━━┓
┃ by           ┃ count ┃
┡━━━━━━━━━━━━━━╇━━━━━━━┩
│ !string      │ int64 │
├──────────────┼───────┤
│ dang         │ 64937 │
│ tptacek      │ 61479 │
│ jacquesm     │ 56408 │
│ pjmlp        │ 54785 │
│ dragonwriter │ 51150 │
└──────────────┴───────┘
This query could also be expressed using the .topk
method, which is a shorthand for the above:
# This is a shorthand for the above
top_commenters = t.filter(_.by != "").by.topk(5)

top_commenters
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ by           ┃ CountStar() ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ !string      │ int64       │
├──────────────┼─────────────┤
│ dang         │       64937 │
│ tptacek      │       61479 │
│ jacquesm     │       56408 │
│ pjmlp        │       54785 │
│ dragonwriter │       51150 │
└──────────────┴─────────────┘
Finding top commenters by score
Here we find the top 5 commenters with the highest cumulative scores.
top_commenters_by_score = (
    t.filter(_.by != "")
    .group_by("by")
    .agg(total_score=_.score.sum())
    .order_by(ibis.desc("total_score"))
    .limit(5)
)
top_commenters_by_score
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ by           ┃ total_score ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ !string      │ int64       │
├──────────────┼─────────────┤
│ ingve        │      405248 │
│ tosh         │      301694 │
│ pseudolus    │      295067 │
│ Tomte        │      276776 │
│ todsacerdoti │      269524 │
└──────────────┴─────────────┘
Using the .topk
method, this can be written in a single line:
top_commenters_by_score = t.by.topk(5, by=_.score.sum().name("total_score"))
top_commenters_by_score
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ by           ┃ total_score ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ !string      │ int64       │
├──────────────┼─────────────┤
│ ingve        │      405248 │
│ tosh         │      301694 │
│ pseudolus    │      295067 │
│ Tomte        │      276776 │
│ todsacerdoti │      269524 │
└──────────────┴─────────────┘
Next Steps
There are lots of other interesting queries one might ask of this dataset.
A few examples:
- What posts had the most comments?
- How do post scores fluctuate over time?
- What day of the week has the highest average post score? What day has the lowest?
To learn more about how to use Ibis with ClickHouse, see the documentation.