ClickHouse

Ibis supports reading and querying data using ClickHouse as a backend.

In this example we’ll demonstrate using Ibis to connect to a ClickHouse server, and to execute a few queries.

from ibis.interactive import *

Creating a Connection

First we need to connect Ibis to a running ClickHouse server.

In this example we’ll run queries against the publicly available ClickHouse playground server.

To run against your own ClickHouse server you’d only need to change the connection details.

con = ibis.connect("clickhouse://play@play.clickhouse.com:443")

Listing available tables

The ClickHouse playground server has a number of interesting datasets available. To see them, we can examine the tables via the .tables attribute.

This shows a list of all tables available:

con.tables
Tables
------
- actors
- benchmark_results
- benchmark_runs
- cell_towers
- checks
- cisco_umbrella
- covid
- dish
- dns
- dns2
- food_facts
- github_events
- hackernews
- hits
- lineorder
- loc_stats
- menu
- menu_item
- menu_item_denorm
- menu_page
- minicrawl
- ontime
- opensky
- pypi
- query_metrics_v2
- rdns
- recipes
- repos
- repos_raw
- run_attributes_v1
- search_clickhouse_stackoverflow
- search_stackoverflow
- stackoverflow
- stock
- tranco
- trips
- uk_price_paid
- uk_price_paid_updater
- wikistat
- workflow_jobs

Inspecting a Table

Lets take a look at the hackernews table. This table contains all posts and comments on Hacker News.

We can access the table by attribute as con.tables.hackernews.

t = con.tables.hackernews

We can then take a peak at the first few rows using the .head() method.

t.head()
┏━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ id       deleted  type     by           time                 text     dead    parent   poll     kids                  url                                                                            score   title                                                        parts            descendants ┃
┡━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ !uint32!uint8!string!string!timestamp(0)!string!uint8!uint32!uint32!array<!uint32>!string!int32!string!array<!uint32>!int32      │
├─────────┼─────────┼─────────┼─────────────┼─────────────────────┼─────────┼────────┼─────────┼─────────┼──────────────────────┼───────────────────────────────────────────────────────────────────────────────┼────────┼─────────────────────────────────────────────────────────────┼─────────────────┼─────────────┤
│       10story  pg         2006-10-09 18:21:51~000[15, 234509, ... +2]http://ycombinator.com                                                       57Y Combinator                                               []15 │
│       20story  phyllis    2006-10-09 18:30:28~000[]http://www.paulgraham.com/mit.html                                           16A Student's Guide to Startups                              []0 │
│       30story  phyllis    2006-10-09 18:40:33~000[531602]http://www.foundersatwork.com/stevewozniak.html                              7Woz Interview: the early days of Apple                     []0 │
│       40story  onebeerdave2006-10-09 18:47:42~000[]http://avc.blogs.com/a_vc/2006/10/the_nyc_develop.html                       5NYC Developer Dilemma                                      []0 │
│       50story  perler     2006-10-09 18:51:04~000[]http://www.techcrunch.com/2006/10/09/google-youtube-sign-more-separate-deals/7Google, YouTube acquisition announcement could come tonight[]0 │
└─────────┴─────────┴─────────┴─────────────┴─────────────────────┴─────────┴────────┴─────────┴─────────┴──────────────────────┴───────────────────────────────────────────────────────────────────────────────┴────────┴─────────────────────────────────────────────────────────────┴─────────────────┴─────────────┘

Finding the highest scoring posts

Here we find the top 5 posts by score.

Posts have a title, so we:

  • filter out rows that lack a title
  • select only the columns we’re interested in
  • order them by score, descending
  • limit to the top 5 rows
top_posts_by_score = (
    t.filter(_.title != "")
    .select("title", "score")
    .order_by(ibis.desc("score"))
    .limit(5)
)

top_posts_by_score
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ title                        score  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ !string!int32 │
├─────────────────────────────┼────────┤
│ Stephen Hawking has died   6015 │
│ A Message to Our Customers 5771 │
│ Steve Jobs has passed away.4338 │
│ Bram Moolenaar has died    4310 │
│ Mechanical Watch           4298 │
└─────────────────────────────┴────────┘

Finding the most prolific commenters

Here we find the top 5 commenters by number of comments made.

To do this we:

  • filter out rows with no author
  • group_by author
  • count all the rows in each group
  • order_by the counts, descending
  • limit to the top 5 rows
top_commenters = (
    t.filter(_.by != "")
    .group_by("by")
    .agg(count=_.count())
    .order_by(ibis.desc("count"))
    .limit(5)
)

top_commenters
┏━━━━━━━━━━━━━━┳━━━━━━━┓
┃ by            count ┃
┡━━━━━━━━━━━━━━╇━━━━━━━┩
│ !stringint64 │
├──────────────┼───────┤
│ dang        64937 │
│ tptacek     61479 │
│ jacquesm    56408 │
│ pjmlp       54785 │
│ dragonwriter51150 │
└──────────────┴───────┘

This query could also be expressed using the .topk method, which is a shorthand for the above:

# This is a shorthand for the above
top_commenters = t.filter(_.by != "").by.topk(5)

top_commenters
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ by            Count(by) ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ !stringint64     │
├──────────────┼───────────┤
│ dang        64937 │
│ tptacek     61479 │
│ jacquesm    56408 │
│ pjmlp       54785 │
│ dragonwriter51150 │
└──────────────┴───────────┘

Finding top commenters by score

Here we find the top 5 commenters with the highest cumulative scores. In this case the .topk shorthand won’t work and we’ll need to write out the full group_by -> agg -> order_by -> limit pipeline.

top_commenters_by_score = (
    t.filter(_.by != "")
    .group_by("by")
    .agg(total_score=_.score.sum())
    .order_by(ibis.desc("total_score"))
    .limit(5)
)

top_commenters_by_score
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ by            total_score ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ !stringint64       │
├──────────────┼─────────────┤
│ ingve       405248 │
│ tosh        301694 │
│ pseudolus   295067 │
│ Tomte       276776 │
│ todsacerdoti269524 │
└──────────────┴─────────────┘

Next Steps

There are lots of other interesting queries one might ask of this dataset.

A few examples:

  • What posts had the most comments?
  • How do post scores fluctuate over time?
  • What day of the week has the highest average post score? What day has the lowest?

To learn more about how to use Ibis with Clickhouse, see the documentation.

Back to top