ClickHouse

from ibis.interactive import *
Ibis supports reading and querying data using ClickHouse as a backend.
In this example we’ll demonstrate using Ibis to connect to a ClickHouse server, and to execute a few queries.
Creating a Connection
First we need to connect Ibis to a running ClickHouse server.
In this example we’ll run queries against the publicly available ClickHouse playground server.
To run against your own ClickHouse server you’d only need to change the connection details.
con = ibis.connect("clickhouse://play:clickhouse@play.clickhouse.com:443?secure=True")
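As a hypothetical sketch of what "changing the connection details" looks like, the same URL scheme applies to any server; the host, user, and password below are placeholders:

```python
# Build a ClickHouse connection URL from hypothetical credentials
host, port = "my-clickhouse.example.com", 443
user, password = "analyst", "s3cret"
url = f"clickhouse://{user}:{password}@{host}:{port}?secure=True"
# con = ibis.connect(url)  # uncomment once the credentials are real
print(url)
```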
Listing available tables
The ClickHouse playground server has a number of interesting datasets available. To see them, we can examine the tables via the .tables attribute.
This shows a list of all tables available:
con.tables
Tables
------
- actors
- all_replicas_metric_log
- benchmark_results
- benchmark_runs
- cell_towers
- checks
- cisco_umbrella
- covid
- dish
- dns
- dns2
- food_facts
- github_events
- hackernews
- hackernews_changes_items
- hackernews_changes_profiles
- hackernews_history
- hackernews_top
- hits
- lineorder
- loc_stats
- menu
- menu_item
- menu_item_denorm
- menu_page
- minicrawl
- newswire
- ontime
- opensky
- pypi
- query_metrics_v2
- rdns
- recipes
- repos
- repos_raw
- run_attributes_v1
- search_clickhouse_stackoverflow
- search_stackoverflow
- stackoverflow
- stock
- tranco
- trips
- uk_price_paid
- uk_price_paid_updater
- wikistat
- workflow_jobs
Inspecting a Table
Let's take a look at the hackernews table. This table contains all posts and comments on Hacker News.
We can access the table by attribute as con.tables.hackernews.
t = con.tables.hackernews
We can then take a peek at the first few rows using the .head() method.
t.head()
┃ id ┃ deleted ┃ type ┃ by ┃ time ┃ text ┃ dead ┃ parent ┃ poll ┃ kids ┃ url ┃ score ┃ title ┃ parts ┃ descendants ┃
│ !uint32 │ !uint8 │ !string │ !string │ !timestamp(0) │ !string │ !uint8 │ !uint32 │ !uint32 │ !array<!uint32> │ !string │ !int32 │ !string │ !array<!uint32> │ !int32 │
│ 1 │ 0 │ story │ pg │ 2006-10-09 18:21:51 │ ~ │ 0 │ 0 │ 0 │ [15, 234509, ... +2] │ http://ycombinator.com │ 57 │ Y Combinator │ [] │ 15 │
│ 2 │ 0 │ story │ phyllis │ 2006-10-09 18:30:28 │ ~ │ 0 │ 0 │ 0 │ [] │ http://www.paulgraham.com/mit.html │ 16 │ A Student's Guide to Startups │ [] │ 0 │
│ 3 │ 0 │ story │ phyllis │ 2006-10-09 18:40:33 │ ~ │ 0 │ 0 │ 0 │ [531602] │ http://www.foundersatwork.com/stevewozniak.html │ 7 │ Woz Interview: the early days of Apple │ [] │ 0 │
│ 4 │ 0 │ story │ onebeerdave │ 2006-10-09 18:47:42 │ ~ │ 0 │ 0 │ 0 │ [] │ http://avc.blogs.com/a_vc/2006/10/the_nyc_develop.html │ 5 │ NYC Developer Dilemma │ [] │ 0 │
│ 5 │ 0 │ story │ perler │ 2006-10-09 18:51:04 │ ~ │ 0 │ 0 │ 0 │ [] │ http://www.techcrunch.com/2006/10/09/google-youtube-sign-more-separate-deals/ │ 7 │ Google, YouTube acquisition announcement could come tonight │ [] │ 0 │
Finding the highest scoring posts
Here we find the top 5 posts by score.
Only posts (not comments) have a title, so we:
- filter out rows that lack a title
- select only the columns we're interested in
- order_by score, descending
- limit to the top 5 rows
top_posts_by_score = (
    t.filter(_.title != "")
    .select("title", "score")
    .order_by(ibis.desc("score"))
    .limit(5)
)

top_posts_by_score
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ title                       ┃ score  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ !string                     │ !int32 │
├─────────────────────────────┼────────┤
│ Stephen Hawking has died    │   6015 │
│ A Message to Our Customers  │   5771 │
│ Steve Jobs has passed away. │   4338 │
│ Bram Moolenaar has died     │   4310 │
│ Mechanical Watch            │   4298 │
└─────────────────────────────┴────────┘
Finding the most prolific commenters
Here we find the top 5 commenters by number of comments made.
To do this we:
- filter out rows with no author
- group_by author
- count all the rows in each group
- order_by the counts, descending
- limit to the top 5 rows
top_commenters = (
    t.filter(_.by != "")
    .group_by("by")
    .agg(count=_.count())
    .order_by(ibis.desc("count"))
    .limit(5)
)
top_commenters
┏━━━━━━━━━━━━━━┳━━━━━━━┓
┃ by           ┃ count ┃
┡━━━━━━━━━━━━━━╇━━━━━━━┩
│ !string      │ int64 │
├──────────────┼───────┤
│ dang         │ 64937 │
│ tptacek      │ 61479 │
│ jacquesm     │ 56408 │
│ pjmlp        │ 54785 │
│ dragonwriter │ 51150 │
└──────────────┴───────┘
This query could also be expressed using the .topk
method, which is a shorthand for the above:
# This is a shorthand for the above
top_commenters = t.filter(_.by != "").by.topk(5)

top_commenters
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ by           ┃ CountStar() ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ !string      │ int64       │
├──────────────┼─────────────┤
│ dang         │       64937 │
│ tptacek      │       61479 │
│ jacquesm     │       56408 │
│ pjmlp        │       54785 │
│ dragonwriter │       51150 │
└──────────────┴─────────────┘
Finding top commenters by score
Here we find the top 5 commenters with the highest cumulative scores.
top_commenters_by_score = (
    t.filter(_.by != "")
    .group_by("by")
    .agg(total_score=_.score.sum())
    .order_by(ibis.desc("total_score"))
    .limit(5)
)
top_commenters_by_score
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ by           ┃ total_score ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ !string      │ int64       │
├──────────────┼─────────────┤
│ ingve        │      405248 │
│ tosh         │      301694 │
│ pseudolus    │      295067 │
│ Tomte        │      276776 │
│ todsacerdoti │      269524 │
└──────────────┴─────────────┘
Using the .topk
method, this can be written in a single line:
top_commenters_by_score = t.by.topk(5, by=_.score.sum().name("total_score"))
top_commenters_by_score
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ by           ┃ total_score ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ !string      │ int64       │
├──────────────┼─────────────┤
│ ingve        │      405248 │
│ tosh         │      301694 │
│ pseudolus    │      295067 │
│ Tomte        │      276776 │
│ todsacerdoti │      269524 │
└──────────────┴─────────────┘
Next Steps
There are lots of other interesting queries one might ask of this dataset.
A few examples:
- What posts had the most comments?
- How do post scores fluctuate over time?
- What day of the week has the highest average post score? What day has the lowest?
To learn more about how to use Ibis with ClickHouse, see the documentation.