|
It is very easy to get this dataset directly from HN API.
Let me just post it here: Table definition: CREATE TABLE hackernews_history
(
update_time DateTime DEFAULT now(),
id UInt32,
deleted UInt8,
type Enum('story' = 1, 'comment' = 2, 'poll' = 3, 'pollopt' = 4, 'job' = 5),
by LowCardinality(String),
time DateTime,
text String,
dead UInt8,
parent UInt32,
poll UInt32,
kids Array(UInt32),
url String,
score Int32,
title String,
parts Array(UInt32),
descendants Int32
)
ENGINE = MergeTree(update_time) ORDER BY id;
A shell script: BATCH_SIZE=1000
TWEAKS="--optimize_trivial_insert_select 0 --http_skip_not_found_url_for_globs 1 --http_make_head_request 0 --engine_url_skip_empty_files 1 --http_max_tries 10 --max_download_threads 1 --max_threads $BATCH_SIZE"
rm -f maxitem.json
wget --no-verbose https://hacker-news.firebaseio.com/v0/maxitem.json
clickhouse-local --query "
SELECT arrayStringConcat(groupArray(number), ',') FROM numbers(1, $(cat maxitem.json))
GROUP BY number DIV ${BATCH_SIZE} ORDER BY any(number) DESC" |
while read ITEMS
do
echo $ITEMS
clickhouse-client $TWEAKS --query "
INSERT INTO hackernews_history SELECT * FROM url('https://hacker-news.firebaseio.com/v0/item/{$ITEMS}.json')"
done
It takes a few hours to download the data and fill the table. |
I saw recursive cte blog post..but this doesn't seem to work your hn dataset
https://play.clickhouse.com/play?user=play#V0lUSCBSRUNVUlNJV...
Are recursive ctes disabled on this instance or am i doing something wrong?