Hacker News new | ask | show | jobs
by maxmcd 587 days ago
Using duckdb and apache iceberg means that you can run read replicas without any operational burden. Clickhouse is amazing, but they do not allow you to mount dumb read replicas to object storage (yet).

I can imagine this product is a very elegant solution for many types of companies/teams/workloads.

1 comments

You can mount read replicas on object storage in ClickHouse.

Example:

    CREATE DATABASE test;
    USE test;

    CREATE TABLE hackernews_history UUID '66491946-56e3-4790-a112-d2dc3963e68a'
    (
        `update_time` DateTime DEFAULT now(),
        `id` UInt32,
        `deleted` UInt8,
        `type` Enum8('story' = 1, 'comment' = 2, 'poll' = 3, 'pollopt' = 4, 'job' = 5),
        `by` LowCardinality(String),
        `time` DateTime,
        `text` String,
        `dead` UInt8,
        `parent` UInt32,
        `poll` UInt32,
        `kids` Array(UInt32),
        `url` String,
        `score` Int32,
        `title` String,
        `parts` Array(UInt32),
        `descendants` Int32
    )
    ENGINE = ReplacingMergeTree(update_time)
    ORDER BY id
    SETTINGS disk = disk(readonly = true, type = 's3_plain_rewritable', endpoint = 'https://clicklake-test-2.s3.eu-central-1.amazonaws.com/', use_environment_credentials = false);
And you can try it right now.

Install ClickHouse:

    curl https://clickhouse.com/ | sh
    ./clickhouse local
Run the query above to attach the table.

The table is updated in real time. For example, here is your comment:

    :) SELECT * FROM hackernews_history WHERE text LIKE '%Clickhouse is amazing%' ORDER BY update_time \G

    Row 1:
    ──────
    update_time: 2024-04-06 16:35:28
    id:          39785472
    deleted:     0
    type:        comment
    by:          mightybyte
    time:        2024-03-21 22:59:20
    text:        I&#x27;ll second this.  Clickhouse is amazing.  I was actually using it today to query some CSV files.  I had to refresh my memory on the syntax so if anyone is interested:<p><pre><code>  clickhouse local -q &quot;SELECT foo, sum(bar) FROM file(&#x27;foobar.csv&#x27;, CSV) GROUP BY foo FORMAT Pretty&quot;
    </code></pre>
    Way easier than opening in Excel and creating a pivot table which was my previous workflow.<p>Here&#x27;s a list of the different input and output formats that it supports.<p><a href="https:&#x2F;&#x2F;clickhouse.com&#x2F;docs&#x2F;en&#x2F;interfaces&#x2F;formats" rel="nofollow">https:&#x2F;&#x2F;clickhouse.com&#x2F;docs&#x2F;en&#x2F;interfaces&#x2F;formats</a>
    dead:        0
    parent:      39784942
    poll:        0
    kids:        [39788575]
    url:         
    score:       0
    title:       
    parts:       []
    descendants: 0

    Row 2:
    ──────
    update_time: 2024-04-06 18:07:34
    id:          31334599
    deleted:     0
    type:        comment
    by:          richieartoul
    time:        2022-05-11 00:54:31
    text:        Not really. Clickhouse is amazing, but if you want to run it at massive scale you’ll have to invest a lot into sharding and clustering and all that. Druid is more distributed by default, but doesn’t support as sophisticated of queries as Clickhouse does.<p>Neither Clickhouse nor Druid can hold a candle to what Snowflake can do in terms of query capabilities, as well as the flexibility and richness of their product.<p>That’s just scratching the surface. They’re completely different product categories IMO, although they have a lot of technical &#x2F; architectural overlap depending on how much you squint.<p>Devil is in the details basically.
    dead:        0
    parent:      31334527
    poll:        0
    kids:        [31334736]
    url:         
    score:       0
    title:       
    parts:       []
    descendants: 0

    Row 3:
    ──────
    update_time: 2024-11-07 22:29:09
    id:          42081672
    deleted:     0
    type:        comment
    by:          maxmcd
    time:        2024-11-07 22:13:12
    text:        Using duckdb and apache iceberg means that you can run read replicas without any operational burden. Clickhouse is amazing, but they do not allow you to mount dumb read replicas to object storage (yet).<p>I can imagine this product is a very elegant solution for many types of companies&#x2F;teams&#x2F;workloads.
    dead:        0
    parent:      42080385
    poll:        0
    kids:        []
    url:         
    score:       0
    title:       
    parts:       []
    descendants: 0

    3 rows in set. Elapsed: 3.981 sec. Processed 42.27 million rows, 14.45 GB (10.62 million rows/s., 3.63 GB/s.)
    Peak memory usage: 579.26 MiB.
When I try your code I get this, any idea?

Query id: daa202a3-874c-4a68-9e3c-974560ba4624

Elapsed: 0.092 sec.

Received exception: Code: 499. DB::Exception: The AWS Access Key Id you provided does not exist in our records. (Code: 23, S3 exception: 'InvalidAccessKeyId'): While processing disk(readonly = true, type = 's3_plain_rewritable', endpoint = 'https://clicklake-test-2.s3.eu-central-1.amazonaws.com/', use_environment_credentials = false). (S3_ERROR)

It can be a wrong AWS profile in your configuration. In that case, you can correct the configuration, or use an empty server, docker container, or even CloudShell.
Very interesting, can you give more info on how this could be used for instance in my IoT case where I want to keep the last 3 months (say) of data in Postgres, and dump old data in parquet/iceberg on S3, and be able to do analytical queries on the past data? Would that be hard to do?

And how does the real-time update work? Could I make it so that my latest data is incrementally sync'd on S3 (eg "the last 3-months block" is incrementally updated efficiently each time there is new data) ?

Do you have example code / setup for this?

You can store all data in ClickHouse (on S3 or on local storage); there is no need to separate historical and real-time data.

To insert data into ClickHouse, you use the INSERT query to insert data as frequently as you'd like.

Alternatively, you can set up continuous replication from Postgres to ClickHouse, which is available in ClickHouse Cloud.

> 3 rows in set. Elapsed: 3.981 sec. Processed 42.27 million rows, 14.45 GB (10.62 million rows/s., 3.63 GB/s.)

Since you were running $(./clickhouse local) does that mean the query downloaded 14.45GB out of S3 to your machine? The 3.981s seems to imply "no," but I struggle to think what meaning that output would otherwise try to convey

This is the amount of processed data after decompression. The amount of compressed data is less. You can press whitespace during the query run to see the detailed metrics.

    ReadBufferFromS3Bytes                     8.95 GB
The amount of compressed data read from S3 is 8.95 GB. Note: it sounds quite large, interesting why compression is less than 2x on this dataset. Most likely, it uses just lz4.

I recommend trying it because it works everywhere. You can run ClickHouse on your laptop, on a VM, or even in Cloudshell in AWS.

Whoops, I forgot that tables maintained in the db are not the same as remote archives. :|
The setup in my example uses a table updated by one ClickHouse server and observed by an infinite amount of read-only replicas, which could run anywhere.
how about the case of a clickhouse cluster that use zk and a distributed table