by maccard
609 days ago
Thanks for the well thought out reply here. I understand the solution you're proposing, but the thing is that it fails at the first hurdle.

> 1. Aggregate (in-memory or on cheap storage) events in the publisher application into batches.

ClickHouse's tagline on their website is:

> Build real-time data products that scale

Except, the minute we start having to batch data to process it and stage it, we lose the "real time" part. If I'm shipping them to S3 to have ClickHouse batch ingest them, I might as well use Databricks, Snowflake, or just parquet-on-s3.
Even if you could remove all of the operational burden from Kafka or equivalent, hooking it up to ClickHouse is still, at the end of the day, going to commit in batches (of max_insert_block_size, or kafka_max_block_size, or smaller batches polled from the message broker). Even with no consumer lag, that's still going to incur a delay before your data is SELECTable.
Heck, even Kafka publishers usually don't flush (actually send over the network) after every publish by default.
That same tradeoff comes up in Snowflake and Databricks (albeit mitigated when using Continuous Processing, which is experimental and expensive computationally and monetarily). Their ingestion systems are batching as well.
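To make the batching delay concrete, here is a minimal sketch of a size-or-age batcher, roughly the shape of what a producer or ingest layer does internally. Everything in it is hypothetical (`MicroBatcher`, `max_rows`, `max_linger_s`, and the `flush` sink are illustrative names, not ClickHouse or Kafka APIs), and time is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MicroBatcher:
    """Buffers rows and flushes on size or age, whichever comes first."""

    flush: Callable[[List[dict]], None]  # hypothetical sink, e.g. one bulk INSERT
    max_rows: int = 1000                 # flush once this many rows are buffered
    max_linger_s: float = 1.0            # ...or once the oldest row is this old
    _rows: List[dict] = field(default_factory=list)
    _oldest: float = 0.0

    def add(self, row: dict, now: float) -> None:
        if not self._rows:
            self._oldest = now  # the first row in a batch starts the linger clock
        self._rows.append(row)
        self.maybe_flush(now)

    def maybe_flush(self, now: float) -> None:
        full = len(self._rows) >= self.max_rows
        stale = bool(self._rows) and now - self._oldest >= self.max_linger_s
        if full or stale:
            batch, self._rows = self._rows, []
            self.flush(batch)
```

In this sketch a row becomes visible no sooner than the next flush, so the worst-case lag is about `max_linger_s` plus however often `maybe_flush` gets called from a timer, plus the insert itself. Shrinking the linger trades throughput for freshness, which is the same knob regardless of where the batching lives.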
At the end of the day, "real time" means different things to different people, and you'll have to choose between one of several architectures:
- Clients synchronously insert data (which is then immediately visible) into your analytics store. ClickHouse is less good at handling a barrage of single-row INSERTs than other DBs, but none of them are good at this type of workload at even medium scale. Even manually shipping single-update files to S3 gets expensive and slow fast.
- Batch your inserts and accept bounded lag in data visibility. Doesn't matter whether batching is client-side, database-side, or in an intermediate broker/service.
- Ship your data asynchronously via messaging/streaming/batching and force point-in-time queries to wait for some indication that asynchronous data for the requested point in time has arrived. For example, when batching manually you could delay queries until a batch subsequent to the time-of-query has arrived, or when using Kafka you could wait for the system of record's last-committed-kafka-message-id to pass your topic's max ID at the time of query.
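
The third option can be sketched as a "wait for the watermark" gate in front of the query. This is only an illustration under assumptions: `end_offset`, `committed`, and `run_query` are hypothetical callables you would back with your broker's and database's own APIs (for example, the newest message ID in the topic and the last ID the analytics store has committed), and the polling loop is the simplest thing that works rather than a recommendation.

```python
import time
from typing import Any, Callable


def wait_for_watermark(
    target: int,
    committed: Callable[[], int],
    timeout_s: float = 5.0,
    poll_s: float = 0.05,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Block until committed() >= target. Returns False if the timeout expires."""
    deadline = clock() + timeout_s
    while committed() < target:
        if clock() >= deadline:
            return False
        sleep(poll_s)
    return True


def consistent_query(
    end_offset: Callable[[], int],
    committed: Callable[[], int],
    run_query: Callable[[], Any],
    **wait_kwargs: Any,
) -> Any:
    """Run run_query() only after ingestion has caught up to query time."""
    # Snapshot the newest message ID at the moment the query is requested.
    target = end_offset()
    if not wait_for_watermark(target, committed, **wait_kwargs):
        raise TimeoutError(f"ingestion did not reach message ID {target}")
    return run_query()
```

The query pays the ingestion lag as latency instead of seeing stale data, so the timeout is effectively your bound on how far behind the pipeline is allowed to be.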