| HN Mirror

All very fair, though I think your issue may be more with the nature of real-time analytics ingestion pipelines in general than with Clickhouse itself.

Even if you could remove all of the operational burden from Kafka or equivalent, hooking it up to Clickhouse is still, at the end of the day, going to commit in batches (of max_insert_block_size, or kafka_max_block_size, or smaller batches polled from the message broker). Even with no consumer lag, that's still going to incur a delay before your data is SELECTable.

Heck, even Kafka publishers usually don't flush (actually send over the network) after every publish by default.

That same tradeoff comes up in Snowflake and Databricks (albeit mitigated when using Continuous Processing, which is experimental and expensive computationally and monetarily). Their ingestion systems are batching as well.

At the end of the day, "real time" means different things to different people, and you'll have to choose between one of several architectures:

- Clients synchronously insert data (which is then immediately visible) into your analytics store. ClickHouse is less good at handling a barrage of single-row INSERTs than other DBs, but none of them are good at this type of workload at even medium scale. Even manually shipping single-update files to S3 gets expensive and slow fast.

- Batch your inserts and accept bounded lag in data visibility. Doesn't matter whether batching is client-side, database-side, or in an intermediate broker/service.

- Ship your data asynchronously via messaging/streaming/batching and force point-in-time queries to wait for some indication that asynchronous data for the requested point in time has arrived. For example, when batching manually you could delay queries until a batch subsequent to the time-of-query has arrived, or when using Kafka you could wait for the system of record's last-committed-kafka-message-id to pass your topic's max ID at the time of query.