by oulipo
407 days ago
Seems interesting, but I'm not sure what duplication means in this context. Is Kafka sending the same row several times, and if so, why? Could you give practical examples where duplication happens? My use case is IoT, with devices connecting over MQTT and sending batches of data. Each time we ingest a batch, we stream all the corresponding rows into the database. Because we only ingest a batch once, I don't think there can really be duplicates, so I probably wouldn't be the target of your solution. But I'm still curious in which cases this happens, and why Kafka or ClickHouse couldn't dedup themselves using a primary key or something similar.
ClickHouse doesn't enforce primary keys; it stores whatever you send. ReplacingMergeTree and FINAL exist in ClickHouse, but they are not ideal for real-time streams: ReplacingMergeTree only removes duplicates during background merges, which run at an unspecified time, so queries can return duplicates until those merges finish, and FINAL forces deduplication at query time at the cost of slower queries.
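To illustrate the merge timing issue, here is a toy Python model, not ClickHouse code. It assumes a simplified table where every insert creates a separate part and rows with the same key collapse only when parts are merged or when the read asks for a FINAL-style collapse. The class `ToyReplacingTable` and its methods are hypothetical names used only for this sketch.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, int, float]  # (key, version, value); a hypothetical row shape


class ToyReplacingTable:
    """Toy model only: each insert is a part; duplicates collapse on merge."""

    def __init__(self) -> None:
        self.parts: List[List[Row]] = []

    def insert(self, rows: List[Row]) -> None:
        # Every insert lands in its own part and is not compared to older parts.
        self.parts.append(list(rows))

    def select(self, final: bool = False) -> List[Row]:
        rows = [row for part in self.parts for row in part]
        # final=True mimics collapsing duplicates at query time.
        return self._collapse(rows) if final else rows

    def merge(self) -> None:
        # Stands in for a background merge, which runs at an unknown time.
        all_rows = [row for part in self.parts for row in part]
        self.parts = [self._collapse(all_rows)]

    @staticmethod
    def _collapse(rows: List[Row]) -> List[Row]:
        # Keep one row per key, preferring the highest version.
        latest: Dict[str, Row] = {}
        for row in rows:
            key, version, _ = row
            if key not in latest or version >= latest[key][1]:
                latest[key] = row
        return sorted(latest.values())


table = ToyReplacingTable()
table.insert([("dev1", 1, 20.5)])
table.insert([("dev1", 1, 20.5)])  # the same batch replayed

before_merge = table.select()            # both copies are still visible
with_final = table.select(final=True)    # collapsed at read time
table.merge()
after_merge = table.select()             # collapsed once the merge has run
```

In this model a plain read between the replay and the merge sees the row twice, which is the window in which a real-time query can return wrong aggregates.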
With GlassFlow, you deduplicate the data streams before they reach ClickHouse, which gives correct query results and puts less load on ClickHouse.
In your IoT case, a scenario I can imagine is batch replays (you might resend data already ingested). But if you're sure the data is clean and only sent once, you may not need this.
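As a rough sketch of what deduplicating before the database can look like, here is a minimal keyed filter with a time window, written in Python. This is not GlassFlow's implementation or API. The class `WindowedDeduplicator`, the key `(device_id, ts)`, and the one-hour window are all assumptions chosen for illustration. It also keeps state in memory only, so a real pipeline would need persistent state to survive restarts.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Dict, Hashable, Iterable, Iterator

Row = Dict[str, Any]


class WindowedDeduplicator:
    """Drops rows whose key was already seen within the last window_seconds."""

    def __init__(
        self,
        key_fn: Callable[[Row], Hashable],
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._key_fn = key_fn
        self._window = window_seconds
        self._clock = clock
        # Maps key -> first-seen time, kept in insertion (oldest first) order.
        self._seen: "OrderedDict[Hashable, float]" = OrderedDict()

    def _evict_expired(self, now: float) -> None:
        # Remove keys that fell out of the window, starting with the oldest.
        while self._seen:
            _, first_seen = next(iter(self._seen.items()))
            if now - first_seen <= self._window:
                break
            self._seen.popitem(last=False)

    def filter(self, rows: Iterable[Row]) -> Iterator[Row]:
        for row in rows:
            now = self._clock()
            self._evict_expired(now)
            key = self._key_fn(row)
            if key in self._seen:
                continue  # duplicate inside the window: skip it
            self._seen[key] = now
            yield row


# Hypothetical IoT batch, sent once and then replayed by mistake.
batch = [
    {"device_id": "dev1", "ts": 1, "temp": 20.5},
    {"device_id": "dev1", "ts": 2, "temp": 20.7},
]
dedup = WindowedDeduplicator(
    key_fn=lambda r: (r["device_id"], r["ts"]),
    window_seconds=3600,
)
first_pass = list(dedup.filter(batch))  # both rows go through
replayed = list(dedup.filter(batch))    # the replay is filtered out
```

The window bounds memory use: a replay that arrives after the window has passed would not be caught, so the window has to be at least as long as the longest replay delay you expect.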