Hacker News new | ask | show | jobs
by YZF 407 days ago
How do you avoid creating duplicate rows in ClickHouse?

- What happens when your insertion fails but some of the rows are actually still inserted?

- What happens when your de-duplication server crashes before the new offset into Kafka has been recorded but after the data was inserted into ClickHouse?

2 comments

- We used our custom clickhouse sink which inserts records in batches using clickhouse native protocol (as recommend by clickhouse). Each insert is done in a single transaction so if an insertion has failed, partial record do not get inserted on clickhouse. - The way the system is architected this cannot happen. If the deduplication server crashes, the entire pipeline is stopped and nothing is inserted. Currently when we read a data successfully from Kafka into our internal NATs JS, we acknowledge the new offset into Kafka. And the deduplication and insertion happens after. The limitation currently is that if our system crashes before inserting into clickhouse (but after ack to kafka) we would not process this data. We are already working towards finding a solution for this.
Right. I think this is fundamental though. You can minimize the chance of duplicates but not avoid them completely under failure given ClickHouse's guarantees. Also note transactions have certain limitations as well (re: partitions).

I'm curious who your customers are. I work for a large tech company and we use Kafka and ClickHouse in our stack but we would generally build things in house.

ClickHouse provides idempotent atomic inserts, so the insertion of the same batch can be safely repeated. The client can provide its own idempotency token or rely on the block hash.
Afaik this is always best effort, e.g.: https://clickhouse.com/docs/operations/settings/settings#ins...

"For the replicated tables by default the only 100 of the most recent blocks for each partition are deduplicated"

This doesn't work under failure conditions either (again afaik), e.g. if the clickhouse server fails.

100 is the default, and can be changed at runtime.

The deduplication works regardless of server restarts, and it does not matter when a request goes to another replica, as it is implemented with a distributed consensus (RAFT) via clickhouse-keeper.

ah. interesting. So some hash of the batch is recorded in the distributed log after the batch has been written? to disk? Isn't there still a race there?

At least intuitively this seems very hard to guarantee something more than "at least once" but I might be missing something.

It is more complex.

The batch is written to a temporary directory. Then we have to do an atomic commit into three different places: on disk, which can be external storage as well (rename the temporary directory), in memory (to the data structure containing the snapshot), and in Keeper (which contains metadata, including these hashes, and is the only source of truth).

The metadata in Keeper is the only place that decides which data exists and when. The whole operation is done by committing the data files first, then committing to Keeper, then committing to memory, and then responding to the client.

Now you have to analyze what happens, and what have to be resolved, if the server is killed between these steps.

If it is killed after writing data files and before writing to Keeper, the data is not considered written, but we have garbage in the storage to collect.

If it is killed after writing data files and after writing to Keeper, but before committing to memory or answering to the client, the data is written, but the client does not know that. This is the situation when the client has to retry, and the retry will be deduplicated.

If the data is written to the storage, but the transaction to Keeper (the only source of truth) did not go through due to a network error after subsequent retries, the changes in the storage will be attempted to roll back straight ahead, and the client will receive an exception. If the client will not receive an exception due to another network error, the client will have to retry.

It is quite easy, if you look at it as Keeper is the central place, and the only place that does distributed transactions, and everything else is around it.