by goenning 937 days ago
If your ClickHouse ReplacingMergeTree returns twice the expected row count, it is because your query is wrong. You don’t need FINAL; just use aggregation in your queries, as per their docs.
2 comments

Hi. Sorry if my query offended you.

I ran exactly what ClickHouse recommends in their deduplication guide: https://clickhouse.com/docs/en/guides/developer/deduplicatio....

Of course you can also materialize with aggregations, just use a GROUP BY, or even force an OPTIMIZE of the table. But my point is that you don't really get exactly-once guarantees. Whoever is querying that table needs to be aware that a `SELECT * FROM tb` might contain duplicates and needs to adapt their queries accordingly.
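To make the point concrete, here is a toy model in Python. It is only a sketch of the idea, not how ClickHouse is implemented: `Row`, `parts`, `select_all` and `select_latest` are hypothetical names, and the "parts" are plain lists standing in for unmerged data parts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    key: int      # stands in for the table's sorting key
    version: int  # stands in for the optional version column
    value: str


# Two unmerged "parts": key 1 was inserted twice and not yet merged.
parts = [
    [Row(1, 1, "a"), Row(2, 1, "b")],
    [Row(1, 2, "a2")],
]


def select_all(parts):
    """Model of a plain `SELECT * FROM tb`: read every part as-is."""
    return [row for part in parts for row in part]


def select_latest(parts):
    """Model of query-time deduplication: keep the newest row per key."""
    latest = {}
    for row in select_all(parts):
        current = latest.get(row.key)
        # On equal versions the later insert wins.
        if current is None or row.version >= current.version:
            latest[row.key] = row
    return sorted(latest.values(), key=lambda r: r.key)


print(len(select_all(parts)))                            # 3
print([(r.key, r.value) for r in select_latest(parts)])  # [(1, 'a2'), (2, 'b')]
```

The plain read sees three rows for two keys until a merge happens, so the reader has to deduplicate at query time, which is the role FINAL or a GROUP BY plays.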

I believe there are zero people working with CH and ReplacingMergeTree who don’t know that they have to use FINAL or GROUP BY to get deduplicated data. It’s mentioned on the table engine page, in their knowledge base, everywhere.

Also, I have not recently seen anyone recommend against it. That might have been the case a few years ago, but the performance of FINAL has improved and it’s faster than the alternatives. People obviously suggest using plain MergeTree tables, but if there is no alternative, ReplacingMergeTree is the way to go.

Indeed, you should still aggregate even on MergeTree tables.

I'm not sure what it is about the database world that makes people happy to discuss their competitors while including mistakes or misinformation. It doesn't seem to happen in other parts of the industry.