Hacker News new | ask | show | jobs
by joshstrange 806 days ago
I feel like with all the Clickhouse praise on HN that we /must/ be doing something fundamentally wrong because I hate every interaction I have with Clickhouse.

* Timeouts (only 30s???) unless I used the cli client

* Cancelling rows - Just kill me, so many bugs and FINAL/PREWHERE are massive foot-guns

* Cluster just feels annoying and fragile don't forget "ON CLUSTER" or you'll have a bad time

Again, I feel like we must be doing something wrong but we are paying an arm and a leg for that "privilege".

7 comments

What is your use case? If you're deleting rows that already feels like maybe it's not the intended use case. I think about clickhouse as taking in a firehose of immutable data that you want to aggregate/analyze/report on. Let's say a million records per second. I'll make up an example, the orientation, speed and acceleration of every Tesla vehicle in the world in real time every second.
It's to power all our analytics. We ETL data into it and some data is write-once so we don't have updates/deletes but a number of our tables have summary data ETL'd into them which means cleaning up the old rows.

I'm sure CH shines for insert-only workloads but that doesn't cover all our needs.

You have already gotten excellent options from the other comments, but here's another one that's not been mentioned yet.

You may want to consider adjusting your partition key (if feasible) as a function of datetime so you can just drop a complete partition when required, rather than needing separate delete queries.

In my experience, it has proven to be a very quick and clean way to clear out older data.

Have you looked into the ReplacingMergeTree table engine? (Although we still needed to use FINAL with this one)
You can always use different databases for different use cases.

There are many applications that require extremely high insertion rates (millions of records per second), very large total number of rows (billions, trillions) and flexible/fast querying/aggregation with high read rates (100's of millions or higher rows/s) and that's sort of the sweet spot IMO for ClickHouse and where you'll be pressed to find alternatives. I'm sure it can be used in other situations but maybe there are more choices if you're in those.

>You can always use different databases for different use cases.

Unfortunately this is not always realistic, especially in large organizations, I know where I am there is a big push from top (i.e the IT budget people) to standardize everything they want to simplify licenses, support contracts etc.

I may not be doing cutting edge stuff (I work at an Industrial plant) but we do have mixed data use cases where it could be beneficial to use different dbs but realistically I don't see it happening.

CH works just fine for cleaning up rows: Delete with mutations sync=1, or use optimize with deduplicate by, or use aggregate trees and optimize final, or query aggregate tables with final=1.

Numerous ways to achieve removal of old/stale rows.

Sounds like you need to use a ReplacingMergeTree + final keyword.
Tableau
Most, though certainly not all, problems I see with ClickHouse usage come from pretending it is another database or that it is intended for other use cases.
> * Timeouts (only 30s???) unless I used the cli client

Almost all clients (client libraries) allow a configurable timeout. In server settings there is a max query time settings which can be adjusted if necessary: https://clickhouse.com/docs/en/operations/settings/query-com...

From the docs on FINAL:

> However, using FINAL is sometimes necessary in order to produce accurate results

Welp.

If you use tables like “ReplacingMergeTree” which _explicitly_ states that merges happen in the background, and non-merged rows _will_ be visible.

It’s a table design optimised for specific workloads, and the docs and design detail those tradeoffs.

We use it at work for workloads that can tolerate “retreading” over stale data, because it means they can efficiently write to the db without round tripping, or locking and row updates, and without the table growing massive. It works fantastically in our use case.

It's meant to store immutable data, and isn't great if you need low-latency updates. Also it's quirky in some ways.
> It's meant to store immutable data

I don't disagree, I feel like we might be using it wrong. We were trying to replace ES with it but it just doesn't feel like it fits our needed usecase.

How many rows do you have on average per day?
A couple million (<10M), I don't have a better number available right now. Not all (or even most) of those need cancelling rows thankfully.
You don’t need a cluster nor should you be having any issues you mentioned. I run 10x that volume daily on a single gcp box (8 core / 64GB). We migrated off BigQuery and went from $10k/mo to about $250/mo. And it’s faster for both low-latency and big slow queries.
The plan is to 10x that volume in the not too distant future but given what you've said I can believe we are horribly over-provisioned/over-scaled. Thank you!
Interesting about "Timeouts (only 30s???)" - most likely, this is a limitation configured explicitly for a user on your server. You can set it up with the `max_execution_time`, and by default, it is unlimited.

For example, I've set it up, along with many more limitations for my public playground https://play.clickhouse.com/, and it allows me to, at least, make it public and not worry much.

It could also be a configuration of a proxy if you connect through a proxy. ClickHouse has built-in HTTP API, so you can query it directly from the browser or put it behind Cloudflare, etc... Where do you host ClickHouse?

I can believe it's a config issue, I'll have to look into it. I didn't setup the cluster/dbs and when I asked about I was told "use the cli". I'll try to see if I can get that fixed.
Any chance you CH is proxied through a Heroku app? Heroku has 30s timeouts.
What foot guns have you run into with FINAL?
Just forgetting to use it or PREWHERE. Since queries run just fine without those you can think you have something working when you actually have duplicate rules.