by morelisp 1350 days ago
More helpful would be answers to my questions at https://news.ycombinator.com/item?id=33081502. async_insert is a relatively new feature (we're still using Buffer tables, for example), but also most of our "client" inserts actually go into Null-engine tables with multiple MVs attached. Those MVs often do some pre-aggregation before hitting our MTs as well. So we might insert a million rows, the MV aggregates that down to 50k, and then that gets inserted into five persistent tables, each with its own sharding/partitioning, so it blows up to 200k or so "rows" again. (And at some point those inserts will also get compacted into data inserted previously or concurrently, by the MT itself.)
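To make the fan-out above concrete, here is a rough Python sketch of the arithmetic. The reduction factor, the table names, and the per-table multipliers are made-up assumptions chosen only to reproduce the ballpark numbers in this comment; nothing here is measured from a real pipeline or derived from how ClickHouse actually counts work.

```python
from dataclasses import dataclass


@dataclass
class TargetTable:
    name: str
    # Hypothetical: rows written to this table per row leaving the MV.
    # Below 1.0 means the table's own key collapses rows further.
    rows_per_mv_row: float


def estimate_fan_out(client_rows: int, reduction_factor: int,
                     targets: list[TargetTable]) -> tuple[int, dict[str, int]]:
    """Return (rows after MV pre-aggregation, rows written per target table)."""
    # Pre-aggregation in the MV shrinks the insert by an assumed factor.
    aggregated = client_rows // reduction_factor
    # Each persistent table then receives its own copy, scaled by its multiplier.
    per_table = {t.name: round(aggregated * t.rows_per_mv_row) for t in targets}
    return aggregated, per_table


# Illustrative numbers only: 1M rows in, 20x reduction, five target tables.
targets = [TargetTable(f"mt_{i}", 0.8) for i in range(1, 6)]
aggregated, per_table = estimate_fan_out(1_000_000, 20, targets)
total_written = sum(per_table.values())
print(aggregated, per_table, total_written)
```

With these assumed inputs, one client insert of a million rows turns into 50k aggregated rows and about 200k rows written across the five tables, which is why a per-row or per-insert price would not map cleanly onto a pipeline like this.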

As I've said several times in this thread, I understand why you don't count inserts or rows. What I don't understand is what unit a WU actually corresponds to. In particular, I don't understand its relation to e.g. parts or blocks, which are the units one would focus on when optimizing self-hosted offerings.

1 comments

I think the optimizations you focus on for self-hosted ClickHouse are the same as for Cloud. In self-hosted, they help improve your throughput/capacity with fixed allocated resources. In Cloud, they directly affect cost.

For those complex pipelines you may find it more useful to run tests during the trial. Data distribution, partitioning and so on can change the actual cost significantly, so estimates can be too pessimistic or too optimistic.

> For those complex pipelines you may find it more useful to run tests during the trial.

Right, that's exactly what I don't want to deal with. Unless I have even just a ballpark estimate for complex pipelines, both before I commit to any sales crap and afterwards when we're designing new pipelines, it's just not an option for us at all. I have no clue if it's going to cost us $10, $100, or $10000.