by morelisp 1352 days ago
Can you clarify what a "write unit" is? Naively it sounds like it might be blocks x partitions x replicas that actually hit disk. (Which is also probably not very clear to people not already using CH, but I have at least middling knowledge of CH's i/o patterns and I have no clue what a "write unit" is from the page's description.)

One write unit is around 100..200 INSERT queries.

If you are doing INSERT in batches with one million rows, it will give

    SELECT formatReadableQuantity(1000000 * 100 / 0.0125)
    
    8.00 billion

inserted rows per dollar. Pretty good, IMO.

If you are doing millions of INSERT queries with one record each, without the "async_insert" setting, it will cost much more.

That's why we have "write units" instead of just counting inserts.
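To make the arithmetic above easier to play with, here is a rough sketch in Python. It assumes the 0.0125 in the query is the dollar price of one write unit, and that one write unit covers 100 to 200 INSERT queries regardless of batch size; neither is spelled out as an official billing formula in this thread, and the function name is made up for illustration.

```python
# Assumption: the 0.0125 divisor in the query above is the price in dollars
# of one write unit. This is inferred, not an official figure.
PRICE_PER_WRITE_UNIT = 0.0125

def rows_per_dollar(rows_per_insert: int, inserts_per_write_unit: int = 100) -> float:
    """Estimate how many inserted rows one dollar buys.

    inserts_per_write_unit defaults to 100, the low end of the
    "100..200 INSERT queries" range quoted above (hypothetical parameter).
    """
    rows_per_write_unit = rows_per_insert * inserts_per_write_unit
    return rows_per_write_unit / PRICE_PER_WRITE_UNIT

# Batches of one million rows, at both ends of the quoted range.
print(f"{rows_per_dollar(1_000_000, 100):,.0f} rows per dollar")
print(f"{rows_per_dollar(1_000_000, 200):,.0f} rows per dollar")

# One record per INSERT, assuming each INSERT is counted separately
# (that is, no async_insert batching on the server side).
print(f"{rows_per_dollar(1, 100):,.0f} rows per dollar")
```

Under these assumptions the gap between million-row batches and single-row inserts is a factor of one million, which is the point of billing in write units rather than rows.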

More helpful would be answers to my questions at https://news.ycombinator.com/item?id=33081502 - async_insert is a relatively new feature, we're still using buffer tables for example - but also most of our "client" inserts are actually onto multi-MV-attached null engines. Those MVs are also often doing some pre-aggregation before hitting our MTs as well. So we might insert a million rows, but the MV aggregates that down into 50k, but then that gets inserted into five persistent tables, each of which has its own sharding/partitioning so that blows up to 200k or something "rows" again. (And at some point those inserts are also going to get compacted into stuff inserted previously / concurrently by the MT itself.)

As I've said several times in this thread, I understand why you don't count inserts or rows. What I don't understand is what unit a WU actually corresponds to. In particular I don't understand its relation to e.g. parts or blocks, which are the units one would focus on when optimizing self-hosted offerings.

I think the optimizations you focus on for self-hosted ClickHouse are the same as for Cloud. In self-hosted, they help improve your throughput/capacity with fixed allocated resources. In Cloud, they directly affect cost.

For those complex pipelines you may find it more useful to run tests during the trial. Data distribution, partitioning and so on can change the actual cost significantly, so estimates can be too pessimistic or too optimistic.

> For those complex pipelines you may find it more useful to run tests during the trial.

Right, that's exactly what I don't want to deal with. Unless I have even just a ballpark estimate of complex pipelines both before I commit to any sales crap and afterwards when we're designing new pipelines, it's just not an option for us at all. I have no clue if it's going to cost us $10, $100, or $10000.

It's Tyler from ClickHouse.

Check out the response below that has a reference to some of our billing FAQs.

It doesn't mention anything about what a write unit is, except to say you can reduce write units by batching inserts (that part I guessed already.)

There's no way to think about what an actual write unit means. You could measure the costs on a sample workload, but that's far from ideal. Some transparency here would be nice.

I understand the answer is complicated, based on hairy implementation details, and subject to change. Give me the complexity and let me interpret it according to my needs.

Absolutely.

Working on updating the FAQ and tooltips now and sharing your feedback. <3

Right, that link covers read units which is also what I expected - essentially the number of files I have to touch - but I still have no clue about write units.

Is one block on one non-partitioned non-distributed table one write unit? What about one insert that's two blocks on such a table? What about one block on a null engine with two MVs listening to insert into two non-partitioned non-distributed tables? What if the table is a replacing mergetree, do I incur WUs for compactions? etc.

My worry is that it is essentially 1 WU = 1 new part file, which I understand makes sense to bill on but is tremendously opaque for users - at least I have no clue how often we roll new part files, instead I'm focused on total network and disk i/o performance on one side and client query latency on the other.

I can assure you that 1 WU is not 1 part. Not even close. You can check it using trial credits with your data.

For example, I just checked that uploading a 1.1 GB example table (cell_towers, with 14 columns) cost me 0.38 write units.

Then I'm even more confused, because the pricing page clearly says write operations consume at least one WU.

With analytical column store DBs the standard is to do massive batch writes of thousands to millions of records at a time, vs. inserting individual records. Inserting individual records is basically always crazy inefficient with column stores. So a single write is generally for thousands to millions of records.

Buddy, if you look just a couple posts up you'll see me comment on how ClickHouse's actual disk format works. You don't need to explain batching to me.

Nonetheless you can't insert a-whole-file-and-just-that-file in less than one write.

Where does it say that? The pricing page says on "Writes" in the info tooltip: "Each write operation (INSERT, DELETE, etc) consumes write units depending on the number of rows, columns, and partitions it writes to."

This doesn't imply to me that each individual INSERT costs 1 WU, but that it could be fractional. I guess it depends on how you read it?

The tooltip has been changed since my comment was posted; it's now not incorrect, but it still doesn't really tell me more useful information.

(See https://news.ycombinator.com/item?id=33081099 for the original wording.)