Hacker News new | ask | show | jobs
by thomoco 1352 days ago
I wanted to note that ClickHouse Cloud results are now also being reported in the public ClickBench results: https://benchmark.clickhouse.com/

Good to see transparent comparisons available now for Cloud performance vs. self-hosted or bare metal results as well as results from our peers. The ClickHouse team will continue to optimize further - as scale and performance is a relentless pursuit here at ClickHouse, and something we expect to be performed transparently and in a reproducible manner. Public benchmarking benefits all of us in the tech industry as we learn from each other in sharing the best techniques for attaining high performance within a cloud architecture

Full disclosure: I do work for ClickHouse, although have also been a past member of SPEC in developing and advocating for public, standardized benchmarks

9 comments

To help understand the results of the benchmark, I find it helpful to look at how the benchmark is constructed, and what it tests for. From the README:

"The dataset is represented by one flat table. This is not representative of classical data warehouses, which use a normalized star or snowflake data model. The systems for classical data warehouses may get an unfair disadvantage on this benchmark."

Taking a look at the queries [0], it looks like it mostly consists of full table scans with filters, aggregations, and sorts. Since it's a single table, there are no joins.

[0]: https://github.com/ClickHouse/ClickBench/blob/main/snowflake...

Can you clarify what a "write unit" is? Naively it sounds like it might be blocks x partitions x replicas that actually hit disk. (Which is also probably not very clear to people not already using CH, but I have at least middling knowledge of CH's i/o patterns and I have no clue what a "write unit" is from the page's description.)
One write unit is around 100..200 INSERT queries.

If you are doing INSERT in batches with one million rows, it will give

    SELECT formatReadableQuantity(1000000 * 100 / 0.0125)
    
    8.00 billion
inserted rows per dollar. Pretty good, IMO.

If you are doing millions of INSERT queries with one record, without "async_insert" setting, it will cost much more.

That's why we have "write units" instead of just counting inserts.

More helpful would be answers to my questions at https://news.ycombinator.com/item?id=33081502 - async_insert is a relatively new feature, we're still using buffer tables for example - but also most of our "client" inserts are actually onto multi-MV-attached null engines. Those MVs are also often doing some pre-aggregation before hitting our MTs as well. So we might insert a million rows, but the MV aggregates that down into 50k, but then that gets inserted into five persistent tables, each of which has its own sharding/partitioning so that blows up to 200k or something "rows" again. (And at some point those inserts are also going to get compacted into stuff inserted previously / concurrently by the MT itself.)

As I've said several times in this thread, I understand why you don't count inserts or rows. What I don't understand is what unit a WU does actually correspond to. In particular I don't understand its relation to e.g. parts or blocks, which are the units one would focus on optimizing self-hosted offerings.

I think optimizations that you focus on for self-hosted ClickHouse are the same as for Cloud. In self-hosted it helps to improve your throughput/capacity with fixed allocated resources. In cloud it directly affects cost.

For those complex pipelines you may find more useful to run tests during trial. Data distribution, partitioning and so on can change actual cost significantly so estimates can be too pessimistic or optimistic

> For those complex pipelines you may find more useful to run tests during trial.

Right, that's exactly what I don't want to deal with. Unless I have even just a ballpark estimate of complex pipelines both before I commit to any sales crap and afterwards when we're designing new pipelines, it's just not an option for us at all. I have no clue if it's going to cost us $10, $100, or $10000.

It's Tyler from ClickHouse.

Check out the response below that has a reference to some of our billing FAQs.

It doesn't mention anything about what a write unit is, except to say you can reduce write units by batching inserts (that part I guessed already.)

There's no way to think about what an actual write unit means. You could measure the costs on a sample workload, but that's far from ideal. Some transparency here would be nice.

I understand the answer is complicated, based on hairy implementation details, and subject to change. Give me the complexity and let me interpret it according to my needs.

Absolutely.

Working on updating the FAQ and tooltips now and sharing your feedback. <3

Right, that link covers read units which is also what I expected - essentially the number of files I have to touch - but I still have no clue about write units.

Is one block on one non-partitioned non-distributed table one write unit? What about one insert that's two blocks on such a table? What about one block on a null engine with two MVs listening to insert into two non-partitioned non-distributed tables? What if the table is a replacing mergetree, do I incur WUs for compactions? etc.

My worry is that it is essentially 1 WU = 1 new part file, which I understand makes sense to bill on but is tremendously intransparent for users - at least I have no clue how often we roll new part files, instead I'm focused on total network and disk i/o performance on one side and client query latency on the other.

I may assure you that 1WU is not 1 part. Not even close. You can check it using trial credits with your data.

For example, I just checked that uploading 1.1GB example table(cell_towers with 14 columns) cost me 0.38 write units.

Then I'm even more confused, because the pricing page clearly says write operations consume at least one WU.
With analytical column store DBs the standard is to do massive batches writes of thousands to millions of records at a time, vs. inserting individual records. Inserting individual records is basically always crazy inefficient with column stores. So a single write is generally for thousands to millions of records.
Where does it say that? The pricing page says on "Writes" in the info tooltip: "Each write operation (INSERT, DELETE, etc) consumes write units depending on the number of rows, columns, and partitions it writes to."

This doesn't imply to me that each individual INSERT costs 1 WU, but that it could be fractional. I guess it depends on how you read it?

Is lower time the right metric here? Seems normalizing per price would make a more useful metric for big data as long as the response time is reasonable
Yes, ClickBench results are presented as Relative Time, where lower is better. You can read more on the specifics of ClickBench methodology in the GitHub repository here: https://github.com/ClickHouse/ClickBench/

There are other responses from ClickHouse in the comments on the pricing, so I'll defer to their expertise on that topic there. Thank you for your feedback and ideas, as normalizing a price-based benchmark is an interesting concept (and where ClickHouse would expect to lead also given the architecture and efficiency)

This benchmark focuses on analytical query latency for representative analytical queries, so yes - lower number is better.
Wow, I hadn't heard of StarRocks before... seems like an interesting competitor.

https://starrocks.io/blog/clickhouse_or_starrocks

See the SelectDB built from Apache Doris and by the creators of Apache Doris. the performance is amazing. https://en.selectdb.com/blog/SelectDB%20Topped%20ClickBench%...
Looks really cool! Great work!

Had a pricing question. Say we connected a log forwarder like Vector to send data to Clickhouse Cloud, once per second. If each write unit is $0.0125, and we execute 86,400 writes over the course of the day, would we end up spending $1080? Do you only really support large batch, low frequency writes?

Hi Cliff - A write unit does not correspond to a single INSERT. A single INSERT with less than 16MB of data generates ~0.01 “write unit”, so a single “write unit” typically corresponds to ~100 INSERTs. In your example, that would come closer to $11 a day. Depending on how small of batches you plan to write in that examples, there may be ways to reduce that spend further, by batching even more or turning on "async_insert" inside ClickHouse.
> and advocating for public, standardized benchmarks

For full transparency, I think you should do the same in ClickHouse. Or is there a strong reason not to run benchmarks on standard analytical workloads like TPC-H, TPC-DS or SSB?

You can't post results of TPC benchmarks without official audit. So it complicates posting results. You can't find common names that are usually compared with ClickHouse there [1]. So open standardized ClickBench tries to encourage benchmarking for everyone.

There are numerous benchmarks that use similar to TPC queries, but those are not standardized and can be misleading. For example a lot of work was done by Fivetran to get this report [0], but they show only overall geomean for those systems and you can't understand how they actually differ. Anyway their queries are not original TPC - variables are fixed in queries, they run first query when official query is a multiquery.

Contributors from Altinity run SSB with flattened and original schemas [2]. SSB is not well standardized and we see a lot of pairwise comparisons with controversial results - generally you can't just reproduce them and get all the results in single place for the same hardware.

[0] https://www.fivetran.com/blog/warehouse-benchmark [1] https://www.tpc.org/tpcds/results/tpcds_results5.asp?orderby... [2] https://altinity.com/blog/clickhouse-nails-cost-efficiency-c...

There is a good reference to the available benchmarks for analytical databases: https://github.com/ClickHouse/ClickBench#similar-projects
On couple of occasions I've seen TPC-H benchmarks with the remark that the results are not audited. Is that not possible?
License states the following. All other modifications are not standardized and you can't just compare systems. Otherwise there would be another standardized benchmark in the list you propose to run and publish.

>c. Public Disclosure: You may not publicly disclose any performance results produced while using the Software except in the following circumstances: (1) as part of a TPC Benchmark Result. For purposes of this Agreement, a "TPC Benchmark Result" is a performance test submitted to the TPC, documented by a Full Disclosure Report and Executive Summary, claiming to meet the requirements of an official TPC Benchmark Standard. You agree that TPC Benchmark Results may only be published in accordance with the TPC Policies. viewable at http: //www.tpc.org (2) as part of an academic or research effort that does not imply or state a marketing position (3) any other use of the Software, provided that any performance results must be clearly identified as not being comparable to TPC Benchmark Results unless specifically authorized by TPC.

I see, thanks for the context, it seems like a PITA.

But given that each database system has its own flavor of SQL, vanilla TPC benchmarks may not work out of the box so one needs to tweak them a bit and this might be what actually disqualifies the published results from all of the clauses from above being applicable.

I can also anticipate that combination of clause (2) and (3) is what some that publish the results are also taking advantage of.

[1] https://www.oracle.com/mysql/heatwave/performance/ [2] https://www.singlestore.com/blog/tpc-benchmarking-results/ [3] https://docs.pingcap.com/tidb/v6.2/v5.4-performance-benchmar... [4] https://www.monetdb.org/blogs/learning-from-benchmarking/

Why are you using 'threads' instead of vcpus or aws instances like it was for other benchmarks? Thats really hard to compare and add suspicions here.
It is related to the "max_threads" setting of ClickHouse, and by default, it is the number of physical CPU cores, which is twice lower as the number of vCPUs.

For example, the c6a.4xlarge instance type in AWS has 16 vCPUs, 8 cores and "max_threads" in ClickHouse will be 8.

Interesting set of results. Ignoring ClickHouse, StarRocks seems to be better in almost all metrics.

I was curious to compare MonetDB, DuckDB, ClickHouse-Local, Elasticsearch, DataFusion, QuestDB, Timescale, and Athena. Amazingly, MonetDB shows up better than DuckDB in all metrics (except storage size), and Athena holds its own and fares admirably well, esp given that it is stateless. While, Timescale and Quest did not come up as good as I hoped they would.

https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQXRoZW5hIC...

It'd be interesting to see how rockset, starburst (presto/trino), and tiledb fare, if and when they get added to the benchmark.

The particular way in which the data is loaded into DuckDB and the particular machine configuration on which it is run triggers a problem in DuckDB related to memory management. Essentially the standard Linux memory allocator does not like our allocation pattern when doing this load, which causes the system to run out-of-memory despite freeing more memory than we allocate. More info is provided here [1].

As it is right now the benchmark is not particularly representative of DuckDB's performance. Check back in a few months :)

[1] https://github.com/duckdb/duckdb/issues/3969#issuecomment-11...

Thanks. Btw, we use DuckDB (via Node/Deno) for analytics (on Parquet/JSON), and so I must point out that despite the dizzying variation among various language bindings (cpp and python seem more complete), the pace of progress, given the team size, is god-like. It has been super rewarding to follow the project. Also, thanks for permissively licensing it (unlike most other source-available databases).

Goes without saying, if there are cost advantages to be had due to DuckDB's unique strengths, then serverless DuckDB Cloud couldn't come here soon enough.

> despite freeing more memory than we allocate

> despite DuckDB freeing more buffers than it is allocating

Can you please clarify how is that even possible?

We are allocating and freeing buffers repeatedly. Despite freeing more buffers than we allocate, memory usage might still increase because of internal fragmentation in the allocator. Essentially, fragmentation might create "unused" space that does take up space. This phenomenon is called heap fragmentation [1].

[1] https://cpp4arduino.com/2018/11/06/what-is-heap-fragmentatio...

> Despite freeing more buffers than we allocate

Technically, I hope you understand that this isn't possible but maybe I am misinterpreting what you're trying to say.

  auto buff = malloc(N);
  free(buff);
  free(buff);
is one way to free "more" buffers than allocated but this will lead to an UB and depending on the underlying system allocator implementation it may or may not crash.

However, given how silly this would be I believe this is not what you're trying to convey?

Here's what mytherin wrote, ...we are allocating and freeing buffers repeatedly. Despite freeing more buffers than we allocate...

So, I assume, the context is, DuckDB allocates x buffers, frees x - m buffers at some point later, then allocates n buffers where n <<<< m, and yet malloc fails.

In the GitHub thread mytherin linked to above, Alexey Milovidov, ClickHouse CTO, points out that ClickHouse uses jemalloc and makes for a better choice than glibc malloc given the issue with fragmentation. It is likely that DuckDB switches to jemalloc, too.

You are misinterpreting it indeed.

The scenario I am describing is roughly the following:

Suppose we allocate 100K buffers that all have an equal size, and our memory usage is now 10GB. After that point we free 20K buffers, but allocate 10K more. In other words, from that point on we are freeing more buffers than we are allocating.

Now, since we are freeing more than we are allocating, you would expect our memory usage to go down. However, when using the standard glibc malloc on Linux, our memory usage unexpectedly goes up. After this happens several times in a row the system runs out of memory and new calls to malloc fails.

surprised Spark isn't there