by ryanbooz 1697 days ago
Hello @PeterZaitsev!

Actually Altinity is the one that contributed the bits to TSBS for benchmarking ClickHouse[1], so we are using the work that they contributed (and anyone is welcome to make a PR for updates or changes). We also had a former ClickHouse engineer look at the setup to verify it matched best practices with how CH is currently designed, given the TSBS dataset.

As for the optimizations in the article you pointed to from 2019 (specifically how to query "last point" data more efficiently in ClickHouse), it uses a different table engine (AggregatingMergeTree) and a materialized view to get better query response times for this query type.

We (or someone in the community) could certainly add that optimization to the benchmark, but it wouldn't be using raw data, which we didn't think was appropriate for the benchmark analysis. But if one wanted to use that optimization, then one should also use Continuous Aggregates for TimescaleDB (i.e., for an apples-to-apples comparison), which I think would also lead to results similar to what we show today.

It's actually something we've talked about adding to TSBS for TimescaleDB (as an option to turn on/off) and maybe other DBs could do the same.

[1]: https://github.com/timescale/tsbs/pull/26


Thank you for your prompt response!

I think the most important thing is that ClickHouse is NOT designed for small-batch insertion; if you need to do 1000s of inserts/sec, you put a queue in front of ClickHouse. And query speed can be impacted by batch size a lot. So have you looked at query performance with an optimal batch size?
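As a rough illustration of the "queue in front of ClickHouse" idea, here is a minimal in-memory sketch. `BatchBuffer`, `flush_fn`, and `max_rows` are hypothetical names for illustration only; `flush_fn` stands in for whatever actually performs the bulk INSERT, and a production setup would more likely use a durable queue and a time-based flush as well.

```python
from typing import Callable, List, Sequence

Row = Sequence[object]


class BatchBuffer:
    """Collect individual rows and hand them to `flush_fn` in large batches."""

    def __init__(self, flush_fn: Callable[[List[Row]], None], max_rows: int = 10_000) -> None:
        self.flush_fn = flush_fn      # hypothetical: would issue one bulk INSERT
        self.max_rows = max_rows      # rows to accumulate before flushing
        self._rows: List[Row] = []

    def add(self, row: Row) -> None:
        """Buffer one row; flush automatically once the batch is full."""
        self._rows.append(row)
        if len(self._rows) >= self.max_rows:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered as a single batch (no-op if empty)."""
        if not self._rows:
            return
        batch, self._rows = self._rows, []
        self.flush_fn(batch)


# Usage: record batches in a list instead of talking to a real database.
sent: List[List[Row]] = []
buf = BatchBuffer(sent.append, max_rows=4)
for i in range(10):
    buf.add((i, float(i)))
buf.flush()  # push the final partial batch
print([len(b) for b in sent])  # [4, 4, 2]
```

The point of the design is only that many small writes from producers become a few large writes to the database; the threshold of 4 is chosen to keep the example short.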

Yep! The blog post includes data and graphs from both large (5000-15,000 rows / batch) and small (100-500 rows / batch) sizes. Please see the section "Insert Performance". Thanks!

https://blog.timescale.com/blog/what-is-clickhouse-how-does-...

1) This is also a small batch size. If you're inserting 500,000 rows/sec, 5,000 rows is not a particularly large batch size.

2) I see different graphs for ingest but not for queries. The data layout will depend on the batch size, unless of course you did OPTIMIZE before running queries
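To put point 1) in numbers, here is a quick back-of-the-envelope calculation using the 500,000 rows/sec figure from the comment and the batch sizes mentioned in the thread. The helper name `inserts_per_second` is just for illustration.

```python
def inserts_per_second(rows_per_second: int, rows_per_batch: int) -> float:
    """Number of INSERT statements per second needed to sustain a row rate."""
    return rows_per_second / rows_per_batch


# Batch sizes mentioned in the thread, at the 500,000 rows/sec rate from point 1).
for batch in (100, 500, 5_000, 15_000):
    print(batch, round(inserts_per_second(500_000, batch), 1))
# 100 5000.0
# 500 1000.0
# 5000 100.0
# 15000 33.3
```

So even at 5,000 rows per batch, that ingest rate still means about 100 separate inserts every second.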

1) You're absolutely right. 5k rows isn't "large". We also mentioned that we did hundreds of tests, often going between 5k and 15k rows/batch. The overall ingest/query cycle didn't change dramatically in any of these. That is, 5k rows was within a few percent of 10k rows. Interestingly, the benchmarks that Altinity has only used 10k rows/batch (which we also did; it just didn't have any major impact in the grand scheme of things).

2) We did not specifically call OPTIMIZE before running queries. Again, learning from the leaders at Altinity and their published benchmarks, I don't see any references indicating that they did either, nor does the TSBS code appear to call it after ingest.

Happy to try both of these during our live stream next week to demonstrate and learn!

Altinity benchmark (10k rows/batch mention): https://altinity.com/blog/clickhouse-for-time-series