Hacker News
by breadwinner 600 days ago
If you're evaluating ClickHouse, take a look at Apache Pinot as well. ClickHouse was designed for single-machine installations, although it has since been enhanced to support clusters. That support is lacking, though: for example, if you add nodes it is not easy to redistribute data. Pinot is much easier to scale horizontally. Also take a look at Pinot's star-tree indexes [1]. If you're doing multi-dimensional analysis (pivot tables etc.), there is a huge difference in performance if you take advantage of star-tree.

[1] https://docs.pinot.apache.org/basics/indexing/star-tree-inde...
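For readers unfamiliar with the idea, here is a rough Python sketch of why pre-aggregation helps pivot-style queries. It is not Pinot's implementation: a real star-tree index is a tree with split thresholds and does not materialize every combination. The function name `build_preaggregates`, the `STAR` wildcard, and the sample rows are all made up for illustration.

```python
from collections import defaultdict
from itertools import product

STAR = "*"  # hypothetical wildcard meaning "any value of this dimension"

def build_preaggregates(rows, dimensions, metric):
    """Sum `metric` for every mix of concrete and wildcard dimension values."""
    cube = defaultdict(float)
    for row in rows:
        # Each row contributes to its own value and to the wildcard, per dimension.
        choices = [(row[d], STAR) for d in dimensions]
        for key in product(*choices):
            cube[key] += row[metric]
    return dict(cube)

# Made-up sample data.
rows = [
    {"country": "US", "browser": "Chrome", "clicks": 10},
    {"country": "US", "browser": "Firefox", "clicks": 5},
    {"country": "DE", "browser": "Chrome", "clicks": 7},
]
cube = build_preaggregates(rows, ["country", "browser"], "clicks")

# "Clicks per country" becomes a lookup instead of a scan over the raw rows.
us_clicks = cube[("US", STAR)]
```

The trade-off is extra storage and ingestion work in exchange for query-time lookups, which is the general shape of the performance difference described above.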

4 comments

> ClickHouse was designed for single-machine installations

This is incorrect. ClickHouse is designed for distributed setups from the beginning, including cross-DC installations. It has been used on large production clusters even before it was open-sourced. When it became open-source in June 2016, the largest cluster was 394 machines across 6 data-centers with 25 ms RTT between the most distant data-centers.

On a side note, can someone please comment on this part

> for example if you add additional nodes it is not easy to redistribute data.

This is precisely one of the issues I predict we'll face with our cluster as we're ramping up OTEL data and it's being sent to a small cluster, and I'm deathly afraid that it will continue sending to every shard in equal measure without moving existing data around. I cannot find any good method of redistributing the load other than "use the third party backup program and pray it doesn't shit the bed".
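To make the worry concrete, here is a toy Python model, not ClickHouse code. It assumes new rows are split across shards in proportion to a per-shard weight and that existing rows never move. The row counts are invented, and `distribute` is a hypothetical helper. As far as I know ClickHouse cluster configs do have a per-shard `weight` setting for writes, but check the docs before relying on it.

```python
def distribute(existing, new_rows, weights):
    """Split new_rows across shards proportionally to weights.

    Existing rows stay where they are. Integer division may drop a
    remainder, so the example uses numbers that divide evenly.
    """
    total = sum(weights)
    return [rows + new_rows * w // total for rows, w in zip(existing, weights)]

# Two full shards plus one freshly added, empty shard (made-up numbers).
existing = [500, 500, 0]

# Equal weights: the new shard never catches up.
equal = distribute(existing, 300, [1, 1, 1])

# Skewed weights: most new rows go to the new shard until it fills up.
skewed = distribute(existing, 300, [1, 1, 8])
```

With equal weights the result is `[600, 600, 100]`; with the skewed weights it is `[530, 530, 240]`. Neither moves old data, which is the actual complaint.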

It's like saying that postgres was designed for distributed setups just because there are large postgres installations. We all understand that clickhouse (and postgres) are great databases. But it's strange to call them designed for distributed setups. How about insertion not through a single master? Scalable replication? And a bunch of other important features, not just the ability to keep independent shards that can be queried in a single query.
ClickHouse does not have a master replica (every replica is equal), and every machine processes inserts in parallel. It allocates block numbers through the distributed consensus in Keeper. This allows for a very high insertion rate, with several hundred million rows per second in production. The cluster can scale both by the number of shards and by the number of replicas per shard.

Scaling by the number of replicas of a single shard is less efficient than scaling by the number of shards. For ReplicatedMergeTree tables, due to physical replication of data, it is typically less than 10 replicas per shard, where 3 replicas per shard are practical for servers with non-redundant disks (RAID-0 and JBOD), and 2 replicas per shard are practical for servers with more redundant disks. For SharedMergeTree (in ClickHouse Cloud), which uses shared storage and does not physically replicate data (but still has to replicate metadata), the practical number of replicas is up to 300, and inserts scale quite well on these setups.

Absolutely incorrect. ClickHouse was created by Yandex and it's cluster ready from day one.
Or Apache Doris, which sounds more ClickHouse-y in its performance properties from what I've read

(disclaimer: I have not used either yet)

Plus it has a MySQL-flavoured client connector where ClickHouse does its own thing, so it may be easier to integrate with some existing tools.

What's the use case? Analytics on humongous quantities of data? Something besides that?
The use case is "user-facing analytics". For example, consider ordering food from Uber Eats: you have thousands of concurrent users, latency should be in milliseconds, and things like the delivery time estimate must be updated in real time.

Spark can do analysis on huge quantities of data, and so can Microsoft Fabric. What Pinot can do that those tools can't is extremely low latency (milliseconds vs. seconds), concurrency (1000s of queries per second), and ability to update data in real-time.

Excellent intro video on Pinot: https://www.youtube.com/watch?v=_lqdfq2c9cQ

I don't think Uber's estimated time-to-arrival is a statistic that a database vendor, or development team, should brag about. It's horribly imprecise.
Also, it's nothing that a (geo)sharded postgres DB with the appropriate indexes couldn't handle with aplomb. The number of orders to a given restaurant can't be more than a dozen a minute or so.
Especially as restaurants have a limit on their capacity to prepare food. You can't just spin up another instance of a staffed kitchen. Do these mobile-food-ordering apps include any kind of back-off on order acceptance, e.g. "Joe's Diner is too busy right now, do you want to wait or try someplace else?"
Sometimes you’ll also have a situation where your food is prepared quickly but no drivers want to pick up the food for a while.

At least it used to be like that a few years ago.

https://www.reddit.com/r/UberEATS/comments/nucd2x/no_tip_no_...

https://www.reddit.com/r/UberEATS/comments/rtn2xe/no_tip_no_...

https://www.reddit.com/r/UberEATS/comments/uce6cs/orders_sit...

https://www.reddit.com/r/doordash/comments/17ojre0/doordash_...

https://www.reddit.com/r/doordash/comments/np6rik/you_have_e...

https://www.reddit.com/r/doordash/comments/o32nl4/no_tip_ord...

Don’t know if the situation has improved since.

This happens because Uber Eats and DoorDash and others have/had this concept where you’d “tip” for the delivery. Which is actually not a tip, but just a shitty way of disguising delivery fees and pitting customers against the people that deliver the food. But that in turn has its background in how the restaurant business treats its workers in the USA, which has been wacky even long before these food delivery apps became a thing.

Anyway, regardless of your opinion on “tipping” and these practices the point was to say that there are additional complications with how much time it will take for your order to arrive aside from just the time it takes to prepare the food and the time it takes to travel from the restaurant to your door, even when the food has been prepared and a delivery driver is right there at the restaurant. If the “tip” is too low, or zero, your order could be left sitting on the shelf with nobody willing to pick it up. At least a few years ago it was like that.

What about its ability to choose pricing based on source-destination and projected incomes?
All you need for this is a dictionary of zip codes and a rating -- normal, high, very high. Given that ZIPs are 5 digits, that's 100,000 records max, just keep it in memory, you don't even need entries for the "normal" ZIPs. Even if you went street-level, I doubt you'd catalog more than a few hundred thousand streets whose income is significantly more than the surrounding area.
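A minimal sketch of that in-memory lookup in Python, assuming three tiers and a default of "normal" for anything not listed. The ZIP codes, tiers, and the `pricing_tier` name are placeholders, not real data or anyone's actual pricing logic.

```python
# Hypothetical entries; only non-"normal" ZIPs need to be stored.
ZIP_TIERS = {
    "11111": "very high",
    "22222": "high",
}

def pricing_tier(zip_code: str) -> str:
    """Return the tier for a ZIP code; unlisted ZIPs fall back to 'normal'."""
    return ZIP_TIERS.get(zip_code, "normal")
```

Even at the full 100,000 possible five-digit ZIPs, a dict like this fits comfortably in memory on any server.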

All of this ignores the fact that adjusting a restaurant's prices by the customer's expected ability to pay often leads to killing demand among your most frequent and desirable clientele, but that's a different story.

I thought “humongous quantities of data” was a baseline assumption for a discussion involving clickhouse et al.?
It was a genuine question. I was really curious about other use cases besides the obvious one.