| HN Mirror



	by formalreconfirm 391 days ago
	That's the part I don't really get. In the Manifesto they talk about scaling to hundreds of terabytes and thousands of compute nodes. But DuckDB compute nodes, however performant, are in the end single nodes, so even if your lakehouse holds terabytes of data, each query is limited by the capacity of your biggest client (I know DuckDB works well with data bigger than memory, but I suppose it still hits limits at some point). In the end I think DuckLake is aimed at lakehouses of "reasonable" size, the same way DuckDB is intended for data of "reasonable" size.

2 comments

dkdcio 391 days ago

Huge "it depends", but typically organizations are not querying all of their data at once. Usually, they're processing it in some time-based increments.

Even if it's in the TB range, we're at the point where high-spec laptops can handle it (my own benchmarking: https://ibis-project.org/posts/1tbc/). When I tried running TPC-H queries at 10TB on large cloud VMs I did hit some malloc (or other memory) issues, but that was a while ago and I imagine DuckDB can fly past that these days too. Single-node definitely has limits, but it's hard to see how 99%+ of organizations really need distributed computing in 2025.

mrbungie 391 days ago

You can run a fleet of DuckDB instances and process data in a partitioned way.

Yes, there must be some use cases where you need all the data loaded up and addressable seamlessly across a cluster, but those are rare and typically FAANG-class problems.
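
For illustration, the "fleet of instances over partitions" idea can be sketched roughly as below. This is a minimal sketch under stated assumptions, not how DuckDB or DuckLake schedules anything: the standard-library sqlite3 module stands in for a single-node engine so the example runs without extra dependencies, the rows are made-up sample data, and the names partition_by_month, worker and fleet_average are hypothetical. Each worker sees only its own month of data and returns a partial SUM and COUNT, and the coordinator merges those partials into a global average.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Made-up sample rows: (month, amount). In a real lakehouse each partition
# would be a set of files on object storage, not a Python list.
ROWS = [
    ("2025-01", 10.0), ("2025-01", 30.0),
    ("2025-02", 20.0),
    ("2025-03", 40.0), ("2025-03", 50.0), ("2025-03", 60.0),
]


def partition_by_month(rows):
    """Group rows by their month key, one partition per month."""
    parts = {}
    for month, amount in rows:
        parts.setdefault(month, []).append((month, amount))
    return parts


def worker(rows):
    """Stand-in for one single-node engine: aggregate one partition only."""
    con = sqlite3.connect(":memory:")  # each worker owns a private database
    try:
        con.execute("CREATE TABLE events (month TEXT, amount REAL)")
        con.executemany("INSERT INTO events VALUES (?, ?)", rows)
        total, count = con.execute(
            "SELECT SUM(amount), COUNT(*) FROM events"
        ).fetchone()
        return total, count
    finally:
        con.close()


def fleet_average(rows, max_workers=3):
    """Fan partitions out to workers, then merge their partial aggregates."""
    parts = partition_by_month(rows)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(worker, parts.values()))
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


if __name__ == "__main__":
    print(fleet_average(ROWS))
```

The merge step works here because an average decomposes into per-partition sums and counts. Queries that do not decompose this way (large joins across partitions, exact global distinct counts) are the cases where the workers would need to exchange data with each other.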