by formalreconfirm
391 days ago
That's the part I don't really get. In the Manifesto they talk about scaling to hundreds of terabytes and thousands of compute nodes. But DuckDB compute nodes, however performant, are ultimately single nodes, so even if your lakehouse contains terabytes of data, you will be limited by the capacity of your biggest client (I know DuckDB works well with data bigger than memory, but I suppose it still hits limits at some point). In the end, I think DuckLake is aimed at lakehouses of "reasonable" size, the same way DuckDB is intended for data of "reasonable" size.

Even if it's in the TB range, we're at the point where high-spec laptops can handle it (my own benchmarking: https://ibis-project.org/posts/1tbc/). When I tried to go up to 10 TB TPC-H queries on large cloud VMs I did hit some malloc (or other memory) issues, but that was a while ago and I imagine DuckDB can fly past that these days too. Single-node definitely has limits, but it's hard to see how 99%+ of organizations really need distributed computing in 2025.