Hacker News new | ask | show | jobs
by isignal 397 days ago
Aren't the alternatives you mentioned - icerberg and duckdb - both storage solutions while spark is a way to express distributed compute? I'm a bit out of touch with this space, is there a newer way to express distributed compute?
5 comments

duckdb is primarily a query engine. It does have a storage format, but one of it's strengths is querying data where it already resides (e.g. a parquet file sitting in S3).

There are some examples[0] of enabling DuckDB to manage distributed workloads, but these are pretty experimental.

0 - https://www.definite.app/blog/smallpond

Thanks for the pointers!
I think what many people are finding out is they don’t really need distributed processing. DuckDB on a single node can get you really far, and it’s much simpler.
DuckDB is not only a storage solution. It can directly query a variety of file formats at rest, without having to re-store anything. That's one of its selling points: you can query across archival/log data stored in S3 (or wherever) without needing to "ingest" anything or double-pay to duplicate the data you've already stored.
I’m just getting into DuckDB lately and finding this feature so exciting. It’s a totally new paradigm. Such a great tool for scientists, and probably many other people. I wish I took it seriously sooner.
Not a new way like Ray, but a new way to express Spark super-efficiently (GPU-acceleration): https://news.ycombinator.com/item?id=43964505
Flink. It has more momentum than Spark right now.
"momentum" is a tricky word. Zig has more momentum than C++, but will it ever overtake the language? I'd bet not.
Well its not a tricky word it just wrong. Velocity maybe. Or more probably acceleration.
Flink is designed around streaming first, while Spark is built around batch first and you're likely best off selecting accordingly. Though any streaming application likely needs batch processing to some degree. Latency vs throughput.