| HN Mirror

	by isignal 443 days ago
	Aren't the alternatives you mentioned, Iceberg and DuckDB, both storage solutions, while Spark is a way to express distributed compute? I'm a bit out of touch with this space; is there a newer way to express distributed compute?

5 comments

mritchie712 443 days ago

DuckDB is primarily a query engine. It does have a storage format, but one of its strengths is querying data where it already resides (e.g. a Parquet file sitting in S3).

There are some examples[0] of enabling DuckDB to manage distributed workloads, but these are pretty experimental.

[0] https://www.definite.app/blog/smallpond

isignal 443 days ago

Thanks for the pointers!

robertlacok 443 days ago

I think what many people are finding out is they don’t really need distributed processing. DuckDB on a single node can get you really far, and it’s much simpler.

tomjakubowski 443 days ago

DuckDB is not only a storage solution. It can directly query a variety of file formats at rest, without having to re-store anything. That's one of its selling points: you can query across archival/log data stored in S3 (or wherever) without needing to "ingest" anything or double-pay to duplicate the data you've already stored.

steve_adams_86 443 days ago

I’m just getting into DuckDB lately and finding this feature so exciting. It’s a totally new paradigm. Such a great tool for scientists, and probably many other people. I wish I’d taken it seriously sooner.

winwang 443 days ago

Not a new way like Ray, but a new way to run Spark super-efficiently (GPU acceleration): https://news.ycombinator.com/item?id=43964505

Nate75Sanders 443 days ago

Flink. It has more momentum than Spark right now.

mgfist 443 days ago

"momentum" is a tricky word. Zig has more momentum than C++, but will it ever overtake the language? I'd bet not.

franktankbank 442 days ago

Well, it's not a tricky word, it's just the wrong one. Velocity, maybe. Or more probably acceleration.

lamp_book 443 days ago

Flink is designed streaming-first, while Spark is built batch-first, so you're likely best off choosing based on your workload. That said, any streaming application likely needs batch processing to some degree. It comes down to latency vs. throughput.
