by craydandy 590 days ago

Interesting and well-written article. Thanks to the author for writing it. Replacing Spark with these single-machine tools seems to be all the hype right now, and Spark is not en vogue anymore.

The author ran Spark in Fabric, which has V-Order write enabled by default. DuckDB and Polars don't have this, as it's an MS proprietary algorithm. V-Order adds about 15% overhead to writes, so it does change the results a bit.

The data sizes were a bit on the large side, at least compared to the data volumes I see daily. There definitely are tables in the 10 GB, 100 GB, and even 1 TB size range, but most tables traveling through data pipelines are much smaller.

mwc360 590 days ago

FYI, I had V-Order and Optimized Write disabled in the benchmark. The only write difference was that I enabled deletion vectors in Spark, since Spark supports them and the other two don't.

craydandy 589 days ago

Thanks for the clarification. I didn't see it in the article.