Hacker News
by thundergolfer 1824 days ago
Exactly correct. I’ve got a post in the works called “Elegy for Hadoop” that traces the history back to the early 2000s and arrives at the present day, where you can easily get on-demand instances with 500 GB of RAM and use them only for your application’s lifetime. If you want 1000 GB instead of 500 GB, it does not cost 5x, it costs 2x, which largely invalidates the “need to use excess commodity hardware” premise of the distributed MapReduce architecture.

Edit: I don’t mean to suggest that there is no reason to use Spark, but ~95% of the usage in industry is unnecessary now and should be avoided.

2 comments

Is there anything you can say about Spark for data engineering (ETL)?

The most common reason for Spark use today is ETL + data lakes (i.e., cloud object stores and ETL in/out).

It seems actual analysis is happening in fast databases that receive data from the object stores.

Can anyone here comment on this paradigm?

I don't have much insight into Spark, but I've been using Dataflow/Beam for ETL. It's been a pretty good experience. It follows the style of spinning up compute to process as needed, then shutting it down.
I predicted 99.99%.