by d_burfoot 1830 days ago

There are two lessons you could learn from this episode:

1. Use shallow trees and the clever workaround presented in the article.

2. Don't use Spark for tasks that require complex logic.

People should trace out the line of reasoning that leads them to use tools like Spark. It is convoluted and contingent - it goes back to work done at Google in the early 2000s, when the key to getting good price / performance was using a large number of commodity machines. Because they were cheap, these machines would break often, so you needed some really smart fault tolerance technology like Hadoop/HDFS, which was followed by Spark.

The current era is completely different. Now the key to good price / performance is to light up machines on-demand and then shut them down, only paying for what you use - and perhaps using the spot market. You don't need to worry about storage - that's taken care of by the cloud provider, and you can't "bring the computation to the data" like in the old days, removing one of the big advantages of Hadoop/HDFS. Because jobs are doing mostly IO and networking, and because computers are just more resilient nowadays, they rarely fail because of hardware errors. So almost the entire rationale that led to Hadoop/HDFS/Spark is gone. But people still use Spark - and put up with "accidentally exponential behavior" - because the tech industry is so dominated by groupthink and marketing dollars.

3 comments

thundergolfer 1830 days ago

Exactly correct. I’ve got a post in the works called “Elegy for Hadoop” that traces the history back to the early 2000s and arrives at the present day, where you can easily get on-demand instances with 500 GB of RAM and use them only for your application’s lifetime. If you want 1000 GB instead of 500 GB, it does not cost 5x, it costs 2x, significantly invalidating the “need to use excess commodity hardware” premise of the distributed map reduce architecture.

Edit: I don’t mean to suggest that there is no reason to use Spark, but ~95% of the usage in industry is unnecessary now and should be avoided.

mjburgess 1829 days ago

Is there anything you can say about Spark for Data Engineering (/ETL)?

The most common reason for Spark use today is ETL + data lakes (i.e., cloud object stores and ETL in/out).

It seems actual analysis is happening in fast databases that receive data from the object stores.

Can anyone here comment on this paradigm?

xentripetal 1829 days ago

I don't have much insight into Spark, but I've been using Dataflow/Beam for ETL. It's been a pretty good experience. It follows the style of spinning up compute to process as needed, then shutting down.

xiaodai 1830 days ago

I predicted 99.99%.

paulbaumgart 1830 days ago

Is there an alternative you’d recommend?

thundergolfer 1830 days ago

Check out Frank McSherry’s COST (Configuration that Outperforms a Single Thread) and see if you are just better off with a single fat machine[1].

1. https://www.usenix.org/system/files/conference/hotos15/hotos...
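For anyone who wants to apply the idea to their own job, here is a rough sketch of how the COST number could be computed from timings you collect yourself: it is the smallest core count at which the distributed system beats a competent single-threaded baseline, and it is undefined if the system never does. The function name `cost` and all of the timings below are made up for illustration; they are not taken from the paper.

```python
from typing import Mapping, Optional


def cost(single_thread_seconds: float,
         system_seconds_by_cores: Mapping[int, float]) -> Optional[int]:
    """Return the smallest core count at which the system is faster than
    the single-threaded baseline, or None if it never is."""
    for cores in sorted(system_seconds_by_cores):
        if system_seconds_by_cores[cores] < single_thread_seconds:
            return cores
    return None


# Hypothetical measurements: core count -> wall-clock seconds for the same job.
measurements = {1: 900.0, 8: 400.0, 16: 250.0, 64: 120.0, 128: 90.0}

# Against a 300 s single-threaded run, the system first wins at 16 cores.
print(cost(300.0, measurements))

# Against a 50 s single-threaded run, no measured configuration wins.
print(cost(50.0, measurements))
```

If the answer is a large number of cores, or None, that is a hint the single fat machine is the better deal for that workload.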

KptMarchewa 1829 days ago

The premise of the article is very true, but the comparison itself is very biased and dishonest.

Graph problems are famously hard to scale horizontally, and represent a very small percentage of what people use those big data systems for. Especially if you can fit the data in RAM...

Anyway, if you're able to run your workload on a single machine, then definitely do it.

thundergolfer 1829 days ago

I basically agree with you. I linked COST because of the premise and the upshot of the paper, which are totally valid.

ngc248 1829 days ago

Spark is still the best for stream processing use cases, and if you have enough volume of data coming in, something like Spark is still the best for batch processing.

KptMarchewa 1829 days ago

>Spark is still the best for stream processing use cases

No, Flink is much better.
