by threeseed 2899 days ago
I've been using Spark for many years going back to 1.0.

It is the foundational technology of almost every data science team around the world. And your misguided post (which for some weird reason only focuses on graph algorithms) doesn't change that. And I'm not sure why you think it's inefficient. We run 30 node, autoscaling clusters which stay close to 100% for most of the time.

2 comments

> We run 30 node, autoscaling clusters which stay close to 100% for most of the time.

I have exactly zero knowledge of Spark's efficiency, and just as little of how representative graph algorithms are, but I can confidently say that the above statement fails to refute the referenced article's thesis (which, arguably, criticizes assertions just like that).

The fact that your implementation scales (even autoscales) to use more compute resources says nothing about its efficiency, either overall or marginal as more nodes are added (i.e. the shape of the scaling curve).
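To make the distinction concrete, here is a minimal sketch of speedup and parallel efficiency. The timings are made-up numbers for illustration only, not measurements of Spark or of any real cluster, and the function names are hypothetical.

```python
def speedup(t_single: float, t_cluster: float) -> float:
    """How many times faster the cluster run is than the single-node run."""
    return t_single / t_cluster


def parallel_efficiency(t_single: float, t_cluster: float, nodes: int) -> float:
    """Speedup divided by node count; 1.0 would be perfectly linear scaling."""
    return speedup(t_single, t_cluster) / nodes


# Hypothetical wall-clock times in seconds, keyed by node count (assumed values).
timings = {1: 1000.0, 10: 200.0, 30: 125.0}

t1 = timings[1]
report = {
    n: (speedup(t1, t), parallel_efficiency(t1, t, n))
    for n, t in timings.items()
}

for n, (s, e) in report.items():
    print(f"{n:>3} nodes: speedup {s:.1f}x, efficiency {e:.0%}")
```

In this made-up case every node could be fully busy the whole time and the job would still get less useful work out of each node as the cluster grows, which is why a utilization figure alone does not settle the question.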

Computer science has struggled to achieve even near-linear scalability ever since the advent of SMP.

Spark is significantly more efficient than Hadoop.

I don’t know about your specific workload, but I’ve seen quite a few Hadoop setups that were at 100% load most of the time, and were replaced by relatively simple non-Hadoop-based code that used 2% to 10% of the hardware and ran about as fast.

I didn’t spend much time evaluating the “pre” (Hadoop) setups, but at least one workload spent 90% of that 100% load on [de]serialization.
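As rough arithmetic only: if one assumes the [de]serialization overhead could be removed entirely and the remaining work stayed the same (both are assumptions, and the function below is a hypothetical helper, not anything from Hadoop or Spark), the useful share of the hardware works out like this:

```python
def useful_hardware_fraction(utilization: float, overhead_share: float) -> float:
    """Fraction of the hardware doing useful work.

    utilization:    fraction of the hardware that is busy (1.0 means 100% load)
    overhead_share: fraction of that busy time spent on overhead such as
                    [de]serialization
    """
    return utilization * (1.0 - overhead_share)


# Figures taken from the comment above: 100% load, 90% of it [de]serialization.
fraction = useful_hardware_fraction(utilization=1.0, overhead_share=0.9)
print(f"useful work occupies about {fraction:.0%} of the hardware")
```

That lands at the upper end of the 2% to 10% range mentioned earlier, for this one workload and under those assumptions.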

It’s not my link; it is Frank McSherry who is commenting in this thread. I hope he can chime in on why he chose this specific example, but it correlates very well with my experience.