Hacker News new | ask | show | jobs
by Eugr 3670 days ago
I agree with author that [in most cases] you don't need distributed processing for your algorithms. But sometimes you do, and when you do need it you have to understand that there is no silver bullet.

Creating a distributed system is very difficult, even when using platforms like Spark. Not all algorithms can be scaled easily or scaled at all, and not all algorithms in Spark MLLib or GraphX are actually designed to be truly distributed or work equally well for all use cases/data.

We tried to implement one of our algorithms (written in Java) that was taking hours on a single machine (even when using all the cores) using methods from Spark MLLib just to find that Spark job was constantly crashing. Turned out that some of the functions just fetch all the data to the "driver" instance and calculate the result there.

My guess this is what happens with author's use case - yes, he ran it on Spark, but only one node ended up crunching all numbers. And/or network overhead of course.

After we found out that MLLib can't give us what we need, we reimplemented it from scratch in Scala, making sure we balance the load equally and don't have too much network (shuffle) traffic between the nodes.

As a result, we went from 2.5 hours on a single machine, to under 2 minutes on a cluster of 25 instances (same Ivy Bridge processor, just more cores per node). The algorithm scaled almost linearly, but it required carefully designing it with Spark specifics in mind.

1 comments

NLLib is ostensibly open source, why not improve it? Was your final solution too specialized?
Yes, we ended up implementing just project specific parts, not generic enough for MLLib contribution...