by Radim 3108 days ago
Getting better obviously, but the feet-on-the-ground experience for MLlib is still far from pleasant: hard to configure, hard to manage, hard to scale, hard to debug.

By way of anecdote, Spark's MLlib used to contain an implementation of word2vec that failed when used on more than 2 billion words (some arcane integer overflow). So much for scale!
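To illustrate the kind of failure this could be: a signed 32-bit integer tops out at 2,147,483,647, just over 2 billion, so a counter of that width wraps to a negative number one step past the limit. The sketch below is only an assumption about the general mechanism, not MLlib's actual code; `add_int32` and `words_seen` are hypothetical names.

```python
import ctypes

# Largest value a signed 32-bit integer can hold (about 2.1 billion).
INT32_MAX = 2**31 - 1


def add_int32(a: int, b: int) -> int:
    """Add two numbers the way a signed 32-bit counter would, wrapping on overflow."""
    # ctypes integer types do no overflow checking, so the sum is truncated to 32 bits.
    return ctypes.c_int32(a + b).value


# Hypothetical running word count stored in a 32-bit int (illustration only).
words_seen = INT32_MAX
words_seen = add_int32(words_seen, 1)
print(words_seen)  # -2147483648
```

Python's own integers never overflow, which is why the wrap has to be simulated here; on the JVM a plain `Int` behaves this way by default.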

As for performance, in 2016, the break-even point where a Spark cluster started being competitive with a single-machine implementation was around 12 Spark machines (a bit of a hindrance to rapid iterative development, which is the cornerstone of R&D): https://radimrehurek.com/florence15.pdf

2 comments

Can you be more specific about the issues with MLlib? I'm thinking of using it with Spark because of big data requirements, but have heard MLlib in particular is highly unreliable.
lol, that PDF is referencing Spark 1.3 from March 2015, and to say that you need 12 modern Spark machines to break even with one machine running a non-distributed ML framework is ridiculously wrong. And he ran Spark on EMR, which was pretty unoptimized back then.