HN Mirror



	by EdwardDiego 3118 days ago
	Can anyone comment on his point about Spark's ML libs? I note that it was from last year (about 2015 code), and I'm not sure what level of beta they were at. I use Spark for batch processing but have never used the ML aspects, so I'm just curious.

	> And even up to last year, there’s just massive bugs in the machine learning libraries that come bundled with Spark. It’s so bizarre, because you go to Caltrain, and there’s a giant banner showing a cool-looking data scientist peering at computers in some cool ways, advertising Spark, which is a platform that in my day job I know is just barely usable at best, or at worst, actively misleading.

1 comments

Radim 3118 days ago

Getting better obviously, but the feet-on-the-ground experience for MLlib is still far from pleasant: hard to configure, hard to manage, hard to scale, hard to debug.

By way of anecdote, Spark's MLlib used to contain an implementation of word2vec that failed when used on more than 2 billion words (some arcane integer overflow). So much for scale!
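
For anyone wondering why the limit would sit at roughly 2 billion: a signed 32-bit integer tops out at 2,147,483,647, so a word counter stored in one wraps to a negative number once it passes that. The snippet below is only an illustration of that wrap-around, not the actual MLlib code; the helper names (`to_int32`, `add_int32`) and the chunk sizes are made up for the example.

```python
# Illustration only: simulate a signed 32-bit counter (like a JVM Int)
# in Python, whose own integers never overflow.

INT32_MAX = 2**31 - 1  # 2,147,483,647


def to_int32(n: int) -> int:
    """Wrap an arbitrary integer into the signed 32-bit range."""
    return (n + 2**31) % 2**32 - 2**31


def add_int32(a: int, b: int) -> int:
    """Add two values the way a 32-bit signed counter would."""
    return to_int32(a + b)


# Hypothetical corpus processed in three chunks, 2.5 billion words total.
word_count = 0
for chunk_size in [1_000_000_000, 1_000_000_000, 500_000_000]:
    word_count = add_int32(word_count, chunk_size)

# The true total is 2,500,000,000, but the 32-bit counter has gone negative.
print(word_count)
```

Anything downstream that uses such a counter as an array size or a progress ratio would then misbehave, which is one plausible way a job could fail only on very large corpora.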

As for performance, in 2016, the break-even point where a Spark cluster started being competitive with a single-machine implementation was around 12 Spark machines (a bit of a hindrance to rapid iterative development, which is the cornerstone of R&D): https://radimrehurek.com/florence15.pdf


kwisatzh 3116 days ago

Can you be more specific about the issues with MLlib? I'm thinking of using it with Spark because of big data requirements, but I have heard that MLlib in particular is highly unreliable.


blueplastic 3117 days ago

lol, that PDF is referencing Spark 1.3 from March 2015, and to say that you need 12 modern Spark machines to break even with one machine running a non-distributed ML framework is ridiculously wrong. And he ran Spark on EMR, which was pretty unoptimized back then.
