|
|
|
|
|
by EdwardDiego
3118 days ago
|
|
Can anyone comment on his point about Spark's ML libs? I note that was from last year (about 2015 code), not sure what level of beta they were at, but yeah, I use it for batch processing, but have never used the ML aspects, so just curious. > And even up to last year, there’s just massive bugs in the machine learning libraries that come bundled with Spark. It’s so bizarre, because you go to Caltrain, and there’s a giant banner showing a cool-looking data scientist peering at computers in some cool ways, advertising Spark, which is a platform that in my day job I know is just barely usable at best, or at worst, actively misleading. |
|
By way of anecdote, Spark's MLlib used to contain an implementation of word2vec that failed when used on more than 2 billion words (some arcane integer overflow). So much for scale!
As for performance, in 2016, the break-even point where a Spark cluster started being competitive with a single-machine implementation was around 12 Spark machines (a bit of a hindrance to rapid iterative development, which is the corner stone of R&D): https://radimrehurek.com/florence15.pdf