by conjectures
3777 days ago
It's an MCMC algorithm for a fancy kind of matrix factorisation. I had a look at Spark, but its linear algebra packages seemed too limited (I guess abstraction comes at a cost). I can see that Spark would be nice if it does what you need out of the box. Heard good things about Scala, is it straightforward to get a process on a remote machine to execute code?
Did you look at MLlib and/or just using Breeze directly? There's a bit of awkwardness in the initial setup of the cluster (mainly just having LAPACK installed on all nodes, see https://spark.apache.org/docs/1.1.0/mllib-guide.html ). Spark itself is essentially just sugar to let you write a map/reduce in natural Scala style and have it distributed across a cluster - it'll only work if you can factor your algorithm in a way that fits into that paradigm. (I've heard arguments that it's possible to do that with any distributable algorithm if you're clever enough, but I'm not sure I believe them.)
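To make "fits into that paradigm" a bit more concrete, here's a rough sketch in plain Python (no Spark involved; the function names and the toy data are made up for illustration). The idea is that each partition is summarised independently (the map), and the summaries are merged with an associative, commutative function (the reduce), so it doesn't matter which machine handles which partition or in what order the partial results arrive.

```python
from functools import reduce

def partition_stats(rows):
    """Map step: summarise one partition as (count, sum, sum of squares)."""
    n = len(rows)
    s = sum(rows)
    ss = sum(x * x for x in rows)
    return (n, s, ss)

def combine(a, b):
    """Reduce step: element-wise addition of two summaries.

    Addition is associative and commutative, so partial results can be
    merged in any order, which is what lets the work be distributed.
    """
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mean_and_variance(partitions):
    """Combine per-partition summaries into a global mean and (population) variance."""
    n, s, ss = reduce(combine, map(partition_stats, partitions))
    mean = s / n
    return mean, ss / n - mean * mean

# Hypothetical data, pretending each inner list lives on a different node.
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
mean, variance = mean_and_variance(partitions)
```

If a step of your algorithm can be written like this (independent per-chunk work plus an order-insensitive merge), it should map onto Spark fairly naturally. A sampler where each update depends on the result of the previous one is harder to squeeze into that shape.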
> Heard good things about Scala, is it straightforward to get a process on a remote machine to execute code?
Honestly, no. I love the language but Spark is very much what I think of (perhaps unfairly) as typical scientific software. Spark clusters are finicky - they're cobbled together from a few unrelated projects (especially for cases where you need LAPACK as well), and it shows, especially when it comes to updating them. There are a few organizations like Cloudera (I think there was an open-source effort under the Apache umbrella somewhere too) that try to provide a working package, and various efforts with Puppet/Chef/etc. to automate the process of putting a cluster together, and it's certainly a lot better than it was even a few years ago, but a cluster still needs at least a little bit of dedicated sysadmin time (or, at a bare minimum, a programmer with a bit of *nix admin experience who's willing to get their hands dirty - that was me at times) to keep it running reliably.
If you're part of an institution that already maintains a Spark cluster - or maintains an ordinary Hadoop cluster and you're friendly enough with the sysadmins to suggest they install it - it's wonderful. If you're having to do it all from scratch, I won't lie: it's going to involve a lot of fiddling and may well not be worth it for your problem.