| HN Mirror


by conjectures 3777 days ago

It's an MCMC algorithm for a fancy kind of matrix factorisation.

I had a look at Spark, but its linear algebra packages seemed too limited (I guess abstraction comes at a cost). I can see that Spark would be nice if it does what you need out of the box.

Heard good things about Scala, is it straightforward to get a process on a remote machine to execute code?


lmm 3777 days ago

> I had a look at Spark, but its linear algebra packages seemed too limited (I guess abstraction comes at a cost). I can see that Spark would be nice if it does what you need out of the box.

Did you look at MLlib and/or just using Breeze directly? There's a bit of awkwardness in the initial setup of the cluster (mainly just having LAPACK installed on all nodes, see https://spark.apache.org/docs/1.1.0/mllib-guide.html ). Spark itself is essentially just sugar to let you write a map/reduce in natural Scala style and have it distributed across a cluster - it'll only work if you can factor your algorithm in a way that fits into that paradigm. (I've heard arguments that it's possible to do that with any distributable algorithm if you're clever enough, but I'm not sure I believe them.)
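As a rough illustration of what "factoring an algorithm into map/reduce" can look like for linear algebra, here is a minimal sketch in plain Python (not Spark or Scala, and no cluster involved). It computes the Gram matrix X^T X by mapping over chunks of rows and summing the partial results. The function names `partition`, `local_gram`, `add_matrices` and `gram` are invented for this example, and the input is assumed to be a non-empty list of equal-length rows.

```python
from functools import reduce


def partition(rows, n_parts):
    """Split rows into contiguous chunks (a stand-in for cluster partitions)."""
    size = -(-len(rows) // n_parts)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def local_gram(chunk):
    """Map step: partial Gram matrix chunk^T chunk for one chunk of rows."""
    d = len(chunk[0])
    g = [[0.0] * d for _ in range(d)]
    for row in chunk:
        for i in range(d):
            for j in range(d):
                g[i][j] += row[i] * row[j]
    return g


def add_matrices(a, b):
    """Reduce step: elementwise sum of two partial Gram matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def gram(rows, n_parts=2):
    """X^T X as a map over chunks followed by an associative reduce."""
    return reduce(add_matrices, map(local_gram, partition(rows, n_parts)))


X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
G = gram(X, n_parts=2)
```

The reduce step is associative and commutative, which is what lets a framework combine partial results in any order. Algorithms whose steps depend on a shared, frequently updated state are harder to express this way.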

> Heard good things about Scala, is it straightforward to get a process on a remote machine to execute code?

Honestly, no. I love the language but Spark is very much what I think of (perhaps unfairly) as typical scientific software. Spark clusters are finicky - they're cobbled together from a few unrelated projects (especially for cases where you need LAPACK as well), and it shows, especially when it comes to updating them. There are a few organizations like Cloudera (I think there was an open-source effort under the Apache umbrella somewhere too) that try to provide a working package, and various efforts with Puppet/Chef/etc. to automate the process of putting a cluster together, and it's certainly a lot better than it was even a few years ago, but a cluster still needs at least a little bit of dedicated sysadmin time (or, at a bare minimum, a programmer with a bit of *nix admin experience who's willing to get their hands dirty - that was me at times) to keep it running reliably.

If you're part of an institution that already maintains a Spark cluster - or maintains an ordinary Hadoop cluster and you're friendly enough with the sysadmins to suggest they install it - it's wonderful. If you're having to do it all from scratch I won't lie, it's going to involve a lot of fiddling and may well not be worth it for your problem.


acidflask 3777 days ago

Most people don't need more than a handful of linear algebra operations (or think they don't), so Breeze and most wrappers of LAPACK or similar libraries don't implement or wrap them. But most people who work seriously on numerical routines will quickly run into performance problems if all they do is call LAPACK routines for general matrices instead of taking advantage of matrix structure.

I have yet to come across any other linear algebra library for any other high-level language that provides the depth of integration available in the Julia base library. Want all eigenvalues of a symmetric tridiagonal 10x10 matrix between 1.0 and 12.0? Simply call T=SymTridiagonal(randn(10), randn(9)); eigvals(T, 1.0, 12.0). Or if you want to work closer to LAPACK, simply call LAPACK.stein!. I don't see a wrapper in Breeze or SciPy for this function. Want an LU factorization on a matrix of high-precision floats? lufact(big(randn(5,4))). And so on.

Julia may not have everything users want, but its base library really tries to make matrix computations easy and accessible.
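To make the point about exploiting matrix structure concrete, here is a small sketch in plain Python (not Julia, and not the LAPACK routine itself). It counts how many eigenvalues of a symmetric tridiagonal matrix fall in an interval using a Sturm-sequence style recurrence, which costs O(n) per count instead of a full dense eigendecomposition. The names `count_below` and `count_in_interval` and the `tiny` pivot guard are assumptions made for this illustration.

```python
def count_below(diag, off, x, tiny=1e-300):
    """Count eigenvalues strictly below x for a symmetric tridiagonal matrix.

    diag holds the n diagonal entries and off the n-1 off-diagonal entries.
    The pivots q_i of the LDL^T factorisation of (T - x*I) are computed by
    recurrence; the number of negative pivots equals the number of
    eigenvalues below x (Sylvester's law of inertia).
    """
    count = 0
    q = diag[0] - x
    for i in range(len(diag)):
        if i > 0:
            q = diag[i] - x - off[i - 1] ** 2 / q
        if q == 0.0:
            q = -tiny  # nudge an exactly zero pivot to avoid dividing by zero
        if q < 0.0:
            count += 1
    return count


def count_in_interval(diag, off, lo, hi):
    """Count eigenvalues between lo and hi via two O(n) Sturm counts."""
    return count_below(diag, off, hi) - count_below(diag, off, lo)


# 3x3 example: diagonal 2, off-diagonal -1.
# Its eigenvalues are 2 - sqrt(2), 2 and 2 + sqrt(2).
diag = [2.0, 2.0, 2.0]
off = [-1.0, -1.0]
n_in_range = count_in_interval(diag, off, 1.0, 12.0)
```

Bisecting on `x` with the same count narrows down each individual eigenvalue, so only the two short arrays are ever stored rather than the full matrix.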


lmm 3777 days ago

Something like the Scala type system seems like the best way to keep track of that kind of structure information and make use of it (perhaps even transparently). I can easily believe the current wrappers aren't there yet though. (Afraid I switched jobs six months ago and haven't been using Breeze or Spark since, so I can't justify working on it myself at the moment)


conjectures 3777 days ago

+1 this kind of issue is why I went for Julia - the support for lin alg (including with CUDA) is very good indeed.

The other issue is that Julia gives fine-grained control over a cluster in a way something more abstract couldn't (after cobbling together a scripting-style map-reducer based on the default functionality: ClusterUtils.jl).


conjectures 3777 days ago

Cheers, this was interesting. I think my 'try Spark' button would get pushed if I had to do a big job using a standard method for a company, e.g. some massive GLM.
