For those interested in trying this out in Python:
* `gensim` includes a stochastic (randomized) SVD that scales to large datasets and supports fast online model training [2]
* I wrote a benchmark of (approximate) nearest neighbour libraries in Python [3]
[1] https://dl.dropboxusercontent.com/u/2143857/papers/topics.pd...
[2] https://github.com/piskvorky/gensim/
[3] http://radimrehurek.com/2013/12/performance-shootout-of-near...