by utopcell 3484 days ago
The opposite is actually true.

If you train on the large dataset produced by demo-train-big-model-v1.sh (which includes news corpora from 2012 and 2013, the 1-billion-word dataset from statmt, the UMBC web corpus, and all of Wikipedia) using only one thread, accuracy on the Google analogy dataset drops to 68%, down from ~71.5% with 20 threads.

This is due to the learning rate schedule: the learning rate decreases linearly with the number of words processed. When K threads are used, the input dataset is split into K parts that are processed in parallel, so more parts of the dataset get a chance to influence the resulting vectors early in the computation, while the learning rate is still relatively high.
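To make the effect concrete, here is a minimal sketch of a linearly decaying learning rate driven by a shared word counter. It is not the word2vec code: the function names, the starting rate of 0.025, the floor of 1e-4, and the assumption that threads advance in lockstep are all illustrative choices.

```python
def linear_lr(words_done, total_words, start_lr=0.025, min_frac=1e-4):
    """Learning rate that decays linearly with the global count of processed words."""
    frac = 1.0 - words_done / total_words
    return start_lr * max(frac, min_frac)  # assumed floor so the rate never reaches 0


def mean_lr_per_part(total_words, num_threads, num_parts=4):
    """Average learning rate seen by each of `num_parts` equal slices of the corpus.

    Assumption: threads advance in lockstep, one word each per step, and share
    a single global counter of processed words.
    """
    chunk = total_words // num_threads   # words handled by each thread
    part_size = total_words // num_parts
    sums = [0.0] * num_parts
    counts = [0] * num_parts
    processed = 0
    for step in range(chunk):
        for t in range(num_threads):
            pos = t * chunk + step       # corpus position of this word
            part = min(pos // part_size, num_parts - 1)
            sums[part] += linear_lr(processed, total_words)
            counts[part] += 1
            processed += 1
    return [s / c for s, c in zip(sums, counts)]


single = mean_lr_per_part(4000, num_threads=1)
multi = mean_lr_per_part(4000, num_threads=4)
print("1 thread :", [round(x, 4) for x in single])
print("4 threads:", [round(x, 4) for x in multi])
```

With one thread, the first quarter of the corpus is trained at a much higher average rate than the last quarter. With four threads, every quarter sees roughly the same average rate, because each one starts while the rate is still high.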


No, it is not. As the number of cores approaches infinity, validation accuracy will approach zero because shared memory is not locked. There is definitely a sweet spot in the number of cores for the original code, but it does not scale indefinitely, so it cannot make use of an arbitrary number of cores.
Aligned float loads and stores are atomic on all architectures that matter. Also, unsynchronized parameter updates for SGD have been studied in [1], which showed that they barely affect convergence when updates are sparse.

In the limit, performance would indeed suffer as all updates would happen in parallel.

[1] Recht, Benjamin, et al. "Hogwild: A lock-free approach to parallelizing stochastic gradient descent." Advances in Neural Information Processing Systems. 2011.
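As a rough illustration of both points (sparse unsynchronized updates are mostly harmless, but they break down when too many happen at once), here is a deterministic, single-threaded emulation. It is not the Hogwild algorithm or the word2vec code: the toy objective, the function name, and all parameter values are assumptions chosen for the sketch. In each round, every emulated worker reads the same stale snapshot of the parameters, picks one coordinate, and applies its update without coordination.

```python
import random


def stale_sgd_error(num_workers, dim, rounds, lr=0.1, seed=0):
    """Return max |w_i - target_i| after `rounds` rounds of emulated lock-free SGD.

    Toy objective (an assumption): sum over i of 0.5 * (w_i - target_i)^2.
    """
    rng = random.Random(seed)
    target = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    w = [0.0] * dim
    for _ in range(rounds):
        snapshot = list(w)  # every worker reads the same stale parameters
        for _ in range(num_workers):
            i = rng.randrange(dim)          # sparse update: a single coordinate
            grad = snapshot[i] - target[i]  # gradient computed from the stale read
            w[i] -= lr * grad               # applied with no locking
    return max(abs(wi - ti) for wi, ti in zip(w, target))


# Few workers, many coordinates: collisions are rare and the error shrinks.
print(stale_sgd_error(num_workers=4, dim=50, rounds=2000))
# Many workers, few coordinates: stale updates pile up and the error grows.
print(stale_sgd_error(num_workers=200, dim=5, rounds=20))
```

When several workers hit the same coordinate in one round, their stale gradients add up, which acts like a proportionally larger step size. That is harmless while collisions are rare and destabilizing once they dominate.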

There's another paper, describing the "Hogbatch" approach, that shows more precisely how adding cores affects accuracy: http://www.ece.ubc.ca/~matei/papers/ipdps16.pdf

The summary would be that accuracy per pass suffers slightly, but since the speedup is close to linear for the first dozen or so cores, each pass is much faster to run. The result is that the wall time to achieve a given level of accuracy is much shorter despite the slightly lower accuracy per pass.

That just sounds like a poorly tuned algorithm or learning rate in the single-threaded case. You could certainly emulate the parallel algorithm sequentially if you wanted to.
True, you could definitely emulate it by shuffling the input, but that's not what is being done unfortunately.
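One way such an emulation could look, as a sketch only: instead of a random shuffle, split the input into K contiguous chunks and visit them round-robin, which is the order K lockstep threads would process the words in. The function name and the lockstep assumption are illustrative and not taken from the original code.

```python
def interleaved_order(words, num_threads):
    """Yield words in the order K lockstep threads would visit them."""
    chunk = -(-len(words) // num_threads)  # ceiling division
    chunks = [words[i * chunk:(i + 1) * chunk] for i in range(num_threads)]
    for step in range(chunk):
        for part in chunks:
            if step < len(part):  # the last chunk may be shorter
                yield part[step]


print(list(interleaved_order(list(range(6)), 2)))  # [0, 3, 1, 4, 2, 5]
```

Feeding this order to a single-threaded trainer would expose every chunk to the high early learning rate, much as the multi-threaded run does.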