by utopcell
3484 days ago
The opposite is actually true. If you train on the big dataset produced by demo-train-big-model-v1.sh (which includes news corpora from 2012 and 2013, the 1BN word dataset from statmt, the UMBC web corpus, and all of Wikipedia) using only one thread, accuracy on the Google analogy dataset drops to 68%, down from ~71.5% with 20 threads. This is due to the learning rate schedule: the learning rate is reduced linearly with the number of words processed. When K threads are used, the input dataset is split into K parts that are processed in parallel, so more parts of the dataset get a chance to influence the resulting vectors early in the computation, while the learning rate is still relatively high.
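To make the effect concrete, here is a rough sketch of the schedule described above. It is not the actual word2vec code. It assumes a single pass over the data, that all K threads advance at the same speed, that the corpus divides evenly into K shards, and that the counter of processed words is shared across threads. The function name, the starting rate of 0.025, and the small floor on the rate are illustrative assumptions.

```python
def learning_rate_at(position, total_words, num_threads, start_alpha=0.025):
    """Approximate learning rate in effect when the word at `position` is processed.

    Hypothetical helper: assumes equal-speed threads, one epoch, and
    total_words divisible by num_threads.
    """
    shard_size = total_words // num_threads
    offset_in_shard = position % shard_size
    # Every thread has consumed about offset_in_shard words by this point,
    # so the shared counter is roughly num_threads times that offset.
    words_processed = offset_in_shard * num_threads
    alpha = start_alpha * (1 - words_processed / (total_words + 1))
    # Assumed floor so the rate never reaches zero.
    return max(alpha, start_alpha * 0.0001)


if __name__ == "__main__":
    total = 1000
    for threads in (1, 20):
        rates = [learning_rate_at(p, total, threads) for p in (0, 500, 900)]
        print(threads, [round(r, 5) for r in rates])
```

With one thread, a word 90% of the way through the corpus is only seen once the rate has decayed to about a tenth of its starting value. With 20 threads, that same word sits at the start of a shard and is seen at the full starting rate.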