by sp332
875 days ago
The extra precision is more useful for training. Once the network is optimized, it's a statistical model and only needs enough precision to make good guesses. In fact, one of the big papers on this also pointed out that you can drop about 40% of the weights completely. I think people generally skip that part because sparse matrix operations are slower, so it doesn't help here.
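
To make the two ideas concrete, here's a rough pure-Python sketch: symmetric 8-bit quantization of a weight vector, and zeroing the 40% of weights with the smallest magnitude. The weights are random stand-ins, the function names are made up for illustration, and magnitude-based pruning is my assumption about the selection rule, not necessarily what the paper did.

```python
import random

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest step and clamp to guard against float fuzz.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]

def prune_by_magnitude(weights, fraction=0.4):
    """Return a copy with the smallest-magnitude `fraction` set to zero."""
    k = int(len(weights) * fraction)
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned

# Hypothetical "trained" weights: just Gaussian noise for the demo.
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

pruned = prune_by_magnitude(weights, 0.4)
zeros = sum(1 for w in pruned if w == 0.0)

print("quantization step:", scale)
print("worst-case rounding error:", max_error)
print("weights zeroed:", zeros, "of", len(pruned))
```

The rounding error per weight is at most half a quantization step, which is the "enough precision to make good guesses" part. Note that the pruned list is still stored densely, so this alone saves no time or memory; you only benefit if you switch to a sparse format, and that's where the slower sparse operations come in.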