by koningrobot
1958 days ago
We don't compute per-example gradients, so your second code snippet would not contain a loop over examples. We compute the batch-averaged gradient in roughly the time it would take to compute a single example's gradient. That makes it much more efficient than your proposal, which is equivalent to using a batch size of 1.
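
The identity behind this point can be sketched in NumPy. This is only an illustration under assumptions that are not from the thread: a toy linear least-squares model, and the hypothetical helper names `batch_mean_gradient` and `looped_mean_gradient`. It shows that one vectorized pass yields the same averaged gradient as an explicit loop over examples. It does not measure timing.

```python
import numpy as np

def batch_mean_gradient(w, X, y):
    """Gradient of mean(0.5 * (X @ w - y)**2) w.r.t. w, with no loop over examples."""
    residual = X @ w - y              # shape (batch,)
    return X.T @ residual / len(y)    # shape (features,)

def looped_mean_gradient(w, X, y):
    """Same quantity, built by averaging explicit per-example gradients."""
    per_example = [(x @ w - t) * x for x, t in zip(X, y)]
    return np.mean(per_example, axis=0)

# Toy data (assumed for illustration): 8 examples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = rng.normal(size=3)

# True when both routes agree up to floating-point error.
same = np.allclose(batch_mean_gradient(w, X, y), looped_mean_gradient(w, X, y))
```

The vectorized version never materializes the individual per-example gradients, which is why frameworks can return the batch average without a Python-level loop.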
A batch size of 1 != an average over a larger batch. It’s why the BigGAN paper reports “bigger batches = better FID.”
This proposal gives the advantage of a small batch size (and there are advantages) without sacrificing the option of large batches.
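
A minimal sketch of why the two are not interchangeable, assuming plain SGD on a toy linear least-squares model (the model, the data, and the helper names are hypothetical and not from the thread). A single step on the batch-averaged gradient evaluates every example's gradient at the same weights. A sequence of batch-size-1 steps evaluates each gradient at weights already moved by the previous examples.

```python
import numpy as np

def example_gradient(w, x, t):
    """Gradient of 0.5 * (x @ w - t)**2 w.r.t. w for a single example."""
    return (x @ w - t) * x

def one_large_batch_step(w, X, y, lr):
    """One SGD step using the gradient averaged over the whole batch."""
    g = np.mean([example_gradient(w, x, t) for x, t in zip(X, y)], axis=0)
    return w - lr * g

def many_single_example_steps(w, X, y, lr):
    """One SGD step per example (batch size 1), with lr scaled by 1/n
    so that the total step budget matches the large-batch step."""
    step = lr / len(y)
    for x, t in zip(X, y):
        w = w - step * example_gradient(w, x, t)
    return w

# Toy data (assumed for illustration): 8 examples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w0 = rng.normal(size=3)

w_batch = one_large_batch_step(w0, X, y, lr=0.1)
w_single = many_single_example_steps(w0, X, y, lr=0.1)

# The two procedures land on different weights.
differ = not np.allclose(w_batch, w_single)
```

The sketch only shows that the update trajectories differ. It says nothing about which regime trains better on a given problem.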