by sillysaurusx
1960 days ago
Performance is rarely the issue, at least for us. The problem is: when the algorithms don’t work, then what? An update from a batch of size 1 is not equivalent to an update from the average over a larger batch. It’s why the BigGAN paper reports “bigger batches = better FID.” This proposal gives you the advantages of a small batch size (and there are advantages) without sacrificing the option of large batches.
Still, having to process the examples in a batch sequentially seems like a very costly concession to make. Traditionally, the reason to use batches has been that GPU-style parallelism makes them cheap. If you take away that reason by making the computation sequential, large batches become much harder to justify. Moreover, it's not clear what you gain by making the computation sequential in this way -- do you think Adam actually has trouble keeping up with the mean/variance of the gradients, so that it needs more frequent updates? I would be surprised if so.
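
For anyone who wants to poke at the mean/variance question, here is a minimal sketch of what changes when Adam's second-moment estimate is fed per-example gradients instead of batch-averaged ones. It is not the proposal's implementation: the scalar Gaussian "gradients", the function names `adam_second_moment` and `compare`, and every default value are assumptions made up for illustration. It only tracks the running estimate and says nothing about which regime trains better.

```python
import random


def adam_second_moment(grads, beta2=0.999):
    """Bias-corrected running average of squared gradients (Adam's v-hat)."""
    grads = list(grads)
    if not grads:
        raise ValueError("need at least one gradient")
    v = 0.0
    for g in grads:
        v = beta2 * v + (1.0 - beta2) * g * g
    # Standard Adam bias correction after len(grads) updates.
    return v / (1.0 - beta2 ** len(grads))


def compare(batch_size=32, num_batches=2000, mean=0.1, std=1.0, seed=0):
    """Return (v-hat from per-example updates, v-hat from batch-mean updates)."""
    rng = random.Random(seed)
    # Hypothetical per-example gradients: i.i.d. Gaussian scalars.
    per_example = [rng.gauss(mean, std) for _ in range(batch_size * num_batches)]
    # The same gradients, averaged within each batch before Adam sees them.
    batch_means = [
        sum(per_example[i:i + batch_size]) / batch_size
        for i in range(0, len(per_example), batch_size)
    ]
    return adam_second_moment(per_example), adam_second_moment(batch_means)


v_seq, v_batch = compare()
print(f"per-example v-hat: {v_seq:.4f}")
print(f"batch-mean  v-hat: {v_batch:.4f}")
```

Under these assumptions the per-example estimate should settle near mean² + std², while the batch-mean estimate should settle near mean² + std²/batch_size, because averaging removes most of the per-example variance before Adam ever sees it. So the two schemes hand Adam different statistics, not just more frequent updates of the same one; whether that difference is worth serializing the batch is a separate question.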