by ActivePattern
252 days ago
I don't think you've understood the paper.

- There are no experts. The outputs approximate random samples from the distribution.
- There is no latent diffusion going on. It uses convolutions, similar to a GAN.
- At inference time, you select the sample index ahead of time, so you don't discard any computation.
Supplement for @f_devd:
During training, the K outputs share the stem feature produced by the NN blocks, so generating all K outputs adds only a small amount of computation. Discarding the other K-1 outputs after L2-distance sampling therefore costs almost nothing, unlike discarding K-1 MoE experts, which would be very expensive.
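
To make the cost argument concrete, here is a minimal pure-Python sketch, not the paper's code. It assumes that "L2-distance sampling" means keeping the output nearest to the training target. The names `stem`, `k_outputs`, `select_nearest` and the toy offset heads are all hypothetical stand-ins; the real model uses convolutional blocks.

```python
# Counter used only to show how often the expensive shared part runs.
calls = {"stem": 0}

def stem(x):
    # Stand-in for the expensive shared NN blocks (hypothetical toy function).
    calls["stem"] += 1
    return [2.0 * v + 1.0 for v in x]

def k_outputs(x, offsets):
    # One stem pass is shared by all K outputs; each head is a cheap offset.
    feature = stem(x)
    return [[f + b for f in feature] for b in offsets]

def l2_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

def select_nearest(outputs, target):
    # Assumed reading of L2-distance sampling: keep the output closest to the target.
    distances = [l2_distance(o, target) for o in outputs]
    best = min(range(len(outputs)), key=distances.__getitem__)
    return best, outputs[best]

# K = 3 toy heads on one input; the other K-1 outputs are simply dropped.
outputs = k_outputs([0.0, 1.0], offsets=[-1.0, 0.0, 0.5])
index, kept = select_nearest(outputs, target=[1.0, 3.0])
```

In this sketch the stem runs once regardless of K, so the discarded outputs only waste the cheap per-head work, which is the point being made about MoE experts above.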