by gwern 2235 days ago
Eh. It's built on Transformers, and people have already demonstrated considerable model distillation/compression on those, just like on every other kind of NN. As they note, once you've trained a teacher model, you can probably train a wide, flat model for similar results. (As I recall, WaveNet used to be similarly slow, but even without the Parallel WaveNet retraining, proper caching of repeated states could make it orders of magnitude faster and approach realtime.)
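
To make the caching point concrete, here is a rough sketch of the idea on a toy autoregressive stack of dilated causal layers. Everything in it is an assumption for illustration: the dilations, the two shared scalar weights, and the function names are made up, and it is not WaveNet's actual architecture. The naive generator recomputes the whole receptive-field tree for every new sample, while the cached one keeps a small queue of past activations per layer and does one unit of work per layer per sample.

```python
import math
from collections import deque

# Hypothetical toy setup: 4 layers, kernel size 2, scalar activations.
DILATIONS = [1, 2, 4, 8]
W_PAST, W_NOW = 0.5, -0.8  # made-up weights, shared by all layers


def naive_generate(seed, n):
    """Recompute every layer's dependencies from scratch for each sample."""
    xs = [seed]  # xs[t] is the input at step t (the previous output)
    calls = 0

    def h(layer, t):
        nonlocal calls
        if t < 0:
            return 0.0  # zero padding before the start of the sequence
        if layer == 0:
            return xs[t]
        calls += 1
        d = DILATIONS[layer - 1]
        return math.tanh(W_PAST * h(layer - 1, t - d) + W_NOW * h(layer - 1, t))

    out = []
    for t in range(n):
        y = h(len(DILATIONS), t)
        out.append(y)
        xs.append(y)  # feed the sample back in as the next input
    return out, calls


def cached_generate(seed, n):
    """Same model, but each layer remembers its last `d` inputs in a queue."""
    queues = [deque([0.0] * d) for d in DILATIONS]
    x, out, calls = seed, [], 0
    for _ in range(n):
        h = x
        for q in queues:
            past = q.popleft()  # this layer's input from d steps ago
            q.append(h)         # store the current input for d steps later
            h = math.tanh(W_PAST * past + W_NOW * h)
            calls += 1
        out.append(h)
        x = h
    return out, calls


if __name__ == "__main__":
    slow, slow_calls = naive_generate(0.3, 32)
    fast, fast_calls = cached_generate(0.3, 32)
    same = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(slow, fast))
    print("outputs match:", same)
    print("layer evaluations, naive vs cached:", slow_calls, fast_calls)
```

The cached version costs one evaluation per layer per sample, whereas the naive cost per sample grows with the size of the dependency tree, so the gap widens as the stack gets deeper. How large the speedup is on a real model depends on its depth and on implementation details this sketch ignores.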