by vvolhejn
247 days ago
I don't know about linear models, but this kind of hierarchical modelling is quite a common idea in speech research. For example, OpenAI's Jukebox (2020) [1], which uses a proto-neural audio codec, has three levels of encoding that get coarser and coarser. They use a language model to predict continuations at the coarsest level, then separate models to upsample to the finer levels and finally back to audio. The recent MiMo-Audio bunches tokens into "patches" of four timesteps and has the model predict those [2].

[1] https://arxiv.org/abs/2005.00341

[2] https://github.com/XiaomiMiMo/MiMo-Audio/blob/main/MiMo-Audi...
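To make the "patches" idea concrete, here is a minimal sketch of grouping a flat token sequence into blocks of four timesteps, so that a model sees one patch per step instead of one token. This is not MiMo-Audio's actual code: the function name `to_patches`, the `pad_id` value, and the choice to pad the tail are all assumptions made for illustration.

```python
def to_patches(tokens, patch_size=4, pad_id=0):
    """Group a flat token sequence into fixed-size patches.

    Hypothetical helper: the tail is padded with `pad_id` (an assumed
    convention) so that every patch has exactly `patch_size` entries.
    """
    tokens = list(tokens)
    remainder = len(tokens) % patch_size
    if remainder:
        tokens += [pad_id] * (patch_size - remainder)
    # Each patch covers `patch_size` consecutive timesteps.
    return [
        tuple(tokens[i:i + patch_size])
        for i in range(0, len(tokens), patch_size)
    ]


# Ten timesteps become three patches, so the sequence the model
# has to predict over is roughly four times shorter.
tokens = [11, 12, 13, 14, 21, 22, 23, 24, 31, 32]
patches = to_patches(tokens)
print(len(tokens), "tokens ->", len(patches), "patches")
```

The point is only the change in sequence length: the coarse model works over patches, and something else has to expand each patch back into its individual tokens.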