The moment I heard the synopsis of the technique, I thought of one thing: Style transfer.
This kind of model should be really nice for translation and style transfer tasks: take an existing passage of text, noise it, and reverse the noise with guidance, as in an image diffusion model. It is a "movement" in latent space with a controllable amount of modification.
The diffusion process enables a wide range of "control" approaches not possible with current transformer models. Perhaps summarizing text can be done differently as well, taking an input and diffusing it into a shorter and shorter passage.
I've not been this hyped about a new method since GPT3 itself.
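To make the "noise it, then reverse it" idea concrete, here is a minimal sketch. It assumes masking is the noise process (as in masked diffusion language models; nothing in this thread says Mercury works this way), and `MASK`, `partially_mask`, `edit_with_strength` and the `denoise` callable are all hypothetical names standing in for a real model.

```python
import random
from typing import Callable, List, Optional

MASK = "<mask>"  # hypothetical mask symbol, not any real tokenizer's


def partially_mask(tokens: List[str], strength: float,
                   seed: Optional[int] = None) -> List[str]:
    """Replace a fraction `strength` of the tokens with MASK.

    strength=0.0 leaves the text untouched, strength=1.0 hides all of it,
    so `strength` plays the role of "how far to move" from the input.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    rng = random.Random(seed)
    count = round(strength * len(tokens))
    hidden = set(rng.sample(range(len(tokens)), count))
    return [MASK if i in hidden else tok for i, tok in enumerate(tokens)]


def edit_with_strength(tokens: List[str], strength: float,
                       denoise: Callable[[List[str]], List[str]],
                       seed: Optional[int] = None) -> List[str]:
    """Noise the input, then hand it to a (hypothetical) guided denoiser."""
    return denoise(partially_mask(tokens, strength, seed))
```

With a low `strength` most of the original wording survives and the denoiser only fills a few gaps; with a high `strength` it is free to rewrite nearly everything, which is the controllable knob described above.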
Volodymyr, congrats. This is crazy fast, if not super great at long-context coding tasks. I tagged a few problem responses.
I'm curious about something that has analogues in image diffusion models: depending on how they work through their latent space, you can sometimes see them try out a feature in an image and then move on from it as it fits less with what's around it.
Are there analogues for Mercury? Does it try a token or set of tokens and then move on from them as other parts of the response fill in? Similarly, this architecture seems like it would have real problems inserting a needed token in the middle of a bunch of relatively high-confidence generated tokens.
Can you give some insight / thoughts from the frontlines on these?
Good question! We are not open sourcing the models at launch time, but we have a roadmap of future releases in which we hope to make some of our models accessible to the research community.
Super cool, and I'd love to play around with this if they release an open source version.
Without a full paper, it's a bit hard to understand the full details. Does this essentially replace nucleus sampling with diffusion, or does it change the "core" transformer architecture in a major way?
Yes, we plan to release a tech report soon. We are not open sourcing the models at launch time, but we have a roadmap of future releases in which we hope to make some of our models accessible to the research community.
Probably it's not relevant to you commercially at the moment (or ever?), but I would love some intuition on how your models perform on really low-end hardware. Does this technique translate into improved CPU-only performance? Also curious about density: does the technique require more, fewer, or roughly the same parameters as a traditional LLM for the same output quality?
Great question! The model can leverage existing GPU hardware more efficiently: it performs more computation per unit of memory transferred. This means that on older hardware one should be able to get inference speeds similar to those of a classical LLM on recent hardware. This is actually interesting commercially, since it opens new ways of reducing AI inference costs.
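A back-of-envelope way to see the "computation per unit of memory transferred" point, as a rough sketch rather than anything measured. It assumes a dense model whose weights are read from memory once per forward pass, about 2 FLOPs per parameter per token, and it ignores KV-cache and activation traffic; `arithmetic_intensity` is a hypothetical helper name.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs performed per byte of weights read in one forward pass.

    Each parameter is read once (bytes_per_param bytes) and used for roughly
    one multiply and one add per token processed in that pass, so the
    parameter count cancels out of the ratio.
    """
    if tokens_per_pass < 1 or bytes_per_param <= 0:
        raise ValueError("need at least one token and a positive parameter size")
    flops_per_param = 2.0 * tokens_per_pass
    return flops_per_param / bytes_per_param


# One token per pass (classic autoregressive decoding, batch size 1, 16-bit weights)
print(arithmetic_intensity(1))
# Many tokens updated in the same pass
print(arithmetic_intensity(64))
```

Under these assumptions, decoding one token at a time does about 1 FLOP per byte of weights read, so the GPU mostly waits on memory; updating 64 tokens in the same pass raises that to about 64 FLOPs per byte for the same weight traffic.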
Assuming the model tracks convergence in one way or another, it would simply continue performing iterations until it has reached an error below an epsilon value.
This means that in the worst case the number of iterations is the same as a classic autoregressive transformer.
So they are mostly taking advantage of the fact that the average response is in reality not fully sequential, so the model is discovering the exploitable parallelism on its own.
This is not too dissimilar to a branch and bound algorithm that has a worse theoretical runtime than a simple brute force search, but in practice is solving the integer linear programming problem in almost polynomial time, because not everyone is encoding the hardest instances of problems in NP as integer linear programs.
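The "iterate until converged, worst case equals sequential" argument can be illustrated with Jacobi (fixed-point) decoding over a deterministic next-token function. This is a different, well-known technique used here purely as an analogy; the thread does not say Mercury works this way, and `next_token`, `jacobi_decode` and `sequential_decode` are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def sequential_decode(next_token: NextToken, n: int) -> List[int]:
    """Classic left-to-right decoding: n dependent steps."""
    out: List[int] = []
    for _ in range(n):
        out.append(next_token(out))
    return out


def jacobi_decode(next_token: NextToken, n: int, pad: int = 0) -> Tuple[List[int], int]:
    """Refine all n positions at once until nothing changes.

    Every sweep recomputes each position from the previous sweep's guess,
    so the n calls inside one sweep are independent and could run in parallel.
    After sweep k the first k tokens are final, hence at most n sweeps.
    Returns the tokens and the number of sweeps performed.
    """
    guess = [pad] * n
    for sweep in range(1, n + 1):
        new = [next_token(guess[:i]) for i in range(n)]
        if new == guess:  # fixed point: identical to the sequential answer
            return new, sweep
        guess = new
    return guess, n


def chain(prefix: Sequence[int]) -> int:
    """Fully sequential toy: each token depends on the one before it."""
    return prefix[-1] + 1 if prefix else 1


def independent(prefix: Sequence[int]) -> int:
    """Fully parallel toy: each token depends only on its position."""
    return 7 * len(prefix)


print(jacobi_decode(chain, 5))        # ([1, 2, 3, 4, 5], 5)
print(jacobi_decode(independent, 5))  # ([0, 7, 14, 21, 28], 2)
```

The chain needs all 5 sweeps, matching the sequential step count, while the independent case settles in one sweep plus one to confirm that nothing changed. Real text sits somewhere in between, which is the exploitable parallelism described above.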
The short answer is that we do more than one parallel pass over multiple tokens: we iteratively refine them over a few passes to fix incoherences. This can be seen as a generalization of diffusion algorithms that underlie systems like Midjourney or Sora.