by moralestapia
412 days ago
>Mercury is up to 10x faster than frontier speed-optimized LLMs. Our models run at over 1000 tokens/sec on NVIDIA H100s, a speed previously possible only using custom chips.
This means on custom chips (Cerebras, Graphcore, etc.) we might see 10k-100k tokens/sec? Amazing stuff!
Also of note: it's funny how text generation started with autoregression over tokens and diffusion now seems to perform better, while image generation went the opposite way.

They're running Qwen on a traditional LLM pipeline. The "diffusion effect", as it says there, is just decorative, lmao. That in itself wouldn't be a deal-breaker, since I understand you have to put on a show. But judging by the latency and timing of their outputs, this is not a diffusion model as they claim. They're also not even close to the 1,000 TPS figure they put out.
I'm surprised nobody on this forum has caught on to that. I guess I should 4x my fee again :).