by laborcontract 914 days ago
This is really impressive. For reference, inference for Llama 70B on Together’s API generates text at roughly 60 tokens/second.

I can’t find any information about an API, though I’m guessing that the costs are eye-watering.

If they offered a Mixtral endpoint that did 300-400 tokens per second at a reasonable cost, I can’t imagine ever using another provider.


We don't have a publicly available API yet, but that's coming soon in the new year. We will be price-competitive with OpenAI but much faster. Deploying Mixtral is a work in progress, so keep your eyes open for that too!
Also make a long-context Mistral-7B that spits out 1000 T/s
I'll do it if you promise to say "wow!" :D
Here you go:

https://www.youtube.com/watch?v=9c078xKGwdU

It's 850 tokens per second, so you don't have to say "wow" yet!