| HN Mirror

by laborcontract 914 days ago

This is really impressive. For reference, inference for Llama 70B on Together’s API generates text at roughly 60 tokens/second.

I can’t find any information about an API, though I’m guessing that the costs are eye-watering.

If they offered a Mixtral endpoint that did 300-400 tokens per second at a reasonable cost, I can’t imagine ever using another provider.

1 comments

tome 914 days ago

We don't have a publicly available API yet, but that's coming soon in the new year. We will be price competitive with OpenAI but much faster. Deploying Mixtral is a work in progress, so keep your eyes open for that too!

visarga 914 days ago

Also make a long context Mistral-7B that spits 1000T/s

tome 914 days ago

I'll do it if you promise to say "wow!" :D

tome 907 days ago

Here you go:

https://www.youtube.com/watch?v=9c078xKGwdU

It's 850 tokens per second, so you don't have to say "wow" yet!
