| HN Mirror



	by badFEengineer 913 days ago
	This was surprisingly fast: 276.27 T/s (although Llama 2 70B is noticeably worse than GPT-4 Turbo). I'm curious whether there are good benchmarks for inference tokens per second. I imagine the numbers differ depending on whether a system is optimized for throughput or for single-inference speed, but I'm curious whether there's an analysis of this somewhere.

	Edit: I re-ran the same prompt on Perplexity's llama-2-70b and am getting 59 tokens per second there.
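For a rough sense of scale, the two figures quoted in this comment can be compared directly. This is only a back-of-the-envelope sketch: the 500-token response length is an assumed value for illustration, the helper names are made up, and both rates come from a single prompt each, so it is not a benchmark.

```python
def speedup(fast_tps: float, slow_tps: float) -> float:
    """Return how many times faster one tokens-per-second rate is than another."""
    if slow_tps <= 0:
        raise ValueError("slow_tps must be positive")
    return fast_tps / slow_tps


def seconds_for(tokens: int, tps: float) -> float:
    """Return the time to generate `tokens` tokens at a steady rate of `tps`."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    return tokens / tps


# Figures quoted in the comment above (one prompt each, not a benchmark).
demo_tps = 276.27
perplexity_tps = 59.0

# Assumed response length, chosen only for illustration.
response_tokens = 500

ratio = speedup(demo_tps, perplexity_tps)
print(f"speedup: {ratio:.2f}x")
print(f"demo:       {seconds_for(response_tokens, demo_tps):.1f} s")
print(f"perplexity: {seconds_for(response_tokens, perplexity_tps):.1f} s")
```

Note that this only covers single-stream generation speed. Aggregate throughput across many concurrent requests is a separate measurement and cannot be derived from these two numbers.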


andygeorge 913 days ago

fast but wrong/gibberish


razorguymania 913 days ago

It's using vanilla Llama 2 from Meta with no fine-tuning. The point here is the speed and responsiveness of the underlying HW and SW.


chihuahua 913 days ago

But if the quality of the response is poor, it's irrelevant that it was generated quickly. If it were using different data to generate higher-quality responses, would that not slow it down?


tome 912 days ago

nomel gave a good answer in a different thread:

> This is not about the model, it’s about the relative speed improvement from the hardware, with this model as a demo.

To compare apples to apples look at the tokens per second of other systems running Llama 2 70B 4096. We're by far the fastest!

https://news.ycombinator.com/item?id=38742466


andygeorge 913 days ago

Do you work there? Just curious


razorguymania 913 days ago

yes