by jensb1 1015 days ago
Could someone explain how token generation speed relates to the latency before the first token is output?

And if anyone has any latency metrics for the 70B model on a 4090, that would be very helpful.

2 comments

4090Ti? AFAIK that has never been released
typo
Unrelated. What matters for first-token latency is prompt processing time (prompt processing runs in the high hundreds of tokens per second).
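To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the time to first token is simply prompt length divided by prompt processing speed, and that the remaining tokens arrive at the generation speed. The function names and the figures (800 tokens/s for prompt processing, 20 tokens/s for generation) are made up for illustration; they are not measurements from a 4090 or from the 70B model.

```python
def time_to_first_token(prompt_tokens: int, prompt_tps: float) -> float:
    """Seconds until the first output token, assuming it is dominated
    by prompt processing (simplification: fixed overheads are ignored)."""
    return prompt_tokens / prompt_tps


def total_latency(prompt_tokens: int, output_tokens: int,
                  prompt_tps: float, generation_tps: float) -> float:
    """Seconds for the whole response: prompt processing plus generation."""
    return (time_to_first_token(prompt_tokens, prompt_tps)
            + output_tokens / generation_tps)


# Hypothetical speeds, chosen only to make the arithmetic easy to follow.
PROMPT_TPS = 800.0      # prompt processing, tokens per second (assumed)
GENERATION_TPS = 20.0   # token generation, tokens per second (assumed)

ttft = time_to_first_token(1600, PROMPT_TPS)
total = total_latency(1600, 100, PROMPT_TPS, GENERATION_TPS)
print(f"first token after {ttft:.1f} s, full response after {total:.1f} s")
```

Under these assumptions, changing the generation speed moves only the second term, so the wait for the first token stays the same for a given prompt length.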