| HN Mirror


	by jezzarax 741 days ago
	llama.cpp + llama-3-8b in Q8 runs great on a single T4 machine. I can't remember the TPS I got there, but it was well above the 6 mentioned in the article.

1 comments

veryrealsid 741 days ago

Interesting. I got very different results depending on how I ran the model, so I will definitely give this a try!

edit: Actually, could you share how long a query took? One of our issues is that we need it to respond quickly.

jezzarax 741 days ago

I checked some logs from my past experiments: prompt processing ran at about 400 tps over a ~3k-token query, so about 7 seconds to process it, and then the generation speed was about 28 tps.
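
For anyone who wants to redo that arithmetic with their own numbers, here is a rough sketch. It reads the generation figure as 28 tokens per second, and the 200-token reply length is purely an assumption for illustration (it is not from my logs). The function name is made up for this example and is not part of llama.cpp.

```python
def estimate_latency_s(prompt_tokens, prompt_tps, output_tokens, generation_tps):
    """Return (prompt_seconds, generation_seconds, total_seconds).

    Simple model: the whole prompt is processed first, then the reply is
    generated one token at a time at a steady rate.
    """
    prompt_s = prompt_tokens / prompt_tps
    generation_s = output_tokens / generation_tps
    return prompt_s, generation_s, prompt_s + generation_s


# ~3k-token prompt at ~400 tps and ~28 tps generation are the figures quoted
# above; the 200-token reply length is an assumed value.
prompt_s, generation_s, total_s = estimate_latency_s(3000, 400, 200, 28)
print(f"prompt: {prompt_s:.1f} s, generation: {generation_s:.1f} s, total: {total_s:.1f} s")
```

With these inputs the prompt takes roughly as long as the reply, so for a fast response time the prompt length matters about as much as the generation speed.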
