by muskmusk 1033 days ago
That is impressive. Interesting that the prefill (I am guessing this is prompt processing) is so much slower than decoding.

It's my understanding that under normal circumstances decoding is memory-bandwidth bound, which prompt processing isn't, thanks to batching. Is there some quirk in your setup?
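
For a rough sense of what "memory-bandwidth bound" means for decoding, here is a back-of-envelope sketch. It assumes each decoded token has to read every weight from memory once, so bandwidth divided by model size gives a ceiling on tokens/sec. The function name, the 4-bit weights (0.5 bytes per parameter) and the ~400 GB/s figure for an M1 Max are my assumptions, not numbers from this thread.

```python
def decode_tokens_per_sec_upper_bound(params_billion, bytes_per_param, bandwidth_gb_s):
    """Rough ceiling on decode speed if every weight is read once per token."""
    # billions of parameters * bytes per parameter = model size in GB
    model_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / model_gb


# Assumed values (not from the thread): 70B parameters, 4-bit weights,
# ~400 GB/s peak memory bandwidth.
bound = decode_tokens_per_sec_upper_bound(70, 0.5, 400)
print(f"{bound:.1f} tokens/sec")  # prints "11.4 tokens/sec"
```

This ignores the KV cache, activations and any runtime overhead, so real decode speed should land below the bound. Prefill handles many prompt tokens per pass over the weights, which is why it is normally expected to be the faster of the two per token.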

1 comments

Strange. I'm running Llama2 70b on Chrome Canary on a 64GB MacBook M1 Max...~1.5 years older...and seeing better performance.

It's slow but usable!

prefill: 2.1963 tokens/sec, decoding: 3.4708 tokens/sec