


	by ibeckermayer 106 days ago
	Cool that it's possible, but the performance characteristics make it basically unusable. For an 8192-token prompt they report a ~1.5 minute time-to-first-token (TTFT) and then 8.30 tk/s from there. For context, ChatGPT is typically <<1 s TTFT and ~50 tk/s.
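To put those numbers side by side, here is a rough sketch of end-to-end wait time as TTFT plus generation time. The 500-token response length is an assumption picked for illustration, and the ChatGPT figures are the ballpark ones quoted above (with 1 s used as an upper bound for TTFT), not measurements.

```python
def response_time_s(ttft_s: float, tokens_per_s: float, n_tokens: int) -> float:
    """Seconds until the last token arrives: time-to-first-token plus decode time."""
    return ttft_s + n_tokens / tokens_per_s


N_TOKENS = 500  # assumed response length, for illustration only

# Reported figures for the cluster: ~90 s TTFT, 8.30 tk/s decode.
cluster = response_time_s(ttft_s=90.0, tokens_per_s=8.30, n_tokens=N_TOKENS)
# Ballpark ChatGPT figures from the comment: TTFT under 1 s, ~50 tk/s.
chatgpt = response_time_s(ttft_s=1.0, tokens_per_s=50.0, n_tokens=N_TOKENS)

print(f"cluster: {cluster:.0f} s, ChatGPT (ballpark): {chatgpt:.0f} s")
```

Under those assumptions the full answer takes roughly two and a half minutes on the cluster versus about eleven seconds, and most of the gap on short answers comes from the TTFT term.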

2 comments

fc417fc802 106 days ago

Given that the APU only has 4 memory channels, isn't this setup comically starved for bandwidth? By the same token, wouldn't you expect performance to scale approximately linearly as you add additional boxes? And wouldn't you be better off with smaller nodes (i.e., less RAM and CPU power per box)?

If I'm right about that, then for somewhere in the vicinity of $30k (24x the Max 385 model) you should be able to achieve ChatGPT performance.
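A minimal sketch of the scaling argument, assuming decode speed is limited purely by how fast each box can stream its share of the weights from RAM, and that the boxes work on their shares in parallel with no interconnect cost. The function name and the example numbers (100 GB read per token, 200 GB/s per box) are hypothetical placeholders, not specs of the hardware in question.

```python
def decode_tokens_per_s(bytes_read_per_token: float,
                        bandwidth_per_node_bps: float,
                        nodes: int) -> float:
    """Bandwidth-bound estimate: each node streams 1/nodes of the weights per token."""
    bytes_per_node = bytes_read_per_token / nodes
    seconds_per_token = bytes_per_node / bandwidth_per_node_bps
    return 1.0 / seconds_per_token


# Hypothetical placeholder values, chosen only to show the shape of the curve.
WEIGHT_BYTES = 100e9      # bytes read per generated token (assumed)
NODE_BANDWIDTH = 200e9    # memory bandwidth per box in bytes/s (assumed)

for n in (1, 4, 24):
    rate = decode_tokens_per_s(WEIGHT_BYTES, NODE_BANDWIDTH, n)
    print(f"{n:>2} boxes: {rate:.1f} tk/s")
```

In this idealized model the rate is exactly proportional to the number of boxes, which is the linear scaling being proposed. Whether a real cluster gets close depends on how the model is partitioned and on the interconnect, neither of which this sketch models.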


ibeckermayer 105 days ago

Good thought... but I think you're wrong, because the dominant factor is bandwidth over the interconnect. In this case they're using 5 Gbps Ethernet; compare that to 80-120 Gbps for a Thunderbolt 5-connected Mac Studio cluster: https://www.youtube.com/watch?v=bFgTxr5yst0


fc417fc802 105 days ago

> I think you're wrong because the dominant factor is bandwidth over the interconnect.

Is it? Why do you say that? I understand inference to be almost entirely bottlenecked on memory bandwidth.

There are n^2 weights per layer but only n state values in the vector that is passed between layers. Transmitting a few thousand (or even tens of thousands of) floating-point values does not require a notable amount of bandwidth by modern standards.

Training is an entirely different beast, of course. And depending on the workload, latency can also impact performance. But for running inference with a single query from a single user, I don't see how inter-node bandwidth is going to matter.
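As a back-of-the-envelope check on the n^2 versus n point, here is a small sketch of how long one hand-off of the state vector would take over the 5 Gbps link mentioned upthread. The hidden size of 8192 and the 2 bytes per value (fp16) are assumptions for illustration; the calculation covers a single token during decode and ignores link latency and protocol overhead.

```python
def transfer_time_s(n_values: int, bytes_per_value: int, link_bits_per_s: float) -> float:
    """Time to push n_values over a link, counting raw payload bits only."""
    payload_bits = n_values * bytes_per_value * 8
    return payload_bits / link_bits_per_s


HIDDEN = 8192          # assumed width n of the state vector
BYTES_PER_VALUE = 2    # assumed fp16 activations
LINK_BPS = 5e9         # the 5 Gbps Ethernet figure from the thread

per_hop = transfer_time_s(HIDDEN, BYTES_PER_VALUE, LINK_BPS)
weights_per_layer = HIDDEN ** 2   # the n^2 term, which stays local to each box

print(f"state vector per hop: {per_hop * 1e6:.1f} microseconds")
print(f"weights per layer vs values sent: {weights_per_layer // HIDDEN}x")
```

With these assumed sizes a hop costs on the order of tens of microseconds of raw transfer time, so per-token bandwidth looks negligible; round-trip latency per hop is a separate cost that this sketch does not capture.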


JKCalhoun 106 days ago

I've never understood the obsession with token/s. I'm fine with asking a question and then going on to another task (which might be making coffee).

Even with a cloud-based LLM where the response is pretty snappy, I still find that I wander off and return when I am ready to digest the entire response.


ibeckermayer 105 days ago

Your workflow is unusual. Oftentimes there is a vigorous back and forth, or a desired output like code generation, where a low tk/s drastically affects UX and user productivity.

But the real kicker here is the 90 s TTFT: you ask a question and don't see anything for a full minute and a half.


nitinreddy88 106 days ago

You are fine with it, but maybe the rest of the world is not. Anyway, to compare and benchmark performance we need metrics, and this is one of the most basic ones to measure.
