Hacker News
by angoragoats 643 days ago
You could for sure, but the Nvidia setup described in this article would be many times faster at inference. So it's a tradeoff between power consumption and performance.

Also, modern GPUs are surprisingly good at throttling their power usage when not actively in use, just like CPUs. So while you need a 3kW+ PSU for an 8x3090 setup, it's not going to draw anywhere near 3kW on average unless you're literally using the LLM 24/7.
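A rough sketch of that arithmetic, assuming 350 W per RTX 3090 under load (its reference board power), plus a hypothetical 25 W per idle card and 300 W for the rest of the system. The idle and system figures are placeholders, not measurements:

```python
# Back-of-the-envelope power for an 8x RTX 3090 rig.
# GPU_LOAD_W is the 3090's reference board power; GPU_IDLE_W and
# REST_OF_SYSTEM_W are hypothetical placeholders, not measurements.
GPU_COUNT = 8
GPU_LOAD_W = 350.0
GPU_IDLE_W = 25.0
REST_OF_SYSTEM_W = 300.0


def peak_draw_w() -> float:
    """Worst case: every GPU at full board power at the same time."""
    return GPU_COUNT * GPU_LOAD_W + REST_OF_SYSTEM_W


def average_draw_w(duty_cycle: float) -> float:
    """Average draw if the GPUs are busy for `duty_cycle` (0..1) of the time."""
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    per_gpu = duty_cycle * GPU_LOAD_W + (1.0 - duty_cycle) * GPU_IDLE_W
    return GPU_COUNT * per_gpu + REST_OF_SYSTEM_W


if __name__ == "__main__":
    print(f"peak: {peak_draw_w():.0f} W")
    for duty in (0.0, 0.1, 0.5, 1.0):
        print(f"{duty:>4.0%} busy: {average_draw_w(duty):.0f} W average")
```

With these placeholder numbers the PSU has to cover about 3.1 kW at peak, while at 10% utilization the average works out to roughly 760 W.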

3 comments

Even if you are running it constantly, the per-token power consumption is likely to be in a similar range, not to mention you'd need 10+ Macs to match the throughput.
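Energy per token is just power divided by throughput, so a box drawing 15x the power but producing 15x the tokens costs about the same per token. A minimal sketch; all wattage and tokens/s figures below are hypothetical, chosen only to show the calculation:

```python
import math


def joules_per_token(power_w: float, tokens_per_s: float) -> float:
    """Energy per generated token: watts (J/s) divided by tokens/s."""
    return power_w / tokens_per_s


def machines_needed(target_tokens_per_s: float, per_machine_tokens_per_s: float) -> int:
    """How many identical machines it takes to reach a target aggregate throughput."""
    return math.ceil(target_tokens_per_s / per_machine_tokens_per_s)


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    gpu_rig = joules_per_token(power_w=3000, tokens_per_s=300)
    mac = joules_per_token(power_w=200, tokens_per_s=20)
    print(f"GPU rig: {gpu_rig:.1f} J/token, Mac: {mac:.1f} J/token")
    print("Macs needed to match the rig:", machines_needed(300, 20))
```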
I have a 3090 power-capped at 65%, and I notice only a minimal difference in performance.
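On Linux a cap like this is usually set with `nvidia-smi -pl <watts>` (root required; the driver rejects values outside the card's supported range). A small sketch that turns a percentage into that command, assuming the 3090's 350 W reference board power as the 100% baseline. Some partner cards ship with a different default limit, so check `nvidia-smi -q -d POWER` on your own card:

```python
from typing import List


def power_limit_watts(default_w: float, fraction: float) -> int:
    """Target power limit, truncated to whole watts."""
    return int(default_w * fraction)


def power_limit_cmd(gpu_index: int, watts: int) -> List[str]:
    """Build (but do not run) an nvidia-smi call that sets one GPU's power limit."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


if __name__ == "__main__":
    watts = power_limit_watts(350, 0.65)  # 65% of the assumed 350 W default
    for idx in range(8):  # one command per card in an 8-GPU rig
        print(" ".join(power_limit_cmd(idx, watts)))
```

Passing one of those lists to `subprocess.run(..., check=True)` would actually apply the limit.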
Can Reflection:70b work on them?
Pretty sure it'll work wherever any 70b model would, but it's probably not noticeably better than Llama 3.1 70b, if the reports I'm reading now are correct.[1]

[1] https://x.com/JJitsev/status/1832758733866222011

Maybe you meant to reply to a different comment? Work on what?

Edit: I guess to answer your question directly, I don't see why you couldn't run a 70b model at full quality on either an M2 192GB machine or an 8x3090 setup.
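Rough arithmetic for why either box fits, assuming "full quality" means unquantized 16-bit (fp16/bf16) weights at 2 bytes per parameter, counting weights only (KV cache and runtime overhead need extra room on top), and using decimal GB so the numbers are approximate:

```python
GB = 1e9  # decimal gigabytes


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the model weights."""
    return params * bytes_per_param / GB


def fits(params: float, bytes_per_param: float, capacity_gb: float) -> bool:
    """True if the weights alone fit in the given memory capacity."""
    return weight_memory_gb(params, bytes_per_param) <= capacity_gb


if __name__ == "__main__":
    rigs = {"8x RTX 3090 (8 x 24 GB)": 8 * 24, "M2 with 192 GB": 192}
    need = weight_memory_gb(70e9, 2)  # 70B params at fp16/bf16
    print(f"70b at 16-bit: {need:.0f} GB of weights")
    for name, capacity in rigs.items():
        print(f"{name}: fits={fits(70e9, 2, capacity)}")
```

That leaves roughly 52 GB of headroom on either machine; on the 3090s the weights have to be split across cards, about 17.5 GB each.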