Running a 680-billion-parameter frontier model on a few Macs (at 13 tok/s!) is nuts. That's two years after ChatGPT was released. That rate of progress just blows my mind.
And those are M2 Ultras. M4 Ultra is about to drop in the next few weeks/months, and I'm guessing it might have higher RAM configs, so you can probably run the same 680b on two of those beasts.
The higher-performing chips, with one less interconnect, are going to give you significantly higher t/s.
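For a rough sense of how much RAM each box would need, here's a back-of-the-envelope sketch. It counts weights only (no KV cache, activations, or OS overhead), and the bit widths and machine counts are my assumptions for illustration, not measured numbers; the helper names are made up.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 10^9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


def per_machine_gb(params_billion: float, bits_per_weight: float, machines: int) -> float:
    """Weight memory each machine must hold if the model is split evenly."""
    return weight_gb(params_billion, bits_per_weight) / machines


if __name__ == "__main__":
    # Assumed quantization levels (4-bit and 8-bit) and cluster sizes.
    for bits in (4, 8):
        for machines in (2, 3, 4):
            need = per_machine_gb(680, bits, machines)
            print(f"{bits}-bit, {machines} machines: ~{need:.0f} GB of weights each")
```

Real usable memory will be lower than the installed RAM, so treat these as lower bounds on what each machine needs.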