Hacker News
by asimpleusecase 876 days ago
Did you do anything special to make that work? Is it useful? Or just a toy?
4 comments

I have a 14" MBP with an M1 Max and 64GB. The M3 won't really make a difference, but the RAM, since it's unified, is huge. I can run most models on this machine with realtime performance, unlike on a Ryzen 7735HS with 64GB (DDR5). Now I'm not saying the Ryzen setup should be good, but the M1 architecture just makes it a much better option. I could add an eGPU to the Ryzen system and it would likely do better, but that would also exceed the price point and hurt portability.
it's not just that it's huge and unified - ryzen APUs obviously can have 2x32GB SODIMMs put in them and they support unified memory too.

the difference is the bandwidth and the computational power of the APU. M1 Max is roughly similar to a PS5 in terms of overall system design (shader configuration and bandwidth) plus has dedicated AI inference units already (which won't be added to consoles until PS5 Pro launches with RDNA 3.5). It is far more bandwidth than you can get out of a socketed-memory laptop system.

https://twitter.com/Locuza_/status/1450271726827413508

To support that level of performance in a socketed-memory system you will need an extra layer of caching added to the processor to supplement the bandwidth - and maybe still need to go to quad-channel. Those products are Strix and Strix Halo and should be hitting the market over the next year or two but the reality is that the M1 Max was an absurdly powerful laptop, far more potent than even the first-gen 5nm laptops for x86 let alone the other junk you could buy in 2020.
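For a rough sense of the bandwidth gap described above, here is a back-of-envelope sketch. It assumes a 512-bit LPDDR5-6400 interface for the M1 Max and dual-channel (128-bit total) DDR5-4800 for the socketed laptop; neither figure is stated in the thread. It also assumes that token generation at batch size 1 has to read every weight once per token, which gives a ceiling, not a benchmark. The 20 GB model size is a made-up example.

```python
def peak_bandwidth_gbs(bus_width_bits: int, mega_transfers_per_s: int) -> float:
    """Theoretical peak memory bandwidth in GB/s (decimal gigabytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mega_transfers_per_s * 1e6 / 1e9


def max_tokens_per_s(bandwidth_gbs: float, model_size_gb: float) -> float:
    """Upper bound on tokens/s if every weight is read once per token."""
    return bandwidth_gbs / model_size_gb


# Assumed memory configurations (not taken from the thread).
M1_MAX_GBS = peak_bandwidth_gbs(512, 6400)        # about 409.6 GB/s
DDR5_SODIMM_GBS = peak_bandwidth_gbs(128, 4800)   # about 76.8 GB/s

if __name__ == "__main__":
    model_gb = 20.0  # hypothetical quantized model size
    for name, bw in [("M1 Max", M1_MAX_GBS), ("2x DDR5-4800", DDR5_SODIMM_GBS)]:
        print(f"{name}: {bw:.1f} GB/s, at most {max_tokens_per_s(bw, model_gb):.1f} tok/s")
```

Under these assumptions the gap is a bit over 5x before any compute differences come into play, which is why extra cache or more memory channels would be needed on the socketed side.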

This is the problem with the discourse around apple silicon for the last few years: yeah, they're expensive, but even a loaded-out x86 laptop doesn't get you the same capabilities. Even if the x86 is competitive in some particular benchmark on iso-node you are probably spending more power to do it, and the x86 product comes years after the apple product, and still has a much weaker gpu and less bandwidth (which doesn't just matter for GPU, it matters for compiling and JIT too).

It is incredibly silly to look back on the discourse in 2020-2023 around apple silicon: a lot of reviewers made extremely silly claims about how "even 7nm x86 processors were already competitive with apple silicon", and as the ecosystems have matured it is obvious that even 5nm processors are not quite competitive yet. And they dumped on the SPEC tests and Geekbench that measured this properly, in favor of dumb things like cinebench R23 and so on (it's always cinebench used for this dumb shit tbh, CB R13/R15 were hugely misleading at the zen1 launch too), let alone things like, you know, compiling or JVM/node workloads...

(similarly, gotta love the vibe a few years ago of: "threadripper vs mac pro" - did you know that a 64C threadripper with 256GB RAM is actually cheaper than a mac pro loaded out with 2TB!? waow, who knew systems with an order of magnitude less capacity would be cheaper!? https://youtu.be/BH291DQRIOg )

I've had less luck with Mixtral, but I run Yi 34B finetunes for general personal use, including quick queries for work.

It's kinda like GPT 3.5, with no internet access and slightly less reliable responses, but unrestrained, much faster and with a huge (up to 75K on my Nvidia 3090) usable context.

Mixtral is extremely fast though, at least at a batch size of 1.

Which Yi 34B finetunes are you using that have a 75,000 token length?
All of the Yi 200K finetunes should support it, but you have to be careful because some degrade the base model's quite excellent long context performance more than others. The very strong Bagel 34B DPO model, for instance, basically doesn't work at long context.

Nous Capybara is a popular one. I personally use my own merge of many models, and you can look through the constituent models to see if any interest you: https://huggingface.co/brucethemoose/Yi-34B-200K-DARE-megame...

You can't really use llama.cpp for super long context btw, it's just too slow and VRAM inefficient at the moment.
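To see why VRAM efficiency matters so much at the 75K context mentioned upthread, here is a rough KV cache estimate. The Yi 34B shape used (60 layers, 8 key/value heads, head dimension 128) is an assumption based on the model's published config as I remember it, so check it against the actual config.json. The thread doesn't say which backend or cache precision is in use, so both 16-bit and 8-bit are shown.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int) -> int:
    """Size of the key/value cache for one sequence, in bytes."""
    # Factor of 2: one tensor for keys and one for values per layer.
    values_per_token = 2 * n_layers * n_kv_heads * head_dim
    return values_per_token * bytes_per_value * n_tokens


# Assumed Yi 34B attention shape (verify against the model's config.json).
YI_34B = dict(n_layers=60, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    tokens = 75_000
    for label, width in [("16-bit cache", 2), ("8-bit cache", 1)]:
        size_gb = kv_cache_bytes(tokens, bytes_per_value=width, **YI_34B) / 1e9
        print(f"{label}: {size_gb:.1f} GB for {tokens} tokens")
```

With these numbers a 16-bit cache alone takes most of a 24GB card at 75K tokens, so fitting the weights alongside it depends on heavy weight quantization and a compact cache.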

Nothing special other than llama.cpp, which is an inference engine optimized for apple silicon.

I heard you can simply install the ollama app, which uses llama.cpp under the hood but has a more user-friendly experience.
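If you go the ollama route, it also runs a local HTTP server you can script against. A minimal sketch, assuming the server is on its default port (11434) and that you have already pulled a model; the model name in the usage note is only a placeholder.

```python
import json
import urllib.request

# ollama's default local endpoint; change it if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Usage would look like `ask("mixtral", "What does the -p flag do in mkdir?")`, with whatever model name you actually have installed.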

I've been using it for 'easy' queries like syntax/parameter questions, in place of ChatGPT 4. It's great for that. I am using a ~48GB version.