| HN Mirror


	by corn13read2 801 days ago
	A MacBook is cheaper though

6 comments

tgtweak 801 days ago

The extra $3k you'd spend on a quad-4090 rig vs. the top MBP... ignoring the fact that you can't put the two on even ground for versatility (very few libraries are adapted to Apple silicon, let alone optimized).

Very few people who would consider an H100/A100/A800 are going to be cross-shopping a MacBook Pro for their workloads.

LoganDark 801 days ago

> very few libraries are adapted to Apple silicon, let alone optimized

This is a joke, right? Have you been anywhere in the LLM ecosystem for the past year or so? I'm constantly hearing about new ways in which ASi outperforms traditional platforms, and new projects that are optimized for ASi. Such as, for instance, llama.cpp.

cavisne 801 days ago

Nothing compared to Nvidia though. The FLOPS and memory bandwidth are simply not there.

spudlyo 801 days ago

The memory bandwidth of the M2 Ultra is around 800 GB/s versus 1008 GB/s for the 4090. While it’s true the M2 has neither the bandwidth nor the GPU power, it is not limited to 24GB of VRAM per card. With its 192GB upper limit, the M2 Ultra will have a much easier time running inference on a 70+ billion parameter model, if that is your aim.

Besides size, heat, fan noise, and not having to build it yourself, this is the only area where Apple Silicon might have an advantage over a homemade 4090 rig.
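
To put rough numbers on the capacity point, here is a minimal sketch that estimates the memory needed for the weights of a 70-billion-parameter model at a few precisions and checks it against the 24GB and 192GB figures mentioned above. The function name and the chosen bit widths are illustrative assumptions; the estimate covers weights only (no KV cache or activations), and not all of a Mac's unified memory is necessarily available to the GPU.

```python
GB = 1e9  # decimal gigabytes, matching how memory sizes are quoted in the thread


def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for the model weights alone, in GB (ignores KV cache and activations)."""
    return n_params * bits_per_weight / 8 / GB


if __name__ == "__main__":
    n_params = 70e9  # "70+ billion parameter model" from the comment
    for bits in (16, 8, 4):  # assumed precisions: fp16, int8, 4-bit quantization
        size = weight_footprint_gb(n_params, bits)
        print(
            f"{bits:>2}-bit weights: {size:6.1f} GB | "
            f"fits in 24 GB: {size <= 24} | fits in 192 GB: {size <= 192}"
        )
```

Under these assumptions even a 4-bit 70B model needs more than a single 24GB card holds, which is why it has to be split across several cards on a 4090 rig.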

LoganDark 801 days ago

It doesn't need GPU power to beat the 4090 in benchmarks: https://appleinsider.com/articles/23/12/13/apple-silicon-m3-...

int_19h 801 days ago

It doesn't beat the RTX 4090 when it comes to actual LLM inference speed. I bought a Mac Studio for local inference because it was the most convenient way to get something fast enough and with enough RAM to run even 155b models. It's great for that, but ultimately it's not magic - Nvidia hardware still offers more FLOPS and faster RAM.
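
A common rule of thumb for why RAM speed matters so much here: when generating one token at a time, a dense model has to read roughly all of its weights per token, so memory bandwidth divided by model size gives an upper bound on tokens per second. The sketch below applies that to the bandwidth figures quoted earlier in the thread (around 800 GB/s and 1008 GB/s). The 20 GB model size is a hypothetical value chosen so the model fits on either device, and the bound ignores compute, KV-cache reads, and batching.

```python
def decode_tokens_per_second_upper_bound(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Bandwidth-bound ceiling on single-stream decode speed.

    Assumes every generated token reads all weights once from memory.
    """
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    model_size_gb = 20.0  # hypothetical model size, small enough for a 24GB card
    for name, bandwidth in (("M2 Ultra", 800.0), ("RTX 4090", 1008.0)):
        bound = decode_tokens_per_second_upper_bound(bandwidth, model_size_gb)
        print(f"{name}: at most ~{bound:.1f} tokens/s")
```

Real throughput will be lower than this ceiling on both platforms; the sketch only shows how the bandwidth gap translates into a speed gap.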

LoganDark 801 days ago

Yeah. Let me just walk down to Best Buy and get myself a GPU with over 24 gigabytes of VRAM (impossible) for less than $3,000 (even more impossible). Then tell me ASi is nothing compared to Nvidia.

Even the A100 for something around $15,000 (edit: used to say $10,000) only goes up to 80 gigabytes of VRAM, but a 192GB Mac Studio goes for under $6,000.

Those figures alone prove Nvidia isn't even competing in the consumer or even the enthusiast space anymore. They know you'll buy their hardware if you really need it, so they aggressively segment the market with VRAM restrictions.

andersa 801 days ago

Where are you getting an A100 80GB for $10k?

LoganDark 801 days ago

Oops, I remembered it being somewhere near $15k but Google got confused and showed me results for the 40GB instead so I put $10k by mistake. Thanks for the correction.

A100 80GB goes for around $14,000 - $20,000 on eBay and A100 40GB goes for around $4,000 - $6,000. New (not from eBay - from PNY and such), it looks like an 80GB would set you back $18,000 to $26,000 depending on whether you want HBM2 or HBM2e.

Meanwhile you can buy a Mac Studio today without going through a distributor and they're under $6,000 if the only thing you care about is having 192GB of Unified Memory.

And while the memory bandwidth isn't quite as high as the 4090's, the M-series chips can run certain models faster anyway, if Apple is to be believed.
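
The price comparison above boils down to dollars per gigabyte of memory. A minimal sketch of that arithmetic, using only the price ranges quoted in this comment: the `Listing` class is a hypothetical helper, and "under $6,000" for the Mac Studio is treated as an upper bound of $6,000. It compares capacity only and says nothing about bandwidth or compute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    name: str
    memory_gb: int
    price_low: float   # USD, low end of the quoted range
    price_high: float  # USD, high end of the quoted range

    def dollars_per_gb(self) -> tuple:
        """Return (low, high) cost per GB of memory."""
        return (self.price_low / self.memory_gb, self.price_high / self.memory_gb)


LISTINGS = [
    Listing("A100 80GB (eBay)", 80, 14_000, 20_000),
    Listing("A100 40GB (eBay)", 40, 4_000, 6_000),
    # "under $6,000" in the comment, so this is an upper bound, not a range
    Listing("Mac Studio 192GB", 192, 6_000, 6_000),
]

if __name__ == "__main__":
    for item in LISTINGS:
        low, high = item.dollars_per_gb()
        print(f"{item.name}: ${low:.2f} to ${high:.2f} per GB")
```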

andersa 801 days ago

Sure, it's also at least an order of magnitude slower in practice, compared to 4x 4090 running at full speed. We're looking at 10 times the memory bandwidth and much greater compute.

chaostheory 800 days ago

Yeah, even a Mac Studio is way too slow compared to Nvidia, which is too bad because at $7000 maxed out to 192GB it would be an easy sell. Hopefully they will fix this by the M5. I don’t trust the marketing for the M4.

faeriechangling 800 days ago

Buying a MacBook for AI is great if you were already going to buy a MacBook, as this makes it a lot more cost competitive. It's also great if what you're doing is REALLY privacy sensitive, such as if you're a lawyer, where uploading client data to OpenAI is probably not appropriate or legal.

But in general, I find the appeal is narrow because consumer GPUs are better for training in general and for inferencing at scale[1]. Cloud services also allow the vast majority of individuals to get higher-quality inferencing at lower cost. The result is that Apple Silicon's appeal is quite niche.

[1] Mind you, Nvidia considers this a licensing violation, not that GeoHot has historically ever been all that scared to violate a EULA and force a company to prove its terms have legal force.

So is a TI-89.

And looks way cooler

4x32GB (128GB) DDR4 is ~$250. 4x48GB (192GB) DDR5 is ~$600. Those are even cheaper than the upgrade options for Macs ($1k).

papichulo2023 801 days ago

Not many consumer mobos support 192GB of DDR5.

faeriechangling 800 days ago

Most consumer mobos I see support this even if the setup isn't on the QVL. If a DDR5 motherboard supports 4 sticks at all, you can probably run 192GB on it so long as you update the BIOS firmware. The problem is running at rated speeds.

AMD tends to be worse than Intel, and I hear of people having to run anywhere between DDR5-3200 and DDR5-5200. You are better off running two sticks, because even with 2 sticks you really can't run larger models with acceptable performance anyway, much less with 4.

There is competition to Apple on the low end (dual-channel fast DDR5) and on the high end (8+ channels, like Xeon/Epyc/AmpereOne). In the middle, Apple is sort of crushing it, because if you run a true 4-channel system you're going to get poor performance if you load up a 192GB model, and if you compare pricing to 96GB/128GB Apple systems, there's not all that much of a cost advantage and you have to make a lot of sacrifices to get there. The truth is that Apple really doesn't have all that much competition right now and won't for the foreseeable future.

papichulo2023 800 days ago

Hopefully Qualcomm will free us from this 2-channel nightmare.

wtallis 799 days ago

I don't think it's realistic to pin your hopes on Qualcomm given that they're unlikely to care about supporting anything other than LPDDR with their laptop processors.

faeriechangling 799 days ago

Personally, I’m optimistic about APUs like AMD's upcoming Strix Halo, with a 256-bit memory bus, competing at the lower end of the market, but that will only provide so much competition.

wtallis 801 days ago

If it supports DDR5 at all, then it should be at most a firmware update away from supporting 48GB dual-rank DIMMs. There are very few consumer motherboards that only have two DDR5 slots; almost all have the four slots necessary to accept 192GB. If you are under the impression that there's a widespread limitation on consumer hardware support for these modules, it may simply be due to the fact that 48GB modules did not exist yet when DDR5 first entered the consumer market, and such modules did not start getting mentioned on spec sheets until after they existed.

imtringued 800 days ago

You don't want to use more than two slots because you only have two memory channels. The overclocking potential of DDR5 is extremely high when you only run two DIMMs. All the way up to 8000. Meanwhile if you go for populating all four slots, you are limited significantly below 5000. Almost a 50% performance drop if you are willing to overclock your RAM.
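
For a sense of scale, theoretical peak DRAM bandwidth is the transfer rate times the bytes moved per transfer times the number of channels. The sketch below works that out for the two speeds mentioned above, assuming a typical consumer platform with two 64-bit channels; the function name and defaults are illustrative, and real sustained bandwidth will be lower than these peaks.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s: float, channels: int = 2, bus_width_bits: int = 64) -> float:
    """Theoretical peak memory bandwidth in GB/s.

    Assumes `channels` independent channels, each `bus_width_bits` wide.
    """
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


if __name__ == "__main__":
    fast = peak_bandwidth_gb_s(8000)  # two DIMMs, overclocked
    slow = peak_bandwidth_gb_s(5000)  # four DIMMs, at the ceiling mentioned above
    print(f"DDR5-8000, dual channel: {fast:.0f} GB/s")
    print(f"DDR5-5000, dual channel: {slow:.0f} GB/s")
    print(f"Relative drop: {100 * (1 - slow / fast):.1f}%")
```

Since four-DIMM setups are described as landing well below 5000, the actual drop would be larger than the figure this prints for exactly 5000.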

wtallis 800 days ago

If you want to run something that doesn't fit in 96GB of RAM, you'll get better performance from having enough RAM. Yes, having two dual-rank DIMMs per channel will force you to run at a slower speed, but it's still far faster than your SSD. The second slot per channel exists precisely because many people really do want to use it.

ojbyrne 801 days ago

A lot of boards whose specs show a max of 4x32GB DDR5 actually support 4x48GB DDR5 via recent BIOS updates.

papichulo2023 800 days ago

In the specs, yep; in practice, hardly anyone has gotten it working. As far as I saw on Reddit, it requires customizing timings to make 4 slots work over 6000 MHz at the same time.

thangngoc89 801 days ago

Training on the MPS backend is suboptimal and really slow.

wtallis 801 days ago

Do people do training on systems this small, or just inference? I could see maybe doing a little bit of fine-tuning, but certainly not from-scratch training.

redox99 801 days ago

If you mean train llama from scratch, you aren't going to train it on any single box.

But even with a single 3090 you can do quite a lot with LLMs (through QLoRA and similar).
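
Part of why LoRA-style fine-tuning fits on one consumer card is that it trains only a small low-rank update per weight matrix: for a weight of shape d_out x d_in, LoRA adds two factors of shape d_out x r and r x d_in, so r * (d_out + d_in) trainable parameters. The sketch below counts them for one projection; the 4096 x 4096 shape and rank 16 are hypothetical values for illustration, not taken from any specific model. QLoRA additionally stores the frozen base weights in 4-bit form, which this count does not cover.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds for one d_out x d_in weight matrix.

    The update is B @ A with B of shape (d_out, rank) and A of shape (rank, d_in).
    """
    return rank * (d_out + d_in)


if __name__ == "__main__":
    d_out, d_in, rank = 4096, 4096, 16  # hypothetical projection shape and rank
    full = d_out * d_in
    lora = lora_trainable_params(d_out, d_in, rank)
    print(f"Full matrix: {full:,} parameters")
    print(f"LoRA update: {lora:,} parameters ({100 * lora / full:.2f}% of full)")
```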

thangngoc89 800 days ago

Yep. The price/performance of a multi-4090 system is way better than that of the professional cards (Axxx). Also, deep learning outside of LLMs has many different uses.
