| HN Mirror



	by smpanaro 454 days ago
	What do you mean by less wide? The main bottleneck for transformers is memory bandwidth. The ANE has a much lower bandwidth ceiling than the CPU/GPU (yes, despite unified memory). Chunking is actually beneficial as long as all the chunks can fit into the ANE’s cache. It speeds up compilation for large network graphs, and cached loads have negligible cost. On M1 the cache limit is 3-4GB, but it is higher on M2+.


conradev 454 days ago

I was referring to both the lower memory bandwidth and lower FLOPs. The GPU can just do… more at once? For now. Or is that changing?

I had also assumed that loading a chunk from the cache was not free because I’ve seen cache eviction on my M1, but it’s good to know that it’s no longer as big of a limitation.

Also, I’m a big fan of your work! I played around with your ModernBERT CoreML port a while ago.

smpanaro 454 days ago

For single-batch inference of anything remotely LLM-like, you'll hit the memory bound way before FLOPs, so I haven't actually looked at FLOPs much. For raw performance, the GPU is certainly better. The ANE is more energy efficient, but you need larger batches to really benefit.
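
A rough back-of-the-envelope sketch of that point, in roofline style. The function name `decode_estimate` and every number in it are hypothetical (not measured ANE or GPU figures), and it assumes each weight is read once per forward pass at about 2 FLOPs per parameter per token:

```python
def decode_estimate(n_params, bytes_per_param, bandwidth, peak_flops, batch=1):
    """Roofline-style estimate for one decoding step.

    All inputs are hypothetical; none are measured ANE or GPU figures.
    """
    # Every weight is read from memory once per forward pass.
    t_memory = n_params * bytes_per_param / bandwidth
    # Roughly 2 FLOPs (multiply + add) per parameter per token.
    t_compute = 2 * n_params * batch / peak_flops
    bound = "memory" if t_memory >= t_compute else "compute"
    tokens_per_s = batch / max(t_memory, t_compute)
    return bound, tokens_per_s


# Made-up device: 100 GB/s bandwidth, 10 TFLOP/s, 1B-parameter fp16 model.
for batch in (1, 256):
    print(batch, decode_estimate(1e9, 2, 100e9, 10e12, batch))
```

With these made-up numbers, batch 1 is limited by weight reads and only a large batch becomes compute bound, which is why larger batches are where extra FLOPs start to matter.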

Maybe cache is the wrong word. This is a limit on how much can be mmap'd for the ANE at once. It's not too hard to hit on M1 if your model is in the GB range. Chunking the model into smaller pieces makes it more likely to "fit", but if it doesn't fit, you have to unmap/remap in each forward pass, which will be noticeable.
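
A toy model of the fit-or-remap behavior, purely illustrative. The function `remaps_per_pass` is hypothetical, and the least-recently-used eviction policy is an assumption, not documented ANE behavior; the sketch only shows why exceeding the limit costs something on every forward pass:

```python
from collections import OrderedDict


def remaps_per_pass(chunk_sizes_gb, limit_gb, passes=3):
    """Count how many chunks must be (re)mapped in a steady-state forward pass.

    Assumes chunks run in order each pass and that the least recently used
    chunk is unmapped first (an assumption, not documented behavior).
    """
    resident = OrderedDict()  # chunk index -> size, oldest first
    used = 0.0
    remaps = 0
    for _ in range(passes):
        remaps = 0  # keep only the count for the latest pass
        for i, size in enumerate(chunk_sizes_gb):
            if i in resident:
                resident.move_to_end(i)  # already mapped, no cost
                continue
            # Unmap old chunks until the new one fits under the limit.
            while resident and used + size > limit_gb:
                _, evicted = resident.popitem(last=False)
                used -= evicted
            resident[i] = size
            used += size
            remaps += 1
    return remaps


print(remaps_per_pass([1.0, 1.0, 1.0], limit_gb=3.5))
print(remaps_per_pass([1.0, 1.0, 1.0, 1.0], limit_gb=3.5))
```

Under these assumptions, a model that fits stays mapped after the first pass, while one that slightly exceeds the limit ends up remapping on every pass.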

Awesome to hear about ModernBERT! Big fan of your work as well :)

anemll 454 days ago

Right. I was thinking about it: you still need batch prefill. However, Apple's Core ML tools were failing on attention activation quantization. With long context, prefill is still compute bound.
