by smpanaro
408 days ago
What do you mean by less wide? The main bottleneck for transformers is memory bandwidth, and the ANE has a much lower ceiling there than the CPU/GPU (yes, despite unified memory). Chunking is actually beneficial as long as all the chunks can fit into the ANE’s cache: it speeds up compilation for large network graphs, and cached loads have negligible cost. On M1 the cache limit is 3-4GB, but it is higher on M2+.
I had also assumed that loading a chunk from the cache was not free because I’ve seen cache eviction on my M1, but it’s good to know that it’s no longer as big of a limitation.
Also, I’m a big fan of your work! I played around with your ModernBERT CoreML port a while ago.