Hacker News
by morphle 538 days ago
Yes, knowing the exact CPU and ANE assembly instructions (or the underlying microcode!) allows general-purpose software to adaptively compile processes for all the core types, not just the CPU ones. It won't always be faster: you get more cache misses (some cores don't have cache), different DMA and thread scheduling, some registers can't fit the floats or large integers, etc.

But yes, it will be possible to use all 140 cores of the M2 Ultra or the 36 cores of the M4. There will be an M6 Extreme some day, maybe 500 cores?

Actually, the GPU and ANE cores themselves are built from teams of smaller cores, maybe a few dozen, hundreds or thousands in all, same as in most NVIDIA chips.

>A steal for $22k but I guess very niche for now...

A single iPhone or Mac app (a game, an LLM, pattern recognition, a security app, a VPN, de/encryption, a video encoder/decoder) that can be sped up by 80%-200% can afford my faster assembly-level API.

A whole series of hardware-level zero-day exploits for iPhone and Mac would become possible, and that won't be very niche at all. Reverse engineering the Apple Silicon instruction sets is worth millions.

1 comments

What would a "llvm compilable" hello world look like that matches the libc GPU example for "AGX" (Apple Graphics)? It's not possible from MacOS, right? It'd have to be done from Linux?
No, I don't think it is impossible for MacOS. I might be missing a detail here, not sure. I have to think it over.

I have seen [1] you can patch ANECompilerService, so you can even speed up existing code, because Apple compiles your code just in time (at runtime) on each machine. We could do that for MacOS libc too.

[1] Some how-to hints in https://discussions.apple.com/thread/254758525?sortBy=rank

How do you issue/execute "GPU" machine code instructions from MacOS not through Metal?
You (or your compiler) write the instructions and data into unified memory (up to 192 GB) and jump to the first instruction (usually of a loop) on each core. GPU and ANE processor cores are not fundamentally different from CPU cores; they just have fewer transistors (gates) and therefore more limitations in what a register can address and in which data types and instructions they can execute. Some cores can only execute the same instruction as their neighbor core in a team, but on different data, or at a different time, synchronized with their neighbors. But they are still Turing-complete processors, so in essence they are the same as their cousins, the CPU cores. Sometimes a core's input or output addresses are part of a pipeline between cores, which limits its address offsets.
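
To make the "same instruction as the neighbor core, but on different data" point concrete, here is a toy sketch in Python. It is only an illustration under my own assumptions: the three-opcode mini-ISA (`mov`, `mul`, `add`), the register names and `run_team` are all made up, and none of it corresponds to a real Apple GPU or ANE encoding.

```python
# Toy model of a "team" of cores running in lockstep: one shared instruction
# stream, one private register file per lane. The mini-ISA is hypothetical and
# is NOT a real Apple GPU/ANE instruction set.

def run_team(program, lane_inputs):
    """Run `program` in lockstep, one lane per input value; return each lane's r1."""
    # Each lane gets its own registers; r0 is preloaded with that lane's data.
    lanes = [{"r0": x, "r1": 0} for x in lane_inputs]

    for op, dst, src in program:      # a single program counter for the whole team
        for regs in lanes:            # every lane executes the same instruction
            # An operand is either a register name or an immediate integer.
            value = regs[src] if isinstance(src, str) else src
            if op == "mov":
                regs[dst] = value
            elif op == "mul":
                regs[dst] *= value
            elif op == "add":
                regs[dst] += value
            else:
                raise ValueError(f"unknown opcode: {op}")

    return [regs["r1"] for regs in lanes]


# Computes x*x + 1 on every lane with the same three instructions.
program = [
    ("mov", "r1", "r0"),  # r1 = x
    ("mul", "r1", "r0"),  # r1 = x * x
    ("add", "r1", 1),     # r1 = x * x + 1
]
result = run_team(program, [1, 2, 3, 4])
print(result)
```

The lanes never branch independently here, which is the limitation I mean: the team shares one instruction stream, and only the data differs per lane.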

MacOS only plays a role in allocating and protecting the instruction or data memory regions for the GPU and ANE processors.

I would like to discuss this more; I shot you an email at the address listed here.