by jmanhype
108 days ago
I'm not an ML engineer. I used Claude Code (Opus 4.6) to get LoRA fine-tuning gradients running on Apple's Neural Engine, the dedicated ML chip in every Apple Silicon Mac, which has no public training API. Result: 192 gradient dispatches, zero GPU fallbacks, converging loss, all at ~2.8W. Three discoveries came from iterating on real hardware: (1) ANE's matmul op compiles but never executes, so everything must be rewritten as 1x1 convolutions, (2) spatial dimensions must be multiples of 16, (3) the ANE compiler leaks handles and silently fails after ~119 compiles. Built on maderix's ANE reverse engineering work. The repo includes the full MIL kernel generator, subprocess isolation for the compile limit, and integration with MLX for hybrid GPU+ANE training.
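To illustrate why discoveries (1) and (2) are workable: a matmul over a `[S, C_in]` activation is numerically the same as a 1x1 convolution over a `[C_in, 1, S]` feature map, and the spatial axis can be zero-padded up to a multiple of 16 and cropped afterwards. The sketch below is plain Python, not the repo's MIL kernel generator. The function names, the tensor layout, and the choice of zero-padding are all illustrative assumptions.

```python
def pad_to_multiple(n, multiple=16):
    """Round n up to the next multiple (ceiling division)."""
    return -(-n // multiple) * multiple


def matmul(x, w):
    """Reference x @ w for x: [S][C_in] and w: [C_in][C_out]."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def to_conv_input(x, multiple=16):
    """Lay x [S][C_in] out as a [C_in][1][S_pad] feature map.

    Assumption: the multiple-of-16 rule is met by zero-padding the
    spatial (last) axis; the padding is cropped after the convolution.
    """
    s, c_in = len(x), len(x[0])
    s_pad = pad_to_multiple(s, multiple)
    return [[[x[i][c] if i < s else 0.0 for i in range(s_pad)]] for c in range(c_in)]


def to_conv_kernel(w):
    """Transpose w [C_in][C_out] into a kernel [C_out][C_in] (1x1 spatial dims implied)."""
    return [list(col) for col in zip(*w)]


def conv1x1(fmap, kernel):
    """1x1 convolution: mixes channels independently at every spatial position."""
    c_in, h, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [
        [[sum(k_row[c] * fmap[c][y][i] for c in range(c_in)) for i in range(width)]
         for y in range(h)]
        for k_row in kernel
    ]


def matmul_via_conv(x, w):
    """Compute x @ w using only the 1x1 convolution, then drop the padding."""
    out = conv1x1(to_conv_input(x), to_conv_kernel(w))
    return [[out[o][0][i] for o in range(len(out))] for i in range(len(x))]


x = [[1, 2], [3, 4], [5, 6]]      # S=3 tokens, C_in=2
w = [[1, 0, 2], [0, 1, 3]]        # C_in=2, C_out=3
assert matmul_via_conv(x, w) == matmul(x, w)
```

The padded positions only ever produce zeros, so cropping back to the original `S` rows recovers the exact matmul result.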
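The subprocess isolation mentioned for the compile limit could look roughly like the following: split the compile jobs into batches that stay under the failure threshold and run each batch in a fresh process, so any leaked handles are released when that process exits. This is a hedged sketch using only Python's standard library. The `CHILD` stub, the function names, and the budget of 100 (a margin below the ~119 reported) are hypothetical, and the stub does not call any ANE compiler.

```python
import json
import subprocess
import sys

# Assumed safety margin below the ~119-compile failure point reported in the post.
COMPILE_BUDGET = 100

# Hypothetical worker program. A real worker would compile each kernel;
# this stub only echoes the jobs back along with its process id.
CHILD = """
import json, os, sys
jobs = json.load(sys.stdin)
json.dump({"pid": os.getpid(), "done": jobs}, sys.stdout)
"""


def batches(jobs, budget=COMPILE_BUDGET):
    """Split jobs into consecutive chunks of at most `budget` items."""
    return [jobs[i:i + budget] for i in range(0, len(jobs), budget)]


def run_isolated(jobs, budget=COMPILE_BUDGET):
    """Run each batch in a fresh interpreter so per-process leaks cannot accumulate."""
    results = []
    for batch in batches(jobs, budget):
        proc = subprocess.run(
            [sys.executable, "-c", CHILD],
            input=json.dumps(batch),
            capture_output=True,
            text=True,
            check=True,  # raise if the worker exits with a non-zero status
        )
        results.append(json.loads(proc.stdout))
    return results
```

Calling `run_isolated(kernel_names)` with more jobs than the budget starts one process per batch, so no single process ever performs more than `budget` compiles.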