Why use --fit on on an M4? My understanding was that given the unified memory, you should push all layers to the GPU with --n-gpu-layers all. Setting --flash-attn on and --no-mmap may also get you better results.
Meaningless question, fit will put everything on the gpu if it fits. Fa is default on. No-mmap is not an inference tradeoff and if you do turn it off you need to turn on direct io via -dio
What he should actually do is enable speculative decoding
What he should actually do is enable speculative decoding