Also not familiar with llamafiles, but if it uses llama.cpp under the hoods, it can probably make use of mmap to avoid fully loading on each run. If the GPU on Macs can access the mmapped file, then it would be fast.
Author here. It does make use of mmap(). I worked on adding mmap() support to llama.cpp back in March, specifically so I could build things like Emacs Copilot. See: https://github.com/ggerganov/llama.cpp/pull/613 Recently I've been working with Mozilla to create llamafile, so that using llama.cpp can be even easier. We've also been upstreaming a lot of bug fixes too!