|
|
|
|
|
by pedrovhb
1054 days ago
|
|
I've been playing with running some models on the free tier Oracle VM machines with 24GB RAM and Ampere CPU and it works pretty well with llama.cpp. It's actually surprisingly quick; speed doesn't scale too well with the number of threads on CPU, so even the 4 ARM64 cores on that VM, with NEON, run at a similar speed to my 24-core Ryzen 3850X (maybe about half reading speed). It can easily handle Llama 2 13B, and if I recall correctly I did manage to run a 30B model in the past too. Speed for the smaller ones is ~half reading speed or so. It's a shame the current Llama 2 jumps from 13B to 70B. In the past I tried running larger stuff by making a 32GB swap volume, but it's just impractically slow. |
|
Also its really tricky to even build llama.cpp with a BLAS library, to make prompt ingestion less slow. The Oracle Linux OpenBLAS build isnt detected ootb, and it doesn't perform well compared to x86 for some reason.
LLVM/GCC have some kind of issue identifying the Ampere ARM architecture (march=native doesn't really work), so maybe this could be improved with the right compiler flags?