by geerlingguy
169 days ago
I've just tried replicating this on my Pi 5 16GB, running the latest llama.cpp... and it segfaults:
./build/bin/llama-cli -m "models/Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.70bpw.gguf" -e --no-mmap -t 4
...
Loading model... -ggml_aligned_malloc: insufficient memory (attempted to allocate 24576.00 MB)
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 25769803776
alloc_tensor_range: failed to allocate CPU buffer of size 25769803776
llama_init_from_model: failed to initialize the context: failed to allocate buffer for kv cache
Segmentation fault
I'm not sure how they're running it... any kind of guide for replicating their results? It does take up a little over 10 GB of RAM (watching with btop) before it segfaults and quits.
[Edit: had to add -c 4096 to cut down the context size, now it loads]
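For anyone wondering why -c 4096 helps: the failed allocation in the log is for the KV cache, and its size scales linearly with context length. Here is a rough sketch of that arithmetic. The model parameters (48 layers, 4 KV heads, head dimension 128, a 262144-token default context, f16 cache entries) are my assumptions about this model and are not stated in the thread; with them, the result happens to match the 25769803776 bytes in the log.

```python
# Back-of-the-envelope KV cache size estimate.
# All model parameters below are ASSUMPTIONS, not values taken from the thread.
N_LAYERS = 48            # assumed number of transformer layers
N_KV_HEADS = 4           # assumed number of key/value heads (GQA)
HEAD_DIM = 128           # assumed dimension per head
BYTES_PER_ELEM = 2       # assumed f16 cache entries
DEFAULT_CTX = 262144     # assumed default context length of the model


def kv_cache_bytes(n_ctx: int) -> int:
    """Bytes needed to hold keys and values for n_ctx tokens."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM  # 2 = K and V
    return per_token * n_ctx


def to_mib(n_bytes: int) -> float:
    """Convert bytes to MiB, the unit the llama.cpp log prints as MB."""
    return n_bytes / (1024 * 1024)


if __name__ == "__main__":
    for ctx in (DEFAULT_CTX, 4096):
        size = kv_cache_bytes(ctx)
        print(f"ctx={ctx}: {size} bytes ({to_mib(size):.2f} MiB)")
    # ctx=262144: 25769803776 bytes (24576.00 MiB)
    # ctx=4096: 402653184 bytes (384.00 MiB)
```

Under those assumptions, the full-context cache alone is 24 GiB, which cannot fit in 16 GB of RAM, while a 4096-token context needs only about 384 MiB on top of the model weights.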
|
llama-server -m /Qwen3-30B-A3B-Instruct-2507-GGUF:IQ3_S --jinja -c 4096 --host 0.0.0.0 --port 8033
Got <= 10 t/s, which I think is not so bad!
On an AMD Ryzen 5 5500U with Radeon Graphics, compiled for Vulkan: got 15 t/s (could swear this morning it was <= 20 t/s).
On an AMD Ryzen 7 H 255 w/ Radeon 780M Graphics, compiled for Vulkan: got 40 t/s.
On the last one I did a quick comparison with the unsloth version, unsloth/Qwen3-30B-A3B-GGUF:Q4_K_M, and got 25 t/s. Can't really comment on quality of output; it seems similar.