by SparkyMcUnicorn
1023 days ago
34B Q4 will use around 20GB of memory. If it's running slow, make sure Metal is actually being used[0]. If it turns out it wasn't enabled, turning it on can give you as much as a 50-100% boost in tokens/s. I'm averaging 7 to 8 tokens/s on an M1 Max 10 core (24 GPU cores).

[0] If using llama-cpp-python (or text-generation-webui, ollama, etc.), try: `pip uninstall llama-cpp-python && CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python`
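For anyone wondering where the ~20GB figure comes from, here's a rough back-of-the-envelope sketch. It assumes the Q4_0 layout (32 4-bit weights plus one 16-bit scale per 18-byte block, so 4.5 bits per weight); other Q4 variants differ a bit. The function name is just something I made up for illustration, and it only counts the weights, not the KV cache or runtime overhead.

```python
def estimate_weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: Q4_0 stores 32 weights in an 18-byte block
# (16 bytes of 4-bit values + a 2-byte scale), i.e. 4.5 bits per weight.
Q4_0_BITS_PER_WEIGHT = 18 * 8 / 32

weights_gb = estimate_weight_memory_gb(34e9, Q4_0_BITS_PER_WEIGHT)
print(f"weights only: {weights_gb:.1f} GB")
```

That lands at about 19GB for the weights alone, so "around 20GB" once you add context buffers seems about right.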
However, when I run the LLM, macOS becomes sluggish. I assume this is because the GPU is utilized to the point where hardware-accelerated rendering of the UI slows down due to insufficient resources.
I wonder if there's a way to avoid that slowdown?