|
|
|
|
|
by roosgit
505 days ago
|
|
Start with r/LocalLLama and r/StableDiffusion. Look for benchmarks for various GPUs. I have an RTX 3060(12GB) and 32GB RAM. Just ran Qwen2.5-14B-Instruct-Q4_K_M.gguf in llama.cpp with flash attention enabled and 8K context. I get get 845t/s for prompt processing and 25t/s for generation. For a while I even ran llama.cpp without a GPU (don't recommend it for diffusion) and with the same model (Qwen2.5 14B) I would get 11t/s for processing and 4t/s for generation. Acceptable for chats with short questions/instructions and answers. |
|