by Twirrim
130 days ago
I get tolerable performance out of a quantized gpt-oss 20b on an old RTX 3050 I have kicking around (I want to say 20-30 tokens/s, or faster when the cache is effective). It's appreciably faster on the 4060. On the 3050 it's not quite ideal for more interactive agentic coding, but it's approaching it, and it fits nicely into "coding in the background while I fiddle on something else" territory.

https://github.com/ggml-org/llama.cpp/discussions/15396 is a guide for running gpt-oss on llama-server, with settings for various amounts of GPU memory, from 8GB on up.