by genewitch
460 days ago
My desktop GPU can run small models at 185 tokens a second, and larger models with speculative decoding at 50 t/s. With a small, fine-tuned model as the draft model, inference won't take much power at all. Training costs more, sure, but that's buy once, cry once. I don't think this is a good idea, but the energy used for parsing isn't the reason.
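
For a rough sense of scale, the throughput figures can be turned into energy per token. This is only a back-of-envelope sketch: the 300 W power draw is a hypothetical placeholder, not a figure from the comment, and the function names are made up for illustration. Swap in a measured wattage for a real estimate.

```python
# Back-of-envelope energy per generated token.
# ASSUMPTION: 300 W is a hypothetical GPU power draw under load,
# not a measurement and not a figure from the original comment.
ASSUMED_GPU_WATTS = 300.0


def joules_per_token(power_watts: float, tokens_per_second: float) -> float:
    """A watt is one joule per second, so J/token = W / (tokens/s)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return power_watts / tokens_per_second


def watt_hours_per_1k_tokens(power_watts: float, tokens_per_second: float) -> float:
    """Energy for 1,000 tokens, converted from joules to watt-hours (3600 J = 1 Wh)."""
    return joules_per_token(power_watts, tokens_per_second) * 1000 / 3600


# Throughput figures quoted in the comment.
rates = {
    "small model": 185.0,
    "larger model + speculative decoding": 50.0,
}

for label, tps in rates.items():
    j = joules_per_token(ASSUMED_GPU_WATTS, tps)
    wh = watt_hours_per_1k_tokens(ASSUMED_GPU_WATTS, tps)
    print(f"{label}: {j:.2f} J/token, {wh:.2f} Wh per 1k tokens")
```

Under that assumed wattage, even the slower 50 t/s case works out to a few joules per token, which is the scale behind the claim that inference power isn't the real objection.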