by zozbot234
91 days ago
With this work you can run a medium-sized model like GPT OSS 20b at native speed while keeping those 32GB of RAM almost fully available for other uses. The model seamlessly slows down as RAM demand increases elsewhere in the system and the fs cache has to evict more expert layers, then returns to full speed once that RAM is freed up. This adds a key measure of flexibility to the existing local AI inference picture.
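
To make the mechanism concrete, here is a minimal sketch of file-backed expert weights. It assumes the weights are memory-mapped read-only from disk, which the comment implies but does not state. The file layout, `EXPERT_SIZE`, `NUM_EXPERTS`, and the helper names are all hypothetical and not taken from the project being discussed. The point is only that pages of a read-only mapping live in the OS page cache, so the kernel can drop them under memory pressure and re-read them from disk on the next access.

```python
import mmap
import os
import tempfile

# Hypothetical layout: experts stored back to back, each EXPERT_SIZE bytes.
EXPERT_SIZE = 4096
NUM_EXPERTS = 8


def write_fake_weights(path):
    """Write a stand-in weights file; expert i is filled with byte value i."""
    with open(path, "wb") as f:
        for i in range(NUM_EXPERTS):
            f.write(bytes([i]) * EXPERT_SIZE)


def read_expert(mm, index):
    """Return one expert's bytes from the mapping.

    Touching the slice faults the pages in from the page cache, or from
    disk if the kernel evicted them earlier under memory pressure.
    """
    start = index * EXPERT_SIZE
    return mm[start:start + EXPERT_SIZE]


def demo():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_fake_weights(path)
        with open(path, "rb") as f:
            # Read-only mapping: pages are clean, so the OS can discard
            # them at any time without writing anything back.
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            try:
                return read_expert(mm, 3)
            finally:
                mm.close()
    finally:
        os.remove(path)
```

Calling `demo()` returns the bytes of expert 3. In a real runtime only the experts actually routed to get touched, so cold experts are the first pages the kernel reclaims when other programs need the RAM.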