by pgkr
497 days ago
Are those outputs actually from the 671B model? The 671B model needs 8xH200 GPUs at minimum, which costs $25/hr to rent. If you didn't pay that much, you were not running R1, but rather Qwen- or LLaMA-based distillations. We paid that much to rent a machine to run the full 671B model!
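
As a rough sanity check on the hardware claim, here is a back-of-the-envelope sketch of the weight memory alone. The numbers are assumptions, not from the comment: 1 byte per parameter (FP8 weights), 141 GB of memory per H200, 80 GB per H100, and no allowance for KV cache or activations. The helper names are made up for illustration.

```python
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight size in GB: 1e9 params * bytes each / 1e9 bytes per GB."""
    return params_billion * bytes_per_param


def fits(params_billion: float, bytes_per_param: float,
         gpus: int, gb_per_gpu: float) -> bool:
    """True if the weights alone fit in the pooled GPU memory (ignores KV cache)."""
    return weights_gb(params_billion, bytes_per_param) <= gpus * gb_per_gpu


if __name__ == "__main__":
    # Assumed figures: 671B params, FP8 (1 byte) or BF16 (2 bytes) weights.
    print("FP8 weights (GB):", weights_gb(671, 1))
    print("8x H200 @ 141 GB, FP8:", fits(671, 1, 8, 141))
    print("8x H100 @ 80 GB, FP8:", fits(671, 1, 8, 80))
    print("8x H200 @ 141 GB, BF16:", fits(671, 2, 8, 141))
```

Under those assumptions the weights fit on one 8xH200 node but not on one 8xH100 node, which is consistent with the "8xH200 at minimum" figure for a single machine.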
I've heard there are some optimizations for CPU inference straight from storage, which should make it a tad "less slow".
Time to split that RAM among your CPU cores and mmap blocks of weights for inference from storage.
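
For anyone curious what "mmap blocks of weights" looks like in practice, here is a toy sketch using only the Python standard library. It is not how any particular inference engine does it. The file layout (raw little-endian float32 rows), the tiny matrix shape, and all function names are assumptions for illustration, and the per-core split is left out.

```python
import mmap
import os
import struct
import tempfile

ROWS, COLS = 8, 4        # toy weight matrix shape (hypothetical)
BLOCK_ROWS = 2           # rows pulled from the mapping per step
ROW_BYTES = COLS * 4     # float32 = 4 bytes per value


def write_toy_weights(path: str) -> None:
    """Write a ROWS x COLS float32 matrix where row r is filled with the value r."""
    with open(path, "wb") as f:
        for r in range(ROWS):
            f.write(struct.pack(f"<{COLS}f", *[float(r)] * COLS))


def matvec_from_storage(path: str, x: list) -> list:
    """Compute W @ x while reading W block by block through a read-only mmap.

    The OS pages in only the bytes that are touched, so the whole file
    never has to be loaded into RAM at once.
    """
    out = []
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for start in range(0, ROWS, BLOCK_ROWS):
            block = mm[start * ROW_BYTES:(start + BLOCK_ROWS) * ROW_BYTES]
            for i in range(BLOCK_ROWS):
                row = struct.unpack_from(f"<{COLS}f", block, i * ROW_BYTES)
                out.append(sum(w * xi for w, xi in zip(row, x)))
    return out


def demo() -> list:
    """Build the toy file in a temp directory and run the block-wise mat-vec."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "weights.bin")
        write_toy_weights(path)
        return matvec_from_storage(path, [1.0] * COLS)


if __name__ == "__main__":
    print(demo())
```

With real weights the blocks would be far larger and each core would work on its own range of rows, but the idea is the same: the page cache decides what stays in RAM, and anything that falls out gets re-read from storage, which is where the "slow" comes from.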