by netr0ute 1459 days ago
Parent never claimed it was going to be fast.

It would probably just fail with an error "[some function] not implemented for 'Half'"
fp16 models run inference just fine in fp32. I was sorta joking in my original comment, though: it could take weeks for this to run a single input. You're better off trying to make something like Hugging Face Accelerate work (like the comment above), which swaps layers of the model on and off the disk.
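
To make the "swaps layers on and off the disk" idea concrete, here is a toy sketch in plain Python. It is not how Accelerate is implemented and it uses none of its API; the names (`offload_layers`, `run_offloaded`, `apply_layer`, `demo`) are made up for illustration, and the "layers" are just small matrices. The only point is that each layer is read from disk right before it is needed, so a single layer sits in memory at a time, which is also why this approach is slow.

```python
import os
import pickle
import tempfile


def offload_layers(layers, directory):
    """Write each layer's weights to its own file and return the file paths."""
    paths = []
    for i, weights in enumerate(layers):
        path = os.path.join(directory, f"layer_{i}.pkl")
        with open(path, "wb") as f:
            pickle.dump(weights, f)
        paths.append(path)
    return paths


def apply_layer(weights, x):
    """Toy 'layer': a plain matrix-vector product (no bias, no activation)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def run_offloaded(paths, x):
    """Run x through the layers, loading one layer from disk at a time."""
    for path in paths:
        with open(path, "rb") as f:
            weights = pickle.load(f)  # disk read on every layer: the slow part
        x = apply_layer(weights, x)
        del weights  # drop the layer before loading the next one
    return x


def demo():
    # Hypothetical two-layer "model" with 2x2 weight matrices.
    layers = [
        [[1.0, 0.0], [0.0, 2.0]],
        [[0.5, 0.5], [1.0, -1.0]],
    ]
    with tempfile.TemporaryDirectory() as d:
        paths = offload_layers(layers, d)
        return run_offloaded(paths, [3.0, 4.0])
```

Calling `demo()` pushes the input through both layers while only one weight matrix is resident at any point. A real model would pay that disk read for every layer on every forward pass.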