Hacker News new | ask | show | jobs
by Yenrabbit 833 days ago
We've got a technical post with all the juicy details coming next week. But that bit refers to packing the 4-bit weights into a type FSDP is happy to shard (like float16 or float32) which matches the other non-quantized bits of the model. This way FSDP will happily wrap and shard all the parameters as if they were just normal floats.