by danielhanchen
808 days ago
I don't disagree. Fair point, there definitely is a transfer latency overhead. I'd suspect one would have to prefetch it by calling `.to("cuda", non_blocking=True)`, say, 2 layers ahead, so you can in theory hide the movement. I think the blog did mention somewhere that HQQ at 1 bit is slower for now, maybe due to the transfer overhead, although I can't remember exactly where.
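To make the prefetch idea concrete, here is a rough PyTorch sketch. It is not what HQQ or the blog actually does; `forward_with_prefetch` and `lookahead` are names made up for illustration. It assumes the layers start on the CPU and, for the copy to really overlap with compute, that their weights sit in pinned memory. Without a GPU it just falls back to plain synchronous moves.

```python
import torch
import torch.nn as nn


def forward_with_prefetch(layers, x, device, lookahead=2):
    """Run `layers` in order on `device`, starting each layer's copy
    `lookahead` layers before it is needed (hypothetical helper)."""
    use_cuda = device.type == "cuda"
    # A side stream lets host-to-device copies overlap with compute.
    copy_stream = torch.cuda.Stream() if use_cuda else None
    ready = {}  # layer index -> event recorded after that layer's copy

    def prefetch(i):
        if i >= len(layers):
            return
        if not use_cuda:
            layers[i].to(device)  # nothing to overlap on CPU
            return
        with torch.cuda.stream(copy_stream):
            # Only truly asynchronous if the CPU weights are pinned.
            layers[i].to(device, non_blocking=True)
            event = torch.cuda.Event()
            event.record(copy_stream)
            ready[i] = event

    # Warm up: queue the copies for the first `lookahead` layers.
    for i in range(min(lookahead, len(layers))):
        prefetch(i)

    x = x.to(device)
    for i, layer in enumerate(layers):
        prefetch(i + lookahead)  # queue the copy for a later layer
        if use_cuda:
            # Compute must not start before this layer's copy has finished.
            torch.cuda.current_stream().wait_event(ready[i])
        x = layer(x)
    return x


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    layers = [nn.Linear(16, 16) for _ in range(6)]
    with torch.no_grad():
        out = forward_with_prefetch(layers, torch.randn(4, 16), device)
    print(out.shape)
```

Whether this actually hides the transfer depends on each layer's compute taking at least as long as copying a layer 2 steps ahead; for very small (e.g. 1-bit) weights with separate metadata the copy may still dominate.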
|
Which is fine, and it's a valid feature, but you don't need to split those bytes into "data" and "metadata" to make that happen.
Is there actually something they gain from this particular method of splitting?