by jstmm 810 days ago
In both the '1-bit Model' and '2-bit Model' tables, the forward time for Llama2-7B in FP16 (full precision) is 0.1 s, whereas it is ~0.231 s, ~0.257 s, and ~0.353 s for HQQ (1-bit), HQQ+ (1-bit), and Quip# (2-bit) respectively. In other words, the FP16 model is roughly 2.3x to 3.5x faster at inference.

In contrast, in the BitNet b1.58 paper [0] the authors report that their 7B model has 2.9x lower inference latency.

It's not clear to me what's happening here. Can someone explain why the 1-bit and 2-bit HQQ/HQQ+ models are so much slower than the BitNet b1.58 models?

[0] https://arxiv.org/pdf/2402.17764.pdf


GPUs aren't really designed for 1-bit math... They don't perform it much faster than floating-point math.

Whereas a custom ASIC or an updated GPU design could give massive speedups with 1-bit math.

Yes, exactly. Neither GPUs nor CPUs are set up for 1-bit math. Pulling 1 or 2 bits out of a word isn't all that straightforward on a CPU or GPU: it takes lots of shifting and masking. I wonder how long it's going to be before we see custom hardware for bitnets? I suspect we'll see it on FPGAs first.
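To make the shifting and masking concrete, here is a minimal sketch in plain Python of packing sixteen 2-bit codes into one 32-bit word and pulling them back out. The function names `pack_2bit` and `unpack_2bit` and the little-endian lane order are assumptions for illustration, not the layout used by HQQ, Quip#, or BitNet.

```python
def pack_2bit(values):
    """Pack 2-bit codes (0..3) into 32-bit words, 16 codes per word.

    Hypothetical layout: code j of each group sits at bits 2*j and 2*j+1.
    """
    words = []
    for i in range(0, len(values), 16):
        word = 0
        for j, v in enumerate(values[i:i + 16]):
            if not 0 <= v <= 3:
                raise ValueError("each code must fit in 2 bits")
            word |= v << (2 * j)  # shift the code into its lane
        words.append(word)
    return words


def unpack_2bit(words, count):
    """Recover the first `count` 2-bit codes from the packed words."""
    out = []
    for word in words:
        for j in range(16):
            out.append((word >> (2 * j)) & 0b11)  # shift down, mask off 2 bits
    return out[:count]


codes = [0, 1, 2, 3] * 5          # 20 codes need two 32-bit words
packed = pack_2bit(codes)
restored = unpack_2bit(packed, len(codes))
```

Every weight costs a shift and a mask on the way out, which is the overhead described above when the hardware has no native sub-byte type.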
For 1-bit math, at least it should be possible to populate every other bit of an integer type, right? Surely one could do better with a dedicated type for this, but at least we could pack 16 single-bit weights into a 32-bit int for addition, right?
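A rough sketch of that every-other-bit idea, in plain Python: each single-bit value gets a 2-bit lane, so one ordinary integer addition performs 16 independent lane additions. The helper names `spread_bits` and `lane_sums` are made up for this example, and the trick only holds while each lane's sum stays at 3 or below; a fourth addend could carry into the neighbouring lane.

```python
MASK32 = 0xFFFFFFFF  # keep results within a 32-bit word


def spread_bits(bits):
    """Place up to 16 single-bit values on the even bit positions of a word."""
    word = 0
    for j, b in enumerate(bits[:16]):
        word |= (b & 1) << (2 * j)  # the odd bit of each lane stays 0 as carry room
    return word


def lane_sums(word):
    """Read back the 16 two-bit lanes of a word."""
    return [(word >> (2 * j)) & 0b11 for j in range(16)]


a = spread_bits([1, 0, 1, 1] + [0] * 12)
b = spread_bits([1, 1, 0, 1] + [0] * 12)
total = (a + b) & MASK32          # one add, 16 lane-wise sums
first_four = lane_sums(total)[:4]
```

Here `first_four` holds the per-lane sums of the first four positions, computed by a single 32-bit addition rather than four separate ones.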
You're telling me GPUs aren't designed for additions and subtractions? Where did you hear that?
I think they are more so saying that GPUs are not optimized for those operations. CPUs aren't "designed" for matrix multiplies, yet we can still run them, albeit at a slower rate than on a GPU.
A100 (> 5yo GPU) has a 1-bit tensor core engine
Real world GPU performance is hugely influenced by hand optimization of the CUDA kernels.
Sounds like these guys didn't use custom kernels, but BitNet did.
That's correct. Only the dequantization is done in CUDA; the matmul is done with PyTorch. If they open-sourced their kernels, we could reuse them!
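A minimal NumPy sketch of why that split gives no speedup over FP16: the low-bit codes are first expanded back into a full FP16 weight matrix, and then an ordinary dense matmul runs on it. The affine form `(w_q - zero) * scale` and the names `dequantize` and `quantized_linear` are assumptions for illustration, not the actual HQQ code path.

```python
import numpy as np


def dequantize(w_q, scale, zero):
    """Generic affine dequantization of integer codes to float16 weights."""
    return ((w_q.astype(np.float32) - zero) * scale).astype(np.float16)


def quantized_linear(x, w_q, scale, zero):
    w = dequantize(w_q, scale, zero)  # step 1: rebuild the full fp16 matrix
    return x @ w.T                    # step 2: same dense matmul as plain fp16


w_q = np.array([[0, 1], [1, 0]], dtype=np.uint8)  # 1-bit codes, stored unpacked here
x = np.array([[1.0, 2.0]], dtype=np.float16)
y = quantized_linear(x, w_q, scale=2.0, zero=0.5)
```

Step 2 does exactly the work of the unquantized layer, so step 1 is pure added latency; the savings are in memory footprint. A fused kernel that consumes the packed codes directly is what would be needed to turn the smaller weights into lower latency.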