by jstmm 855 days ago

In both the '1-bit Model' and '2-bit Model' tables, the forward time (sec) for Llama2-7B with FP16 (the unquantized baseline) is 0.1 s, whereas it's ~0.231 s, ~0.257 s, and ~0.353 s for HQQ (1-bit), HQQ+ (1-bit), and Quip# (2-bit) respectively, meaning the FP16 model has roughly 2.3x to 3.5x lower inference time.

In contrast, in the BitNet b1.58 paper [0], the authors report that their 7B model has 2.9x lower inference latency.

It's not clear to me what's happening here. Can someone explain why the 1-bit and 2-bit HQQ/HQQ+ models are so much slower than the BitNet b1.58 models?

[0] https://arxiv.org/pdf/2402.17764.pdf

3 comments

londons_explore 855 days ago

GPUs aren't really designed for 1-bit math... On them, 1-bit operations don't run much faster than floating-point math.

A custom ASIC or an updated GPU design, on the other hand, could give massive speedups with 1-bit math.


UncleOxidant 855 days ago

Yes, exactly. Neither GPUs nor CPUs are set up for 1-bit math. Pulling 1 or 2 bits out of a word isn't all that straightforward on a CPU or GPU: it takes lots of shifting and masking. I wonder how long it's going to be before we see custom hardware for bitnets? I suspect we'll see it on FPGAs first.
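
As a rough illustration of the shifting and masking involved, here is a minimal Python sketch that packs sixteen 2-bit values into one 32-bit word and extracts them again. The helper names (`pack`, `unpack`) and the lane order (value 0 in the lowest bits) are assumptions made for this example, not anything taken from HQQ or BitNet.

```python
# Illustrative sketch only: 2-bit values stored 16 to a 32-bit word.
BITS = 2                  # width of each packed value
PER_WORD = 32 // BITS     # 16 values fit in one 32-bit word
MASK = (1 << BITS) - 1    # 0b11, selects a single value


def pack(values):
    """Pack up to 16 integers in [0, 3] into one 32-bit word."""
    if len(values) > PER_WORD:
        raise ValueError("too many values for one word")
    word = 0
    for i, v in enumerate(values):
        if not 0 <= v <= MASK:
            raise ValueError("value does not fit in 2 bits")
        word |= v << (i * BITS)   # shift the value into its lane
    return word


def unpack(word, count=PER_WORD):
    """Extract `count` 2-bit values: one shift and one mask per value."""
    return [(word >> (i * BITS)) & MASK for i in range(count)]


values = [0, 1, 2, 3] * 4
word = pack(values)
restored = unpack(word)
```

Every value read back costs a shift and a mask before any arithmetic can happen, which is the overhead being described.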


bee_rider 855 days ago

For 1-bit math, at least it should be possible to populate every other bit of an integer type, right? Surely one could do better with a dedicated type for this, but at least we could pack 16 single-bit weights into a 32-bit int for addition, right?
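
A minimal sketch of that idea, assuming each 1-bit weight gets its own 2-bit lane so that the sum of two weights cannot carry into the neighbouring lane. The names `spread` and `lanes` are hypothetical helpers for this example; a real kernel would do the same thing with native integer types.

```python
# Illustrative sketch only: 16 one-bit values in a 32-bit word,
# one value per 2-bit lane (every other bit left empty).
LANES = 16
LANE_BITS = 2
WORD_MASK = 0xFFFFFFFF


def spread(bits):
    """Place each 0/1 value in the low bit of its own 2-bit lane."""
    word = 0
    for i, b in enumerate(bits):
        word |= (b & 1) << (i * LANE_BITS)
    return word


def lanes(word):
    """Read the 16 lane values (each 0..3) back out of the word."""
    return [(word >> (i * LANE_BITS)) & 0b11 for i in range(LANES)]


a = [1, 0, 1, 1] * 4
b = [1, 1, 0, 1] * 4

# A single integer addition performs all 16 lane-wise additions at once.
total = (spread(a) + spread(b)) & WORD_MASK
sums = lanes(total)
```

With 2-bit lanes, at most three such words can be accumulated before a lane overflows, so longer sums would need wider lanes or periodic unpacking.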


imtringued 855 days ago

You're telling me GPUs aren't designed for additions and subtractions? Where did you hear that?


bick_nyers 854 days ago

I think they are more so saying that GPUs are not optimized for those operations. CPUs aren't "designed" for matrix multiplies, yet we can still run them, albeit at a slower rate than on a GPU.


shaklee3 854 days ago

A100 (> 5yo GPU) has a 1-bit tensor core engine


brucethemoose2 855 days ago

Real-world GPU performance is hugely influenced by hand optimization of the CUDA kernels.


thatguysaguy 855 days ago

Sounds like these guys didn't use custom kernels, but BitNet did.


mobicham 854 days ago

That's correct. Only the dequantization is done in CUDA; the matmul is done with PyTorch. If they open-sourced their kernels, we could reuse them!
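
To make the point concrete, here is a hedged pure-Python sketch of the dequantize-then-multiply pattern. It assumes a simple per-row scale and zero-point scheme (`w = (q - zero) * scale`); the function names and the tiny matrices are made up for illustration and are not HQQ's actual code.

```python
# Illustrative sketch only: 1-bit codes are expanded back to floats,
# then an ordinary floating-point matrix-vector product is run on them.

def dequantize(q_rows, scales, zeros):
    """Map integer codes back to floats, one scale and zero-point per row."""
    return [
        [(q - z) * s for q in row]
        for row, s, z in zip(q_rows, scales, zeros)
    ]


def matvec(weights, x):
    """Plain floating-point matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


q = [[0, 1, 1, 0],
     [1, 1, 0, 0]]      # 1-bit codes (hypothetical values)
scales = [2.0, 0.5]     # assumed per-row scales
zeros = [0.5, 0.5]      # assumed per-row zero-points

w = dequantize(q, scales, zeros)   # step 1: back to floating point
x = [1.0, 2.0, 3.0, 4.0]
y = matvec(w, x)                   # step 2: same float math as the baseline
```

The multiply runs on the dequantized floating-point weights, so the low bit width saves memory but not arithmetic, and the dequantization step is extra work on top.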
