by jstmm
810 days ago
In both the '1-bit Model' and '2-bit Model' tables, the forward time (sec) for Llama2-7B with FP16 (full precision) is 0.1 s, whereas it's ~0.231, ~0.257, and ~0.353 s for HQQ (1-bit), HQQ+ (1-bit), and Quip# (2-bit) respectively, meaning the FP16 model has roughly 2.3x to 3.5x lower inference time. By contrast, in the BitNet b1.58 paper [0] the authors report that their 7B model has 2.9x lower inference latency. It's not clear to me what's happening here. Can someone explain why the 1-bit/2-bit HQQ/HQQ+ models are so much slower than the BitNet b1.58 models?

[0] https://arxiv.org/pdf/2402.17764.pdf
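For reference, here is a quick sketch of the arithmetic behind that comparison. It uses only the forward times quoted in this comment; the variable and function names are made up for illustration and do not come from the HQQ or BitNet code.

```python
# Forward times (seconds) for Llama2-7B, as quoted in the comment.
fp16_time = 0.1
quantized_times = {
    "HQQ (1-bit)": 0.231,
    "HQQ+ (1-bit)": 0.257,
    "Quip# (2-bit)": 0.353,
}


def slowdown(quantized: float, baseline: float) -> float:
    """Return how many times slower the quantized forward pass is than the baseline."""
    return quantized / baseline


# Ratio of each quantized forward time to the FP16 forward time.
slowdowns = {name: slowdown(t, fp16_time) for name, t in quantized_times.items()}

for name, ratio in slowdowns.items():
    print(f"{name}: {ratio:.2f}x slower than FP16")
```

The ratios come out at about 2.3x, 2.6x, and 3.5x, so "~3x" is a fair summary only for the upper end of the range.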
Whereas a custom ASIC or an updated GPU design could give massive speedups with 1-bit math.