by regularfry
99 days ago
It's not clear where the efficiency frontier actually is. We're good at measuring size and good at measuring FLOPs, but we're really not very good at measuring capability. Because of that, we don't really know yet whether we can do meaningfully better at 1 bit per parameter than we currently get out of quantising down to that size. The answer is probably yes, but it's going to be a while before anyone working at 1 bit per param has sunk as many FLOPs into it as the frontier labs have at higher bit counts.
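To make "quantising down to that size" concrete, here's a rough sketch of the crudest possible post-training 1-bit scheme: keep only the sign of each weight plus one shared scale. The mean-absolute-value scale and the helper names (`binarize`, `dequantize`, `storage_bytes`) are my own assumptions for illustration, not what any particular lab or paper does. It shows the part we can measure easily (size and reconstruction error) and says nothing about capability, which is the hard part.

```python
import random


def binarize(weights):
    """Reduce each weight to its sign, with one shared scale (mean |w|)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return signs, scale


def dequantize(signs, scale):
    """Reconstruct approximate weights from signs and the shared scale."""
    return [s * scale for s in signs]


def storage_bytes(n_params, bits_per_param):
    """Bytes needed for the weights alone, rounded up (ignores the scale)."""
    return (n_params * bits_per_param + 7) // 8


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical toy "layer": 4096 Gaussian weights, seeded for repeatability.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]

signs, scale = binarize(weights)
approx = dequantize(signs, scale)

print("16-bit bytes:", storage_bytes(len(weights), 16))
print("1-bit bytes: ", storage_bytes(len(weights), 1))
print("reconstruction MSE:", mean_squared_error(weights, approx))
```

The size win is trivially 16x here. What the MSE number means for downstream capability is exactly the thing we don't have a good measure for.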
|
Currently the optimal training precision seems to be 8-bit (at least, that's what DeepSeek and some other open-weight companies use). But this might change with training methods optimized for 1-bit training, like the one in this paper I linked before: https://proceedings.neurips.cc/paper_files/paper/2024/hash/7...