
The thing with efficiency is that it is relative to both inference and training compute. If you quantize after training, you need a more powerful, higher-precision model to quantize from, and that model doesn't exist if you are trying to create a frontier model. In that case the only question is whether you get better inference and/or training performance from training at low precision natively, e.g. a native 1-bit model.
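To make the dependency concrete, here is a minimal sketch of post-training quantization in plain Python. Everything in it is an illustrative assumption rather than any lab's actual method: the helper names (`quantize_int8`, `quantize_1bit`), the symmetric per-tensor scaling, and the random Gaussian values standing in for trained weights. The point is only that both quantizers take already-trained full-precision weights as input.

```python
import random


def quantize_int8(weights):
    """Symmetric per-tensor 8-bit quantization: w ~ scale * q, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def quantize_1bit(weights):
    """Sign quantization: w ~ scale * sign(w), with scale = mean(|w|)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    q = [1 if w >= 0 else -1 for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map quantized values back to floats."""
    return [scale * v for v in q]


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical stand-in for the weights of an already-trained
# higher-precision model; post-training quantization cannot start without them.
random.seed(0)
full_precision = [random.gauss(0.0, 0.02) for _ in range(1000)]

q8, s8 = quantize_int8(full_precision)
q1, s1 = quantize_1bit(full_precision)

err8 = mse(full_precision, dequantize(q8, s8))
err1 = mse(full_precision, dequantize(q1, s1))
print(f"8-bit reconstruction MSE: {err8:.3e}")
print(f"1-bit reconstruction MSE: {err1:.3e}")
```

The 1-bit reconstruction error comes out much larger than the 8-bit one here, which is one reason to ask whether training natively at 1 bit does better than quantizing down after the fact.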

Currently the optimal training precision seems to be 8-bit (at least that is what DeepSeek and some other open-weight companies use). But this might change with training methods optimized for 1-bit training, like the one in this paper I linked before: https://proceedings.neurips.cc/paper_files/paper/2024/hash/7...