Hacker News
by a_wild_dandan 904 days ago
Are people still rawdoggin' 16-bit models? I almost exclusively use 5-bit inference quants (or 8-bit natives like Yi-34b) on my MacBook Pro. Tiny accuracy loss, runs fast, and leaves plenty of (V)RAM on the table. Mixtral 8x7B is my new daily driver, and only takes like 40GB to run! I wonder if I could run two of them talking to each other...
2 comments

Pure 16bit is horrible for training, sorry.
Doesn't using bf16 alleviate the problem? At least I've had success training a BERT-like model from scratch.
Mixed precision is the default method for pretraining and full fine-tuning right now. It is especially good for transformers, because they have a memory bottleneck in activations (outputs of intermediate layers stored for backprop), and running the forward pass in fp16/bf16 cuts VRAM use by almost half (and speeds up the forward pass as well).
I wonder about that too. With such low precision, parameter updates might be too small to have an effect (is it possible to use some sort of probabilistic update in that case?). Unfortunately, I haven’t found any resources describing the feasibility of full fp16 or bf16 training.
You are correct, training solely in fp16/bf16 can lead to imprecise weight updates or even gradients turning to zero. Because of that, mixed precision is used. In mixed precision training, we keep a copy of the weights in fp32 (the master model), and the training loop looks like this: compute the output with the fp16 model, then the loss -> back-propagate the gradients in half precision -> copy the gradients to fp32 -> do the update on the master model (in fp32) -> copy the master model back into the fp16 model. We also do loss scaling, which means multiplying the output of the loss function by some scalar before backprop (necessary in fp16 but not required in bf16).
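To make the loop above concrete, here is a toy sketch in pure Python, not how any framework implements it. It trains a single weight and uses the standard library's `struct` half-precision format (`"e"`) to emulate fp16 rounding. The learning rate, the constant gradient, the loss scale of 1024, and the helper names (`to_fp16`, `to_fp32`, `train`) are all assumptions made up for illustration. Scaling the gradient directly stands in for scaling the loss before backprop.

```python
import struct

def to_fp16(x: float) -> float:
    """Round to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

LR = 1e-2            # hypothetical learning rate
TRUE_GRAD = 1e-2     # hypothetical constant gradient, so each update is about 1e-4
LOSS_SCALE = 1024.0  # hypothetical loss scale

def train(steps: int, mixed: bool):
    """Return (fp16 weight, fp32 master weight) after `steps` updates."""
    master = to_fp32(1.0)  # fp32 master copy, only updated when mixed=True
    w16 = to_fp16(1.0)     # half-precision weight used for the forward pass
    for _ in range(steps):
        if mixed:
            # Backward pass in fp16 on the scaled loss (emulated by scaling the gradient).
            g16 = to_fp16(TRUE_GRAD * LOSS_SCALE)
            # Copy the gradient to fp32 and undo the scaling there.
            g32 = to_fp32(g16) / LOSS_SCALE
            # Update the fp32 master weight, then copy it back into the fp16 model.
            master = to_fp32(master - LR * g32)
            w16 = to_fp16(master)
        else:
            # Everything stays in fp16: the update is rounded away near 1.0.
            g16 = to_fp16(TRUE_GRAD)
            w16 = to_fp16(w16 - to_fp16(LR * g16))
    return w16, master

w_pure, _ = train(100, mixed=False)
w_mixed, master = train(100, mixed=True)
print("pure fp16 weight:", w_pure)
print("mixed precision :", w_mixed, "(master:", master, ")")

# Loss scaling: a tiny gradient underflows to zero in fp16 unless it is scaled first.
print("1e-8 in fp16:", to_fp16(1e-8))
print("1e-8 scaled, then unscaled in fp32:", to_fp32(to_fp16(1e-8 * LOSS_SCALE)) / LOSS_SCALE)
```

In this toy setup the pure fp16 weight never moves, because an update of about 1e-4 is smaller than half the spacing between fp16 values just below 1.0. The fp32 master weight accumulates every step.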

Check out the fastai docs for more details: https://docs.fast.ai/callback.fp16.html

Ah, my bad. I was using mixed precision training in my previous comment.

You might find this paper interesting: https://arxiv.org/pdf/2010.06192.pdf

Hmm, what do you mean? I thought bf16 is used extensively for LLM training.
How does one rawdog a 16-bit model?
Usually, for efficiency, you use quantized models. Quantized models reduce the number of bits available for each parameter, saving space and reducing RAM usage.
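A rough sketch of what "fewer bits per parameter" means, assuming plain symmetric round-to-nearest quantization with one scale per tensor. Real inference quant formats are more elaborate (block-wise scales and so on), so treat this as an illustration only. The function names, the sample weights, and the 7-billion-parameter count are hypothetical.

```python
def quantize(weights, bits):
    """Map floats to signed integers that fit in `bits` bits, plus one shared scale."""
    qmax = 2 ** (bits - 1) - 1                       # 15 for 5-bit, 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale for all-zero input
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate float weights from the integers."""
    return [v * scale for v in q]

def weight_bytes(n_params, bits):
    """Storage for the weights alone (ignores scales, activations, KV cache)."""
    return n_params * bits / 8

weights = [0.12, -0.5, 0.31, 0.9, -0.07]   # made-up sample values
for bits in (8, 5):
    q, scale = quantize(weights, bits)
    approx = dequantize(q, scale)
    worst = max(abs(a - w) for a, w in zip(approx, weights))
    print(f"{bits}-bit: ints={q} worst error={worst:.4f}")

n = 7_000_000_000                           # hypothetical parameter count
for bits in (16, 8, 5):
    print(f"{bits}-bit weights: {weight_bytes(n, bits) / 1e9:.2f} GB")
```

The rounding error per weight is bounded by half the scale, which is why the accuracy loss stays small while the storage shrinks in proportion to the bit width.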