by fancyfredbot 1172 days ago
When fine-tuning an LLM, you can use the LoRA technique to make fine-tuning faster. LoRA freezes the pretrained weights and trains only a small set of added parameters: a low-rank update to each adapted weight matrix, expressed as the product of two small matrices. The size of that set is determined by the rank. The smaller the rank, the faster the fine-tuning. However, if you make the rank too small, quality will suffer, so you want to pick the optimal rank. This paper describes a technique that can be used to find the optimal rank more easily.
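To make the rank trade-off concrete, here is a minimal NumPy sketch of a single LoRA-style layer. The layer sizes, the rank, and the `lora_forward` helper are illustrative assumptions, not anything taken from the paper; the point is only how the trainable parameter count scales with the rank.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the paper).
d_out, d_in, rank = 64, 48, 4

# Frozen pretrained weight: never updated during fine-tuning.
W = rng.normal(size=(d_out, d_in))

# Trainable low-rank factors. B starts at zero, so the adapted layer
# initially behaves exactly like the pretrained one.
A = rng.normal(scale=0.01, size=(rank, d_in))
B = np.zeros((d_out, rank))

def lora_forward(x, W, A, B, scale=1.0):
    """Compute x @ (W + scale * B @ A).T without forming the full update."""
    return x @ W.T + scale * (x @ A.T) @ B.T

x = rng.normal(size=(2, d_in))
y = lora_forward(x, W, A, B)

full_params = W.size            # what full fine-tuning would train
lora_params = A.size + B.size   # what this sketch trains
print(f"full: {full_params} params, rank-{rank} adapter: {lora_params} params")
```

Halving the rank halves the adapter's parameter count, which is where the speed and memory savings come from.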

Fascinating progress.

Would you say the following understanding is correct?

- You can fine-tune a model, regardless of whether it has been quantized (as in the 4-bit versions of models made to fit in consumer grade RAM sizes) or not.

- You can fine-tune any model on any hardware, provided it fits into RAM. That means that the 30B LLaMA-derived models, which need 19.5 GB of VRAM in their 4-bit quantized versions, can be fine-tuned on consumer-grade GPUs with 24 GB of VRAM (like the RTX 3090 and 4090).

Yes to the first.

To the second, I'm not sure the memory requirements for training are the same as for inference, because you also have to keep training state (gradients, optimizer state, saved activations), which takes extra memory.
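A back-of-the-envelope sketch of that extra memory, under loudly stated assumptions: every trainable value is kept in 32-bit precision, the optimizer is Adam-style with two moment buffers, and activations are ignored entirely. The 50M adapter size is hypothetical. The raw 4-bit weight count below also comes out lower than the 19.5 GB quoted above, presumably because of overhead this sketch does not model.

```python
GB = 1e9  # decimal gigabytes

def training_state_bytes(trainable_params, bytes_per_value=4):
    """Weights + gradients + two Adam moment buffers, all at one precision.

    Activations are NOT included, so this is a lower bound.
    """
    copies = 4  # weights, gradients, first moment, second moment
    return trainable_params * bytes_per_value * copies

base_params = 30_000_000_000      # "30B" model from the thread
frozen_4bit = base_params // 2    # 4 bits = 0.5 bytes per frozen weight

# Full fine-tuning: every parameter is trainable.
full_ft = training_state_bytes(base_params)

# Adapter fine-tuning: frozen 4-bit base plus a small trainable adapter.
adapter_params = 50_000_000       # hypothetical adapter size
adapter_ft = frozen_4bit + training_state_bytes(adapter_params)

print(f"frozen 4-bit base : {frozen_4bit / GB:.1f} GB")
print(f"full fine-tuning  : {full_ft / GB:.1f} GB")
print(f"adapter on 4-bit  : {adapter_ft / GB:.1f} GB")
```

Under these assumptions the training state for a small adapter is minor next to the frozen base, while full fine-tuning is far out of reach of a 24 GB card. Activation memory, which depends on batch size and sequence length, still has to fit on top.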

Is it possible for many people to simultaneously fine tune models on different data and then combine the new models into something improved?
One approach is to have the model learn to select between several separately fine tuned adapters by learning which adapter works best in a given context. So at any given time it's only really using one adapter but can switch to another. In this case one adapter can't really improve another but the overall impact might be a model which is improved in a variety of different contexts.
Yes, but the naïve way to combine rank-k adaptations created by n different people would be to concatenate them into a rank-nk adaptation, which wouldn't be as lightweight or easy to share, so you'd likely be better off mushing them into the baseline model.
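A small NumPy sketch of the two options, assuming the adapters are combined by plain summation of their updates (other weightings are possible). All sizes and names here are illustrative. It shows only that the two forms are mathematically equivalent, not that the combined model is actually better.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: n people each contribute a rank-k adapter.
d_out, d_in, k, n = 32, 24, 2, 3

W = rng.normal(size=(d_out, d_in))  # shared baseline weight
adapters = [
    (rng.normal(size=(d_out, k)), rng.normal(size=(k, d_in)))  # (B_i, A_i)
    for _ in range(n)
]

# Option 1: concatenate into one rank-(n*k) adapter.
B_cat = np.concatenate([B for B, _ in adapters], axis=1)  # (d_out, n*k)
A_cat = np.concatenate([A for _, A in adapters], axis=0)  # (n*k, d_in)

# Option 2: add every update directly into the baseline weight.
delta = sum(B @ A for B, A in adapters)
W_merged = W + delta

# Both options describe the same update to W.
same_update = np.allclose(B_cat @ A_cat, delta)

cat_params = B_cat.size + A_cat.size   # grows with n
merged_params = W_merged.size          # fixed, but as large as W itself
print(same_update, cat_params, merged_params)
```

The concatenated adapter grows linearly with the number of contributors, while the merged weight stays the size of the original matrix but can no longer be shared as a small file.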
Can they mathematically be “mushed” and then create an improved model?

I have yet to understand the difference between fine-tuning and training, and therefore whether a distributed, decentralized, eventually consistent training approach is a possibility or simply not realistic.

If you make N copies of a model, train them independently for a little while on N machines, and average them back together, it sort of works. But not if you train for very long, as the internal structure diverges.

It becomes an empirical engineering question how many parallel nodes you can train on for how long before averaging them back together. It's an expensive question to answer, since you have to train many variations to get the data.
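The train-then-average loop can be sketched on a toy problem. This uses linear regression with plain gradient descent, which is an assumption made for brevity; the function names and hyperparameters are hypothetical. Because this toy problem is convex, it will not show the divergence described above, which comes from the non-convex structure of real networks. It only illustrates the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_shard(n_rows, true_w):
    """One machine's private data: noisy samples of a linear function."""
    X = rng.normal(size=(n_rows, true_w.size))
    y = X @ true_w + 0.1 * rng.normal(size=n_rows)
    return X, y

def local_train(w, X, y, steps, lr=0.1):
    """Run a few gradient steps on one shard, starting from shared weights."""
    w = w.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def average_round(w, shards, steps):
    """Copy w to every shard, train independently, then average the copies."""
    replicas = [local_train(w, X, y, steps) for X, y in shards]
    return np.mean(replicas, axis=0)

true_w = rng.normal(size=5)
shards = [make_shard(200, true_w) for _ in range(4)]  # N = 4 machines

w = np.zeros(5)
for _ in range(10):  # 10 averaging rounds
    w = average_round(w, shards, steps=5)  # 5 local steps per round

error = np.linalg.norm(w - true_w)
print(f"distance from true weights: {error:.4f}")
```

The two knobs in the comment above map to `len(shards)` (how many parallel nodes) and `steps` (how long each trains before averaging); for real models, the workable range of both has to be found experimentally.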