| HN Mirror


by magicalhippo 736 days ago

Bit late to the party and I'm not into the AI scene, but from skimming the three key papers they cite to describe their model, my take is as follows.

The idea from LoRA[1] is to take pre-trained, dense weights W_0 for a model and adapt them to new training data by using the weights W = W_0 + BA for inference. The key is that the matrices A and B have a very low rank[2] compared to W_0. Training is then done by treating W_0 as constant, and only updating the A and B matrices.
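To make the W = W_0 + BA idea concrete, here is a minimal sketch in plain Python. The toy dimensions, the helper names (`matmul`, `add`, `lora_param_counts`) and the random values are assumptions for illustration only; the zero initialization of B follows the LoRA paper, so the adapted weights start out identical to W_0.

```python
import random

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def lora_param_counts(d_out, d_in, r):
    """Return (parameters in dense W_0, parameters in B and A combined)."""
    return d_out * d_in, r * (d_out + d_in)

random.seed(0)
d_out, d_in, r = 6, 4, 2  # toy sizes (assumed); real layers are far larger

# W0 is the frozen pre-trained weight matrix; only B and A would be trained.
W0 = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
B = [[0.0] * r for _ in range(d_out)]                                 # d_out x r, zeros
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]  # r x d_in

W = add(W0, matmul(B, A))  # effective weights used for inference
```

For a 4096 x 4096 layer with rank 8, `lora_param_counts(4096, 4096, 8)` gives 16,777,216 dense parameters versus 65,536 trainable ones, which is why only updating A and B is so cheap.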

The idea from MoE[3] is to pick just a few "experts" from a large number using Softmax, and insert them between other layers in a neural network. The "experts" can be just some simple matrices or neural networks in their own right.
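As I understand the gating, it can be sketched roughly like this: keep the k largest gate scores, Softmax over just those, and take the weighted sum of the selected experts' outputs. This is a simplified sketch that leaves out the noise term from the MoE paper; the function names, the toy "experts" and the hard-coded gate scores are all made up for illustration.

```python
import math

def top_k_gate(logits, k):
    """Softmax over the k largest logits; all other experts are dropped."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_output(x, experts, gate_logits, k=2):
    """Weighted sum of the outputs of the k selected experts."""
    out = [0.0] * len(x)
    for i, g in top_k_gate(gate_logits, k).items():
        y = experts[i](x)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out

# Hypothetical toy experts: each one just scales its input.
experts = [lambda x, s=s: [s * xi for xi in x] for s in (1.0, 2.0, 3.0, 4.0)]
# In a real model these scores come from a learned gating layer applied to x.
gate_logits = [0.1, 2.0, -1.0, 2.0]

y = moe_output([1.0, 1.0], experts, gate_logits, k=2)
```

Only the experts picked by the gate are evaluated at all, which is how the number of experts can grow without the compute per input growing with it.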

The Lamini model seems to combine these ideas, where they use several "experts" layered between the BA matrices. However instead of just a simple Softmax to select the "experts" they use (chunked?) cross-attention[4], and a much larger number of "experts" compared to the MoE paper. From what I gather the cross-attention allows the expert selection to react to the context, unlike the more naive plain Softmax gating approach.

Similarly to LoRA, they train the Lamini model by keeping the pre-trained LLM weights constant. From what I can gather they do a bit of training on the cross-attention layer but then freeze that too, to avoid the cross-attention layer favoring the same "experts" all the time.

The idea then is to train the "experts" and the LoRA layer on facts until the combined model (W above) gets each fact correct (zero loss).

Thus when the model "sees" a keyword in a sentence, a set of "experts" will steer/adjust the output of the pre-trained LLM to output the correct fact. Or at least that's how I imagine it works.

What's less clear to me is what exactly the "experts" are and how they are combined.

Since they're used to adjust the weights of the combined model, it makes the most sense to my un-trained eye that they're simple matrices as mentioned in the MoE paper. Given they're layered between the low-rank portions of the LoRA section, they're necessarily relatively small matrices, so having millions of these "expert" matrices doesn't add too many parameters overall.

And while they're portrayed as stacked in the Lamini paper, suggesting matrix multiplication, matrix multiplication is generally not commutative[5], so the result would depend on the order in which the selected "experts" are stacked. So to the un-trained eye it seems likely the "experts" are simply added like in the MoE paper.
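A tiny check of the ordering argument, using two made-up 2x2 "expert" matrices (purely illustrative, nothing from the paper): the product changes when the order is swapped, while the sum does not.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

# Two hypothetical "expert" matrices.
E1 = [[1, 1], [0, 1]]
E2 = [[1, 0], [1, 1]]

product_order_matters = matmul(E1, E2) != matmul(E2, E1)  # True
sum_order_matters = add(E1, E2) != add(E2, E1)            # False
```

So if the set of selected experts has no natural order, combining them by addition gives a well-defined result, whereas multiplying them would not.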

But yeah, I'm very much not an expert and the paper was more like a conference poster at best, lacking a lot of detail, so this might be all gibberish and I'd appreciate being corrected.

[1]: https://arxiv.org/abs/2106.09685

[2]: https://en.wikipedia.org/wiki/Rank_factorization

[3]: https://arxiv.org/abs/1701.06538

[4]: https://arxiv.org/abs/2112.04426

[5]: https://en.wikipedia.org/wiki/Commuting_matrices


vessenes 735 days ago

Thanks for this substantive reply.

In this case, the experts could literally be routing / weighting of the LoRAs, so it could be a 1x100k (or 1M or whatever) vector, maybe binary for simplicity and size. Or floats that are capped to 1/0 at inference time, but trained as floats.

The thing that’s a little weird to me is that you’d need to keep retraining the experts. But, I guess it may just be part of the pipeline for adding custom knowledge to the system.


magicalhippo 735 days ago

> In this case, the experts could literally be routing / weighting of the LoRas

Hmm yes, good point. Hard to tell with so little to go on.

edit: I assumed they were matrices since they're drawn as squares in the figure, just squashed to fit in the LoRA stackup, and since they'd be low-dimensional they'd have few parameters.

> The thing that’s a little weird to me is that you’d need to keep retraining the experts.

Yeah my impression was this is more for static knowledge, like if you wanted to have a Wikipedia-assistant say.

It got me thinking though, whether it could be a stepping stone towards something more dynamic. Say, could you use something like the W = W_0 + dW idea to tweak the cross-attention mechanism to select newly added experts somehow?

Again, not into the AI scene, just like entertaining these shower thoughts.
