
by wokwokwok 736 days ago

The website says:

> At inference time, the model retrieves the most relevant experts at each layer and merges back into the base model to respond to the user query.

The paper says:

> At inference time, only the relevant experts are retrieved from the index, allowing the LLM to store a large number of facts while maintaining low inference latency. We use specialized GPU kernels written in Triton (Tillet et al., 2019) to accelerate the lookup of experts.

...but darned if I can understand from either what they're actually doing when they say that.

Why do you need a custom GPU kernel for this outside of the normal NN layers?

Can anyone see an explanation of how they pick which expert to use?


janalsncm 736 days ago

Agreed, I looked through their “paper” and while it goes through the motions of a scientific paper, there’s barely any reproducible methodology: the method gets a single page, including the diagram.

They do reference some papers I’m not familiar with and say their method is “similar”.

If you check the huggingface page mentioned in a footnote, they have two directories: one for a model, and the other which contains a FAISS index. Although in the paper they say they use cross attention, so I have no idea how those could be combined.
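For what it's worth, here is one guess at how a nearest-neighbour index and cross attention could fit together. This is only a sketch under my own assumptions, not what the paper or their kernel does: `top_k` is a brute-force inner-product search standing in for a FAISS lookup, `retrieve_and_attend` is a hypothetical name, and the keys and values are made-up toy vectors. The idea is to use the hidden state as the query, pull the k closest stored entries, attend over just those, and add the result back residually like an adaptor.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query, keys, k):
    """Brute-force inner-product search (assumed stand-in for an index lookup)."""
    scores = [dot(query, key) for key in keys]
    order = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve_and_attend(hidden, keys, values, k=2):
    """Cross-attend from one hidden state to its k nearest stored entries.

    Hypothetical helper: retrieval picks the candidates, scaled dot-product
    attention mixes their values, and the mix is added back as a residual.
    """
    idx = top_k(hidden, keys, k)
    scale = math.sqrt(len(hidden))
    weights = softmax([dot(hidden, keys[i]) / scale for i in idx])
    mixed = [
        sum(w * values[i][d] for w, i in zip(weights, idx))
        for d in range(len(hidden))
    ]
    updated = [h + m for h, m in zip(hidden, mixed)]
    return updated, idx


# Toy data: three stored entries in a 2-dimensional space.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[0.5, 0.5], [0.0, 2.0], [9.0, 9.0]]
hidden = [2.0, 0.1]

out, idx = retrieve_and_attend(hidden, keys, values, k=2)
print(idx, out)
```

With the toy data, the third entry points away from the query, so it is never retrieved and its value never touches the output. That would at least explain why lookup cost matters: attention only runs over the handful of entries the index returns, not everything stored.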

gdiamos 735 days ago

That’s fair - I’ll try to go through it over the weekend and write out some of the equations for the kernel that loads the weights out of the index and does the adaptor ops. It’s inspired by the cross attention in RETRO, but there are some differences for training stability and so that it can be used as an adaptor rather than trained from scratch.

I consider that paper an early draft - hot off the press so to say - it needs review & editing before we would submit it to a conference. I tend to prefer a few rounds of open review before a final submission these days anyways - so appreciate the feedback

I think the main idea should be reproducible - you can repeat the randomization and generalization tests with any LLM and get similar training curves and eval results - it just wouldn’t be efficient.

We have tried it on about 5 real customer use cases with different facts, with good success. Obviously we can’t publish customer data for others to reproduce, which is why we focused on the randomization tests in the paper.

There are also some hyperparameters missing from the appendix that we will add eventually.