by wokwokwok
736 days ago
The website says:

> At inference time, the model retrieves the most relevant experts at each layer and merges back into the base model to respond to the user query.

The paper says:

> At inference time, only the relevant experts are retrieved from the index, allowing the LLM to store a large number of facts while maintaining low inference latency. We use specialized GPU kernels written in Triton Tillet et al. (2019) to accelerate the lookup of experts.

...but darned if I can understand from either what they're actually doing when they say that.

Why do you need a custom GPU kernel for this outside of the normal NN layers?

Can anyone see an explanation of how they pick which expert to use?
They do reference some papers I’m not familiar with and say their method is “similar”.
If you check the Hugging Face page mentioned in a footnote, they have two directories: one for a model, and another containing a FAISS index. But the paper says they use cross attention, so I have no idea how the two could be combined.
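
For what it's worth, here is one guess at how an index and a "merge back into the base model" step could fit together. To be clear, this is speculation and not what the paper describes: it assumes each expert is a low-rank weight update stored next to a key vector, that the index lookup is a plain inner-product top-k (brute force here, standing in for FAISS), and every name in it (`top_k_experts`, `merge_experts`, `expert_keys`) is made up for illustration.

```python
# Speculative sketch only: NOT the paper's method. All names are hypothetical.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    return [dot(row, x) for row in W]

def top_k_experts(query, expert_keys, k):
    """Indices of the k keys with the highest inner product to the query."""
    scores = [dot(query, key) for key in expert_keys]
    order = sorted(range(len(expert_keys)), key=lambda i: scores[i], reverse=True)
    return order[:k]

def merge_experts(base_W, experts, chosen):
    """Add the low-rank update B @ A of each chosen expert to a copy of base_W."""
    merged = [row[:] for row in base_W]
    for idx in chosen:
        A, B = experts[idx]  # assumed shapes: A is r x d_in, B is d_out x r
        for i in range(len(merged)):
            for j in range(len(merged[0])):
                merged[i][j] += sum(B[i][r] * A[r][j] for r in range(len(A)))
    return merged

# Toy data: a 2x2 identity layer and three rank-1 "experts".
base_W = [[1.0, 0.0], [0.0, 1.0]]
expert_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
experts = [
    ([[1.0, 0.0]], [[0.0], [2.0]]),
    ([[0.0, 1.0]], [[1.0], [0.0]]),
    ([[1.0, 1.0]], [[1.0], [1.0]]),
]

query = [0.75, 0.25]                           # stand-in for a hidden state
chosen = top_k_experts(query, expert_keys, 1)  # the "index lookup" step
merged = merge_experts(base_W, experts, chosen)
out = matvec(merged, query)                    # forward pass through merged layer
```

If something like this is going on at every layer, the top-k lookup sits inside the forward pass rather than in front of it, which might be why a custom kernel is worth the trouble. It still doesn't explain where cross attention comes in.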