I don't understand why you can't get explainable models simply by training two LLMs. The first one has to tell the second one what to do (in English). The second one follows English instructions.
The work seems to generate per-instance weights that describe each feature by the effect it has on the outcome. How would you propose to do that with two LLMs?
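To make concrete what I mean by per-instance weights, here is a toy sketch in plain Python. It is only my reading of the idea, not the method from the work: the function names, the tanh weight generator, and the parameters W and b are all made up for illustration. Each input gets its own weight vector, and the prediction is the sum of weight times feature, so every feature's contribution can be read off directly.

```python
import math

def instance_weights(x, W, b):
    """Hypothetical weight generator: one weight per feature, computed from x itself."""
    n = len(x)
    return [math.tanh(sum(W[i][j] * x[j] for j in range(n)) + b[i]) for i in range(n)]

def explain(x, W, b):
    """Return the prediction and the per-feature contributions w_i(x) * x_i."""
    w = instance_weights(x, W, b)
    contributions = [wi * xi for wi, xi in zip(w, x)]
    return sum(contributions), contributions

# Toy parameters, invented for this example (the third feature always gets weight 0).
W = [[0.5, 0.0, 0.0],
     [0.0, -1.0, 0.0],
     [0.0, 0.0, 0.0]]
b = [0.0, 0.0, 0.0]

for x in ([1.0, 2.0, 3.0], [-1.0, 0.5, 3.0]):
    y, contrib = explain(x, W, b)
    print(x, round(y, 3), [round(c, 3) for c in contrib])
```

The weights differ from one input to the next, which is what I would want the two-LLM setup to reproduce: an explanation in English does not by itself give you a number per feature per instance.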
This idea has a lot of potential, since deep learning is normally very abstract. Is there a way to combine this with common libraries like PyTorch or TensorFlow?