Ah, that makes sense. So (very) basically, they're putting a number of regular LLMs into a sort of compute chain/graph, where one LLM feeds into the other, then doing gradient descent on the whole chain at once, essentialy treating the boundaries between LLM n and LLM n+1 as "hidden layers"?