Regret analysis in bandits and similar algorithms shows how inference is connected to the loss function. If your loss function is good, greedy inference is as good as joint inference.
Training on cost-to-go loss is good enough.
Perfect cost-to-go eliminates the need for global algorithms and allows local decision making. Given “natural” datasets it is probably the best thing to attempt to learn.
The fact that probabilistic graphical models never really worked is some evidence for this.
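To make the cost-to-go point concrete, here is a minimal sketch, assuming a toy chain model with random per-step (`unary`) and transition (`pairwise`) costs. All names here are made up for illustration, and the cost-to-go table is computed exactly by backward dynamic programming rather than learned, so this only shows the idealized "perfect cost-to-go" case: a greedy, one-step-at-a-time decoder that consults the table reaches the same total cost as an exhaustive joint search.

```python
import itertools
import random


def cost_to_go(unary, pairwise):
    """Backward DP: V[t][s] is the cheapest cost of finishing from state s at step t."""
    T, S = len(unary), len(unary[0])
    V = [[0.0] * S for _ in range(T)]
    V[T - 1] = list(unary[T - 1])
    for t in range(T - 2, -1, -1):
        for s in range(S):
            V[t][s] = unary[t][s] + min(pairwise[s][n] + V[t + 1][n] for n in range(S))
    return V


def greedy_decode(unary, pairwise, V):
    """Local decisions only: at each step pick the state with the lowest
    transition cost plus cost-to-go. No backtracking, no global search."""
    T, S = len(unary), len(unary[0])
    path = [min(range(S), key=lambda s: V[0][s])]
    for t in range(1, T):
        prev = path[-1]
        path.append(min(range(S), key=lambda s: pairwise[prev][s] + V[t][s]))
    return path


def path_cost(path, unary, pairwise):
    """Total cost of a full state sequence."""
    cost = sum(unary[t][s] for t, s in enumerate(path))
    cost += sum(pairwise[a][b] for a, b in zip(path, path[1:]))
    return cost


def brute_force(unary, pairwise):
    """Joint inference by enumerating every sequence (only feasible for tiny problems)."""
    T, S = len(unary), len(unary[0])
    best = min(itertools.product(range(S), repeat=T),
               key=lambda p: path_cost(p, unary, pairwise))
    return list(best)


# Toy instance (hypothetical sizes): 6 steps, 3 states, random costs.
rng = random.Random(0)
T, S = 6, 3
unary = [[rng.random() for _ in range(S)] for _ in range(T)]
pairwise = [[rng.random() for _ in range(S)] for _ in range(S)]

V = cost_to_go(unary, pairwise)
greedy_path = greedy_decode(unary, pairwise, V)
joint_path = brute_force(unary, pairwise)
print("greedy:", greedy_path, path_cost(greedy_path, unary, pairwise))
print("joint: ", joint_path, path_cost(joint_path, unary, pairwise))
```

The two printed costs should agree. With an approximate (learned) cost-to-go the greedy decoder would only be as good as the approximation, which is where the regret-style analysis comes in.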
Are there any good papers on this you would suggest, or specific search terms?
I am vaguely aware of some of this, but would love to study more. I don't quite understand what it is all about, though I do see how LLMs can attend to all prior tokens, so they don't have the single point of failure that HMMs do, which is what makes Viterbi decoding more necessary there.