by HarHarVeryFunny
943 days ago
I expect self-consistency might be one useful reward function. Of course in the real world, for a real intelligent system, reality is the feedback/reward system, but for an LLM limited to its training set, with nothing to ground it, maybe this is the best you can do ... The idea is essentially that you need to assume (GIGO applies, of course) that most of the training data is factual/reasonable, whether in terms of facts or logic, and therefore that anything you can deduce from the training data that is consistent with the majority of it should be held as similarly valid (and vice versa). Of course this critically hinges on the quality of the training data in the first place. Maybe it would work best with differently tagged "tiers" of training data, each with a different level of presumed authority and reasonableness. Let the better data be used as a proxy for ground truth to "police" the lesser-quality data.
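
To make the tier idea a bit more concrete, here is a rough toy sketch, not a real training setup. Everything in it is made up for illustration: the `Statement` record, the tier names, the `TIER_WEIGHTS` values and the `consistency_reward` function are all hypothetical, and it assumes claims have already been normalized so that matching ones compare equal, which is the genuinely hard part.

```python
from dataclasses import dataclass

# Hypothetical authority weights per tier; the numbers are illustrative only.
TIER_WEIGHTS = {"high": 1.0, "medium": 0.5, "low": 0.2}


@dataclass(frozen=True)
class Statement:
    claim: str   # normalized proposition (assumed to be comparable by equality)
    holds: bool  # True if the source asserts the claim, False if it denies it
    tier: str    # key into TIER_WEIGHTS


def consistency_reward(claim, holds, corpus, weights=TIER_WEIGHTS):
    """Score a candidate (claim, holds) against the corpus.

    Returns a value in [-1, 1]: tier-weighted agreement minus
    disagreement, normalized by the total weight of matching statements.
    Returns 0.0 when the corpus says nothing about the claim.
    """
    agree = 0.0
    disagree = 0.0
    for s in corpus:
        if s.claim != claim:
            continue
        w = weights[s.tier]
        if s.holds == holds:
            agree += w
        else:
            disagree += w
    total = agree + disagree
    return 0.0 if total == 0 else (agree - disagree) / total


# One high-tier source against two low-tier sources that contradict it.
corpus = [
    Statement("the earth is round", True, "high"),
    Statement("the earth is round", False, "low"),
    Statement("the earth is round", False, "low"),
]

reward_true = consistency_reward("the earth is round", True, corpus)
reward_false = consistency_reward("the earth is round", False, corpus)
reward_unknown = consistency_reward("the moon is made of cheese", True, corpus)
```

With these made-up weights the single high-tier statement outweighs the two low-tier ones, so asserting the claim gets a positive reward even though a plain majority vote would go the other way. That is the "policing" effect; whether it does anything useful at scale still comes down to how good the top tier really is.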