| > itself not very well defined, but let's use IQ IQ has an issue that is inessential to the task at hand, which is that it is based on a population distribution. It doesn’t make sense for large values (unless there is a really large population satisfying properties that aren’t satisfied). > I doubt that too. The limit for LLMs for example is more human-produced training data (a hard limit) than compute. Are you familiar with what AIXI is? When I said “arbitrarily large”, it wasn’t out of laziness that I didn’t give an amount that is plausibly achievable. AIXI is kind of goofy. The full version of AIXI is uncomputable (it uses a halting oracle), which is why I referred to the computable approximations to it. AIXI doesn’t exactly need you to give it a training set; you just put it in an environment where you give it a way to select actions, a sensory input signal, and a reward signal. Then, assuming that the environment it is in is computable (which, recall, AIXI itself is not), its long-run behavior will maximize the expected (time-discounted) future reward signal. There’s a sense in which it is asymptotically optimal across computable environments (... though some have argued that this sense relies on a distribution over environments based on the enumeration of computable functions, and that this might make this property kinda trivial. Still, I’m fairly confident that it would be quite effective. I think this triviality issue is mostly a difficulty of having the right definition.) (Though, if it were possible to implement practically, you would want to make darn sure that the most effective way for it to make its reward signal high would be to do good things, and not either to do bad things or to crack open whatever system is setting the reward signal so that it can set the signal itself.) 
(How it works: AIXI basically enumerates all possible computable environments, assigning an initial probability to each according to the length of its program, and updating those probabilities based on how likely each environment is to have produced the sequence of perceptions and reward signals received so far, given the sequence of actions the agent has taken so far. It evaluates the expected discounted future reward of different combinations of future actions under its current probabilities for the environments under consideration, and selects its next action to maximize this. I think the maximum length of programs that it considers as possible environments increases over time or something, so that it doesn’t have to consider infinitely many at any particular step.) |
That's still a training set, just by another name.
And with the environment being the world we live in, it would be constrained by the states the local environment can actually be in, by the actions it can perform to get feedback on, and by the rate at which the environment responds (the rate of feedback).
Add the rapid growth of the state space it has to consider, and it's an even tougher deal than getting more training data for an LLM.
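To make the loop from the quoted post concrete, here is a rough toy sketch of it, not AIXI itself. Everything in it is an assumption made for illustration: the two hand-written environments, their made-up "program lengths", and the names `posterior` and `best_action`. It swaps the enumeration of all computable environments for a two-element list, treats environments as deterministic, and scores fixed sequences of future actions rather than doing a full expectimax search.

```python
from itertools import product

# Hypothetical toy environments (not from the thread). Each maps the tuple of
# actions taken so far to a percept (observation, reward) for the latest action.
def rewards_action_0(actions):
    return (0, 1.0 if actions[-1] == 0 else 0.0)

def rewards_action_1(actions):
    return (0, 1.0 if actions[-1] == 1 else 0.0)

# (environment, assumed program length in bits); the lengths are made up.
ENVS = [(rewards_action_0, 3), (rewards_action_1, 5)]

def posterior(envs, actions, percepts):
    """Start from a 2**-length prior and drop every environment that
    contradicts a percept actually received (deterministic likelihood: 0 or 1)."""
    weights = []
    for env, length in envs:
        consistent = all(
            env(actions[: t + 1]) == percepts[t] for t in range(len(percepts))
        )
        weights.append(2.0 ** -length if consistent else 0.0)
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no environment in the class explains the history")
    return [w / total for w in weights]

def best_action(envs, weights, actions, action_set, horizon, discount):
    """Score every fixed sequence of `horizon` (>= 1) future actions by its
    posterior-weighted discounted reward; return (first action, value)."""
    best_plan, best_value = None, float("-inf")
    for plan in product(action_set, repeat=horizon):
        value = 0.0
        for (env, _), w in zip(envs, weights):
            history = tuple(actions)
            for k, a in enumerate(plan):
                history += (a,)
                _, reward = env(history)
                value += w * discount ** k * reward
        if value > best_value:
            best_plan, best_value = plan, value
    return best_plan[0], best_value

# Before any interaction the shorter "program" dominates the prior.
weights = posterior(ENVS, (), ())
first_action, _ = best_action(ENVS, weights, (), (0, 1), horizon=2, discount=0.5)

# After trying action 1 and being rewarded, only the second environment survives.
weights_after = posterior(ENVS, (1,), ((0, 1.0),))
```

Even in this toy, `best_action` loops over `len(action_set) ** horizon` plans for every candidate environment, which is the state-space inflation mentioned above, and the posterior only moves as fast as real percepts arrive.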