by FieryTransition
492 days ago
But if you have a large set of problems to which you already know the answer and use that in reinforcement learning, wouldn't the expertise later transfer to problems with no known answers? That is a feasible strategy, right? Another issue is how much data you can synthesize in such a way that you construct both the problem and the solution, so that you know the answer before using it as a sample. I.e., some problems are easy to make when you construct them yourself, but would be hard to solve with no prior knowledge, so could they be used as a scoring signal? I.e., you are the Oracle, and whatever model is being trained doesn't know the answer, only whether it is right or wrong. But I don't know if the reward function must be binary or on a scale. Does that make sense or is it wrong?
Sorry, long answer incoming. It is far from complete too but I think it will help build strong intuition around your questions.
Will knowledge transfer? That entirely depends on the new problem and on how closely related it is to the old one. It also depends on what information was used to solve the pre-transfer problem. Take LLMs for example. There's lots of work showing they are difficult to train to do calculations: they will do well on problems with the same number of digits as the training examples, but performance degrades rapidly as the number of digits increases. It can be weird to read some of these papers, as there will sometimes be periodic relationships with the number of digits, but that should give us information about how they're encoding the problems. That lack of transferability indicates that just because the model solves one problem, and we'd believe the new one is actually just the same problem, doesn't mean it is. So you have to be really careful here, because us humans are really fucking good at generalization (yeah, we also suck, but a big part of that is our proficiency makes us recognize where we lack. Also, this is more a "humans can" than a "humans do" type of thing, so be careful when comparing). This generalization is really because we're focused on building causal relationships, while the ML algorithms are built around compression (i.e. fitting data). Which, if you notice, is the same issue I was pointing to above.
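To make the digit-length point concrete, here is a minimal sketch of how you might measure it. Everything in it is an assumption for illustration: `short_operand_solver` is a toy stand-in for a model that only "fit" sums of up to four digits, not an actual LLM, and the function names are made up.

```python
import random

def make_addition_problems(n_digits, count, rng):
    """Return (a, b, a + b) triples where a and b each have exactly n_digits digits."""
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    problems = []
    for _ in range(count):
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        problems.append((a, b, a + b))
    return problems

def accuracy_by_digits(solver, digit_counts, per_count=200, seed=0):
    """Exact-match accuracy of solver(a, b), reported separately per operand length."""
    rng = random.Random(seed)
    results = {}
    for n in digit_counts:
        problems = make_addition_problems(n, per_count, rng)
        correct = sum(solver(a, b) == answer for a, b, answer in problems)
        results[n] = correct / per_count
    return results

def short_operand_solver(a, b):
    """Hypothetical toy 'model': it can only emit the four lowest digits of the sum.

    It is exact whenever the true sum fits in four digits and wrong otherwise,
    which mimics a solver that looks fine at the lengths it was fit on.
    """
    return (a + b) % 10 ** 4

if __name__ == "__main__":
    for n, acc in accuracy_by_digits(short_operand_solver, [1, 2, 3, 4, 5]).items():
        print(f"{n}-digit operands: accuracy {acc:.2f}")
```

The only real point is the evaluation shape: report accuracy per operand length instead of one pooled number, so a solver that looks perfect at the lengths it was fit on can't hide the drop-off.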
This entirely depends on the problem. We can construct simple problems that illustrate success as well as failure. What you really need to think about here is the information gain from the answer. If you check how to calculate that, you will see the dependence (we could get into Bayesian learning or experiment design, but this is long enough). But let's think of a simple example in the negative direction. If I ask you to guess where I'm from, you're going to have a very hard time pinning down the exact location. There is definitely an efficient method in this example, but our ML learning algorithms don't start with prior knowledge about strategies, so they aren't going to know to binary search. If you gave that to the model, you baked in that information. This is a tricky form of information leakage. It can be totally fine to bake in knowledge, but we should be aware of how that changes how we evaluate things (we always bake in knowledge btw. There is no escaping this). But most models would not have a hard time if instead we played "hot/cold", because the information gain is much higher. We've provided a gradient to the solution space. We might call these hard and soft labels, respectively.
I picked this because there's a rather famous paper about emergent abilities (I fucking hate this term[0]) in ML models[1], and a far less famous counter to it[2]. There are a lot of problems with [1] that require a different discussion, but [2] shows that a big part of the issue is that many of the loss landscapes are fairly flat, so when feedback is discrete the smaller models just wander around that flat landscape, needing to get lucky to find the optima (btw, this also shows that technically this can be done too! But that would require different training methods and optimizers). But when given continuous feedback (i.e. you're wrong, but closer than your last guess), they are able to actually optimize.
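A toy sketch of the hard vs. soft feedback point, under stated assumptions: it is a 1-D guessing game with a naive local search, not the experiments from [1] or [2], and all names (`answer_entropy_bits`, `local_search_steps`, `binary_reward`, `graded_reward`) are made up for illustration. The first function computes how many bits a yes/no answer carries; the rest compares a searcher that only hears right/wrong with one that hears how far off it is.

```python
import math
import random

def answer_entropy_bits(p_yes):
    """Shannon entropy, in bits, of a yes/no answer that is 'yes' with probability p_yes."""
    if p_yes <= 0.0 or p_yes >= 1.0:
        return 0.0
    return -(p_yes * math.log2(p_yes) + (1 - p_yes) * math.log2(1 - p_yes))

def binary_reward(target):
    """Hard label: 0 when exactly right, -1 for every wrong guess (a flat landscape)."""
    return lambda x: 0 if x == target else -1

def graded_reward(target):
    """Soft label: the closer the guess, the higher the reward (a slope to follow)."""
    return lambda x: -abs(x - target)

def local_search_steps(reward, start, n, rng, max_steps=100_000):
    """Propose a random +/-1 move and keep it unless the reward gets worse.

    Returns the number of proposals made before the reward reaches its maximum (0).
    """
    x = start
    if reward(x) == 0:
        return 0
    for step in range(1, max_steps + 1):
        proposal = min(n - 1, max(0, x + rng.choice((-1, 1))))
        if reward(proposal) >= reward(x):
            x = proposal
        if reward(x) == 0:
            return step
    return max_steps

def mean_steps(make_reward, n=100, trials=50, seed=0):
    """Average number of proposals over random (target, start) pairs in range(n)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        target, start = rng.randrange(n), rng.randrange(n)
        total += local_search_steps(make_reward(target), start, n, rng)
    return total / trials

if __name__ == "__main__":
    print("bits from 'is it this one of 1024?':", answer_entropy_bits(1 / 1024))
    print("bits from a question that halves the options:", answer_entropy_bits(0.5))
    print("mean steps, right/wrong feedback:", mean_steps(binary_reward))
    print("mean steps, graded feedback:", mean_steps(graded_reward))
```

With right/wrong feedback every wrong guess looks the same, so the search is a random walk that has to stumble onto the target. With graded feedback the same dumb search has something to climb. Note the searcher itself has no strategy baked in either way; only the feedback changes.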
A big criticism of the work is that it is an unfair comparison because there are "right and wrong" answers here, but it'd be naive not to recognize that some answers are more wrong than others. Plus, their work shows a clear, testable way we can confirm or deny whether this works. We schedule learning rates; there's no reason you cannot schedule labels. In fact, this does work.
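One way to read "schedule labels", as a minimal sketch: start training on a graded score and anneal toward the strict right/wrong score. This is only an assumption about what such a schedule could look like; the linear blend, the `scale` constant, and the function names are all hypothetical and not taken from [2].

```python
def hard_score(guess, target):
    """Strict label: 1.0 only for an exact match."""
    return 1.0 if guess == target else 0.0

def soft_score(guess, target, scale=10.0):
    """Graded label in (0, 1]: 1.0 for an exact match, decaying with distance."""
    return 1.0 / (1.0 + abs(guess - target) / scale)

def scheduled_score(guess, target, step, total_steps):
    """Blend soft and hard scores, moving linearly from soft to hard over training.

    Assumes total_steps > 0. At step 0 the score is fully graded; at
    step >= total_steps it is the strict right/wrong score.
    """
    t = min(max(step / total_steps, 0.0), 1.0)
    return (1.0 - t) * soft_score(guess, target) + t * hard_score(guess, target)

if __name__ == "__main__":
    for step in (0, 50, 100):
        print(step, scheduled_score(guess=4, target=5, step=step, total_steps=100))
```

Early on a near miss still earns most of the credit, so there is a slope to follow; by the end only exact answers count, which matches how you'd actually evaluate.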
But also look at the ways they tackled these problems. They are entirely different. [1] tries to do proof by evidence while [2] uses proof by contradiction. Granted, [2] has an easier problem since they only need to counter the claims of [1], but that's a discussion about how you formulate proofs.
So I'd be very careful when using the recent advancements in ML as a framework for modeling reasoning. The space is noisy. It is undeniable that we've made a lot of advancements, but there are some issues with what work gets noticed and what doesn't. A lot does come down to this proof by evidence fallacy. Evidence can only bound confidence; unfortunately, it cannot prove things. But this is helpful and, well, we can bound our confidence to limit the search space before we change strategies, right? I picked [1] and [2] for a reason ;) And to be clear, I'm not saying [1] shouldn't exist as a paper or that the researchers were dumb for doing it. Read back on this paragraph, because we've got multiple meta layers here. It's good to place a flag in the ground, even if it is wrong, because you gotta start somewhere, and science is much, much better at ruling things out than ruling things in. We focus more on proving things don't work until there's not much left and then accept those things (limits here too, but this is too long already).
I'll leave with this, because now there should be a lot of context that makes this much more meaningful: https://www.youtube.com/watch?v=hV41QEKiMlM
[0] It significantly diverges from the terminology used in fields such as physics. ML models are de facto weakly emergent by nature of composition. But the ML definition can entirely be satisfied by "Information was passed to the model but I wasn't aware of it" (again, same problem: exhaustive testing)
[1] (2742 citations) https://arxiv.org/abs/2206.07682
[2] (447 citations) https://arxiv.org/abs/2304.15004