by anuragvohraec
1247 days ago
Isn't it curve fitting at the end of the day? Multi-parameter curve fitting?
Why do people say they don't know how it works? Yes, I get that the cocktail is fairly complex after training it on a very large dataset (almost all possible logical scenarios). But saying we do not know how it works seems like just adding mysticism to it, which attracts "clicks" but is not an honest description.
|
"Curve fitting" is the objective; the function encoded in the weights is the solution, and that function is not actually well understood. See work from Anthropic[1] and Google[2] that explores this.
As an analogy, consider applying the same argument to the AlphaGo value function. It's "just" fitting a bunch of curves to the statistics of millions of self-played games. However, to capture those statistics effectively, the network needed to develop a bunch of heuristics. Needless to say, these heuristics are not understood (else we'd already know the principles needed to play at AlphaGo's level), and they are not just exhaustive lists of statistical trends but more like strategies[3].
Recent work[4] strongly suggests that "grokking" (a striking but not unnatural[5] form of generalization) involves networks transitioning from memorized statistics/solutions to a general solution. The curve fitting perspective would miss all of this in favor of a comfortable but misleading story: "the objective is curve fitting, so it's just interpolating data points".
[1] https://transformer-circuits.pub
[2] https://arxiv.org/abs/2212.07677
[3] https://www.pnas.org/doi/10.1073/pnas.2206625119
[4] https://arxiv.org/abs/2301.05217
[5] https://arxiv.org/abs/2210.01117
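
To make the "objective vs. solution" distinction concrete, here is a toy sketch (my own illustration, not taken from any of the papers above, and not a neural network). Two models are fit to the same samples of sin(x) with the same least-squares objective. Both reach a tiny training error, but they encode different functions, which only shows up away from the training range. The names `sinusoid_features`, `sinusoid_model`, and `mse` are made up for this example; the degree-7 polynomial and the point 4*pi are arbitrary choices.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Training data: 20 samples of sin(x) on [0, 2*pi].
x_train = np.linspace(0.0, 2.0 * np.pi, 20)
y_train = np.sin(x_train)

# Solution A: a degree-7 polynomial fit by least squares.
poly = Polynomial.fit(x_train, y_train, deg=7)


# Solution B: least squares on the basis [sin(x), cos(x), 1].
def sinusoid_features(x):
    x = np.asarray(x, dtype=float)
    return np.stack([np.sin(x), np.cos(x), np.ones_like(x)], axis=-1)


weights, *_ = np.linalg.lstsq(sinusoid_features(x_train), y_train, rcond=None)


def sinusoid_model(x):
    return sinusoid_features(x) @ weights


def mse(pred, target):
    return float(np.mean((pred - target) ** 2))


# Same objective, and both solutions score almost perfectly on it.
poly_train_mse = mse(poly(x_train), y_train)
sin_train_mse = mse(sinusoid_model(x_train), y_train)

# Far outside the training range the two solutions disagree.
# The true value is sin(4*pi) = 0.
x_far = 4.0 * np.pi
poly_far = float(poly(x_far))
sin_far = float(sinusoid_model(x_far))

print("train MSE  polynomial:", poly_train_mse, " sinusoid:", sin_train_mse)
print("f(4*pi)    polynomial:", poly_far, " sinusoid:", sin_far)
```

Knowing that both models minimized squared error tells you nothing about which one picked up the periodic structure. You have to look at the fitted function itself, which is easy here and hard for a network with billions of weights.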