|
|
|
|
|
by mjburgess
1096 days ago
|
|
So this is a really good starting point -- but you havent formulated any hypotheses that can be tested. You've just looked at the graph and "reckoned something". Formally, what hypotheses are you comparing? What do you think the specific hypothesis of the "AI = stats" person is? It isnt that the NN literally remembers data tokens, right? In any case: The issue with forcing NNs to model mathematical features is that the structure of the data itself has those properties. So the distributional hypothesis is true for sorting ordinals. But it's really obviously false for natural language. The properties of the world are not the properties of word order... being red isnt "red follows words like...". |
|
Let's not be so hasty. I think I do put it as clearly as possible. I'm comparing essentially your Hyp1 and Hyp2, where Hyp1 (aka the stochastic parrot) is expressed a little bit more clearly as the LLM is learning an n-gram that produces correct sorts through rote memorization of statistical correlations in the training data, like that sorted lists tend to start with '0', end with '99', and increase monotonically; and Hyp2 is that the LLM's training molds it into representing an actual sorting algorithm that would correctly generalize to any input list.
> But it's really obviously false for natural language. The properties of the world are not the properties of word order... being red isnt "red follows words like..."
This is not really obviously false. Yes, being red isn't "red follows words like...". But a word order should still map to properties of the world, especially if those words are to be meaningful to a listener. Being red is "a surface reflects or transmits most of the light in the 600-800 nm spectrum and absorbs most of the rest". Of course, it won't do to just echo those tokens; once you've nailed down the concept of "red", you need to make sure that concepts like "reflects", "light", and "spectrum" are represented as well. It's an open question as to whether this sort of knowledge graph can be properly bootstrapped from a large volume of text descriptions, but I am strongly inclined to believe it can. If you dismiss it outright you're just begging the question.