|
> you haven't formulated any hypotheses that can be tested. You've just looked at the graph and "reckoned something"
Let's not be so hasty. I think I do put it as clearly as possible. I'm comparing essentially your Hyp1 and Hyp2. Hyp1 (aka the stochastic parrot), expressed a little more clearly, is that the LLM is learning an n-gram that produces correct sorts through rote memorization of statistical correlations in the training data, such as that sorted lists tend to start with '0', end with '99', and increase monotonically. Hyp2 is that the LLM's training molds it into representing an actual sorting algorithm that would correctly generalize to any input list.
> But it's really obviously false for natural language. The properties of the world are not the properties of word order... being red isn't "red follows words like..."
This is not really obviously false. Yes, being red isn't "red follows words like...". But a word order should still map to properties of the world, especially if those words are to be meaningful to a listener. Being red is "a surface reflects or transmits most of the light in the 600-800 nm spectrum and absorbs most of the rest". Of course, it won't do to just echo those tokens; once you've nailed down the concept of "red", you need to make sure that concepts like "reflects", "light", and "spectrum" are represented as well. It's an open question whether this sort of knowledge graph can be properly bootstrapped from a large volume of text descriptions, but I am strongly inclined to believe it can. If you dismiss it outright, you're just begging the question. |
Redness is not in the structure of those sentences. And there will always be infinitely many sentences that are true but cannot be inferred by an LLM, yet can be inferred trivially by a person acquainted with redness.
In any case, I'd need more time than I have at the moment to state Hyp1 seriously for your case. For now, I can say that because the data itself has the property, Hyp1 becomes much harder to state and the argument much subtler.
After all, what is a "statistical distribution" of "ordinals" anyway? And how much memory is required to represent it? My sense is that this distribution has highly redundant features that are trivially compressible without learning any "sorting algorithm".
From a quick glance at your article, it feels like you haven't formulated Hyp1 correctly: P(CorrectSort | f(HistoricalCases)) can perhaps be made arbitrarily high if some statistical f() is chosen well.
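To give a rough idea of the kind of f() I have in mind, here is a toy sketch. It is purely illustrative and makes no claim about what any LLM actually does internally. The function names (`learn_precedence`, `statistical_sort`) and the setup (tokens 0..99, lists of length 10, 2000 training examples) are my own assumptions, not anything taken from the article. All it does is count, over previously seen sorted outputs, which token tends to appear before which, and then place each input token according to those counts.

```python
import random
from collections import Counter


def learn_precedence(sorted_examples):
    """Count how often token a was seen before token b in the training outputs."""
    before = Counter()
    for seq in sorted_examples:
        for i, a in enumerate(seq):
            for b in seq[i + 1:]:
                if a != b:
                    before[(a, b)] += 1
    return before


def statistical_sort(items, before):
    """Place each item by how many other items were mostly seen ahead of it.

    No numeric comparison between items is performed; only the co-occurrence
    counts are consulted. Pairs never seen together in training carry no
    signal, so the result can be wrong when coverage is poor.
    """
    buckets = {}
    for x in items:
        rank = sum(1 for y in items if before[(y, x)] > before[(x, y)])
        buckets.setdefault(rank, []).append(x)
    return [x for rank in range(len(items)) for x in buckets.get(rank, [])]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical "historical cases": sorted lists over the tokens 0..99.
    train = [sorted(rng.choices(range(100), k=10)) for _ in range(2000)]
    before = learn_precedence(train)

    # Fresh lists, almost certainly not present verbatim in the training set.
    tests = [rng.choices(range(100), k=10) for _ in range(200)]
    hits = sum(statistical_sort(t, before) == sorted(t) for t in tests)
    print(f"{hits}/{len(tests)} unseen lists sorted correctly")
```

The table here has at most 100 x 100 entries, and it is itself highly redundant, since it could be collapsed into a single ranking of 100 tokens. Whether you call the result "statistics over historical cases" or "a sorting algorithm" is exactly where I think stating Hyp1 gets subtle.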