| > Can you provide more details about what you mean by this distributional structure The distribution of sorted digits is: (0 1 2 3 4 5 6 7 8 9) before (1 before 0 1 2 3 4 5 6 7 8 9) before (2 before 0 1 2 3 4 5 6 7 8 9) before (3 before 0 1 2 3 4 5 6 7 8 9) ... ... When you compute the search space you're treating each number as a unique token (ie., that all ordinals are unique) -- but its not sorting unique ordinals, it's sorting digits in a sequential model ie., it learns P(Next|Prev) The (sequential) distribution of digits amongst sorted numbers is tiny |
This is why 10^80 random lists gets reduced to only 10^36 sorted lists. However, 10^36 is still very large with respect to the size of the model.