The point of using number sorting for this paper is that it's
A) difficult to impossible for an LLM to do in a single pass, and
B) easy to verify for correctness.
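To make the "easy to verify" part concrete, here is a minimal sketch of such a check, assuming the model's output has already been parsed into a list of numbers. The function name `is_valid_sort` is made up for illustration and is not from the paper.

```python
from collections import Counter


def is_valid_sort(original, candidate):
    """Return True if `candidate` is a correctly sorted version of `original`.

    Hypothetical helper for illustration only. Two conditions must hold:
    the candidate is in non-decreasing order, and it contains exactly
    the same items (with the same multiplicities) as the input.
    """
    # Every adjacent pair must be in non-decreasing order.
    in_order = all(a <= b for a, b in zip(candidate, candidate[1:]))
    # No items dropped, duplicated, or invented.
    same_items = Counter(original) == Counter(candidate)
    return in_order and same_items
```

Both conditions matter: an output can be perfectly ordered and still be wrong if it drops or repeats an element, so checking order alone is not enough.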
In general, the point isn't finding things that only an LLM can do, but finding things that LLMs can do with decent results at lower cost than getting a human to do them.
It is only difficult for an LLM to sort a list of numbers if the list is longer than half of the context window. (Source: I tested this myself.[1]) The sorts are not error-free every time, but with sufficient training they become error-free the vast majority of the time, even for long lists. This is not especially surprising, because transformers are capable of directly representing sorting programs.[2]
Of course you can train a neural network to sort numbers, but I'm talking about a general LLM which hasn't been trained to sort numbers specifically. A GPT network trained to sort numbers is not what I would consider a Large Language Model.
I don't think efficiency is important at this point. Finding that it's possible "this way" opens the door for more work and more applications. (Which doesn't prevent others from already working on efficiency.)