| HN Mirror

I'd say the test is still not quite valid, and more of in between the original "valid" task and "guess what LLM would say" as suggested in another comment here. The reason is: it might be easier for LLMs to choose the completion out of their own generated variants (1) than the real token distribution.

1. perhaps even out of variants generated by other LLMs