| HN Mirror

That's correct, and their recent work on natural language autoencoders has given extremely compelling evidence of that...which is why their data collection practices for pre-training have almost certainly evolved, particularly since they've already scraped most of the internet.