Newer large datasets like the ones used here optimize for diversity (e.g., SlimPajama is a heavily deduplicated dataset).