There are open large(-ish) text datasets like full Wikipedia or pre-2022 Reddit comments, that would work much better for benchmarking.