by atemerev 432 days ago
10 million records is a toy dataset. Usually, you can fit it in memory on a laptop.
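
A rough back-of-envelope check of that claim, as a sketch only. The 200 bytes per record is a made-up average (the benchmark's actual record size isn't given here), and `dataset_size_gib` is just a hypothetical helper for the arithmetic:

```python
def dataset_size_gib(num_records: int, avg_record_bytes: int) -> float:
    """Estimate raw payload size in GiB, ignoring index and per-object overhead."""
    return num_records * avg_record_bytes / 2**30


# Assumption: ~200 bytes per record (hypothetical; real records may be larger or smaller).
size = dataset_size_gib(10_000_000, 200)
print(f"{size:.2f} GiB")
```

Under that assumption the raw data comes to a bit under 2 GiB. Even with several times that in overhead, it would still plausibly fit in RAM on a typical laptop.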

There are open large(-ish) text datasets, like full Wikipedia or pre-2022 Reddit comments, that would work much better for benchmarking.