Hacker News
by ACCount37 237 days ago
It is true. Datasets are somewhat cleaned, but only somewhat. When you have terabytes' worth of text, there's only so much cleaning you can do economically.