Hacker News
by mike_hearn 1291 days ago
Just getting plain text out of the web without getting flooded with boilerplate, noise, SEO spam, duplication, infinite pages like calendars, etc. is already a hard data engineering problem.
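To make two of those problems concrete, here is a deliberately naive sketch using only the Python standard library. It drops text inside a few tags that usually hold boilerplate, then discards pages whose remaining text is an exact duplicate after normalization. The tag list and the names `extract_text`, `fingerprint`, and `dedupe` are assumptions made for illustration, not anyone's production pipeline. Near-duplicates, SEO spam, and infinite calendar-style pages would all get past it, which is rather the point.

```python
import hashlib
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate (an assumption, not a standard).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collects text that is not nested inside any SKIP_TAGS element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many SKIP_TAGS elements we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Collapse internal whitespace and ignore whitespace-only runs.
        cleaned = " ".join(data.split())
        if self.skip_depth == 0 and cleaned:
            self.chunks.append(cleaned)


def extract_text(html):
    """Return the visible text of a page with the skipped tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fingerprint(text):
    """Hash of the lowercased, whitespace-normalized text (exact-match only)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe(pages):
    """Keep the extracted text of the first page seen for each fingerprint."""
    seen = set()
    unique = []
    for html in pages:
        text = extract_text(html)
        key = fingerprint(text)
        if text and key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

A fixed tag list misses boilerplate that lives in plain `div` elements, and exact hashing misses pages that differ by a timestamp or an ad slot, so a real crawler would need content-density heuristics and near-duplicate detection on top of this.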