by sieszpak 743 days ago
I agree, the information provided in this article is a treasure. Maybe someone will add some "magic sauce" to it?
1 comment

The deduplication discussion shows they don't filter out ads as part of their cleaning. I appreciate this could be risky, and perhaps a huge processing step given the dataset sizes, but intuitively it feels like it would cut the noise dramatically and thus help the signal within the datasets.