by dredmorbius 1913 days ago
I assembled a similar decruftifier for the Washington Post specifically, using html-xml-utils (https://www.w3.org/Tools/HTML-XML-utils) and some sed/awk to extract only the core article content & metadata (headline, byline, dateline). The result was typically <5% of the original HTML.
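
For anyone wanting to try the idea without the shell tooling, here is a minimal sketch in Python's standard library. It is not the html-xml-utils/sed/awk pipeline described above; it assumes the article sits inside an `<article>` element, which will not hold for every site, and the names `ArticleExtractor` and `extract_article` are made up for illustration.

```python
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Collect text found inside <article>, skipping script/style content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # nesting depth of <article> elements
        self.skip = 0     # nesting depth of skipped elements
        self.chunks = []  # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1
        elif tag in self.SKIP:
            self.skip += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth:
            self.depth -= 1
        elif tag in self.SKIP and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        # Keep text only while inside <article> and outside script/style.
        if self.depth and not self.skip and data.strip():
            self.chunks.append(data.strip())


def extract_article(html):
    """Return the article text, one fragment per line (hypothetical helper)."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Comparing `len(extract_article(page))` with `len(page)` gives a rough size ratio for a given page.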

I've come to realise that most online commercial publishing does not even use bold within body text, so the presence of bold markup is another usable filter trigger for stripping cruft.
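
A rough sketch of that bold heuristic, again in stdlib Python: drop any `<p>` that contains `<b>` or `<strong>` and keep the rest. The names `BoldFilter` and `drop_bold_paragraphs` are hypothetical, and the rule will misfire on sites that do use bold in body copy, so treat it as one signal among several.

```python
from html.parser import HTMLParser


class BoldFilter(HTMLParser):
    """Keep the text of <p> elements that contain no <b> or <strong>."""

    BOLD = {"b", "strong"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_p = False      # currently inside a <p>
        self.has_bold = False  # current <p> contained bold markup
        self.buf = []          # text fragments of the current <p>
        self.kept = []         # paragraphs that passed the filter

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()      # an unclosed previous <p> ends here
            self.in_p = True
        elif tag in self.BOLD and self.in_p:
            self.has_bold = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self.in_p:
            self.buf.append(data)

    def _flush(self):
        # Normalise whitespace, keep the paragraph only if it had no bold.
        text = " ".join("".join(self.buf).split())
        if self.in_p and text and not self.has_bold:
            self.kept.append(text)
        self.in_p, self.has_bold, self.buf = False, False, []


def drop_bold_paragraphs(html):
    """Return the list of paragraph texts without bold markup."""
    parser = BoldFilter()
    parser.feed(html)
    parser.close()
    parser._flush()  # handle a trailing unclosed <p>
    return parser.kept
```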