| HN Mirror



	by dredmorbius 1913 days ago
	I assembled a similar decruftifier for the Washington Post specifically, using html-xml-utils (https://www.w3.org/Tools/HTML-XML-utils) and some sed/awk to strip pages down to only the core article content & metadata (head, byline, dateline). The result was typically <5% of the original HTML. I've come to realise that most online commercial publishing does not even use bold within body text, which gives another filter trigger: bold inside a paragraph is a good sign that the paragraph is cruft.
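
For illustration only, here is a rough sketch of the same idea in Python using just the standard library. It is not the html-xml-utils/sed/awk pipeline described above. It assumes the story body sits inside an `<article>` element (an assumption about the page markup, which varies by site), and the names `ArticleExtractor` and `extract_article` are made up for this example. It keeps paragraph text and drops any paragraph containing `<b>` or `<strong>`, per the bold heuristic.

```python
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Collect <p> text inside <article>, skipping paragraphs that contain bold."""

    def __init__(self):
        super().__init__()
        self.article_depth = 0    # > 0 while inside an <article> element
        self.in_paragraph = False
        self.has_bold = False     # set when the current paragraph contains <b>/<strong>
        self.buffer = []          # text fragments of the current paragraph
        self.paragraphs = []      # kept paragraphs, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag == "p" and self.article_depth:
            self.in_paragraph = True
            self.has_bold = False
            self.buffer = []
        elif tag in ("b", "strong") and self.in_paragraph:
            self.has_bold = True

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth:
            self.article_depth -= 1
        elif tag == "p" and self.in_paragraph:
            # Collapse runs of whitespace so the output is one clean line.
            text = " ".join("".join(self.buffer).split())
            if text and not self.has_bold:
                self.paragraphs.append(text)
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.buffer.append(data)


def extract_article(html):
    """Return the list of kept paragraph strings from an HTML document."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


if __name__ == "__main__":
    sample = """
    <html><body>
    <nav><p>Home | Sections</p></nav>
    <article>
    <p>First paragraph of the story.</p>
    <p><b>Read more:</b> a related link</p>
    <p>Second   paragraph.</p>
    </article>
    <footer><p>Subscribe now</p></footer>
    </body></html>
    """
    for paragraph in extract_article(sample):
        print(paragraph)
```

On real pages the bold test would probably need tuning, since a site that does use bold in body text would lose legitimate paragraphs.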