by dredmorbius
1913 days ago
I assembled a similar decruftifier specifically for the Washington Post, using html-xml-utils (https://www.w3.org/Tools/HTML-XML-utils) plus some sed/awk to keep only the core article content and metadata (headline, byline, dateline). The result was typically <5% of the original HTML. I've come to realise that most online commercial publishing does not even use bold within body text, which gives another filter trigger for stripping cruft.
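
For anyone wanting to try the idea without html-xml-utils, here is a rough sketch in Python's standard library. It is not the pipeline described above: the tag sets (`KEEP`, `SKIP`, `BOLD`), the class `ArticleExtractor` and the function `extract_article` are all hypothetical names and assumptions, and a real site would need its own selectors. It keeps headline and paragraph text, skips obvious non-article containers, and drops any paragraph that contains bold markup, on the assumption that bold signals cruft.

```python
from html.parser import HTMLParser

# Assumed tag sets; tune per site. None of these come from the original tool.
KEEP = {"h1", "p"}                                   # headline and body paragraphs
SKIP = {"script", "style", "nav", "aside", "footer", "form"}
BOLD = {"b", "strong"}                               # bold inside a paragraph = cruft


class ArticleExtractor(HTMLParser):
    """Collects (tag, text) pairs for kept blocks outside skipped containers."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._skip_depth = 0      # > 0 while inside a SKIP container
        self._current = None      # tag of the block being collected
        self._buf = []
        self._has_bold = False

    def _flush(self):
        # Finish the current block; keep it unless empty or flagged as bold.
        if self._current is not None:
            text = " ".join("".join(self._buf).split())
            if text and not self._has_bold:
                self.blocks.append((self._current, text))
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1
        elif self._skip_depth == 0:
            if tag in KEEP:
                self._flush()     # copes with paragraphs that were never closed
                self._current, self._buf, self._has_bold = tag, [], False
            elif tag in BOLD and self._current == "p":
                self._has_bold = True

    def handle_endtag(self, tag):
        if tag in SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == self._current and self._skip_depth == 0:
            self._flush()

    def handle_data(self, data):
        if self._current is not None and self._skip_depth == 0:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_article(html):
    """Return the kept (tag, text) blocks from an HTML string."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Feeding a saved page through `extract_article(page_html)` and comparing the length of the joined text against `len(page_html)` gives a quick check on how much of the page was cruft. The bold filter is deliberately blunt: it throws away the whole paragraph, which suits promo and subscription boxes but would also lose a legitimate paragraph that happens to use bold.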