Hacker News
by bnewbold 1038 days ago
This tool is great for robustly handling content in old and poorly formatted HTML. There are a lot of similar tools for extracting "the main text" from free-form HTML, but this was the most reliable in my experience, especially when dealing with web archives containing hand-written HTML dating back to the 1990s, working with non-English languages, etc.