| There are two parts to it: 1) Convert HTML to Markdown. This is what my library specifically addresses, and I believe it handles this task robustly. There was a lot of testing involved; for example, I used the CommonCrawl dataset to automatically catch edge cases. 2) Identify the article content. This is the more challenging part. You need to identify and extract the main content while removing peripheral elements (navigation bars, sidebars, ads, etc.). Otherwise, for example, the top of the Markdown document will be full of links from the navbar. Mozilla's "Readability" project (and its various ports) is the most used solution in this space. However, it relies on heuristic rules that need adjusting to work on every website. --- The html-to-markdown project in combination with some heuristics would be a great match! There is actually a comment below [1] about this topic. Feel free to contact me if you start this project, I would be happy to help! [1] https://news.ycombinator.com/item?id=42094012 |
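
To make the second part a bit more concrete, here is a minimal sketch of the kind of heuristic meant above: drop elements that usually hold peripheral content before handing the HTML to a converter. This is not how Readability works and it is not part of html-to-markdown; the `SKIP_TAGS` set and the `strip_boilerplate` helper are hypothetical names chosen for illustration, and real pages would need more rules than a fixed tag list.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: these tags usually wrap peripheral content (navbar, sidebar, ...).
# A real heuristic would also look at class names, link density, text length, etc.
SKIP_TAGS = {"nav", "aside", "header", "footer", "script", "style"}


class MainContentFilter(HTMLParser):
    """Re-emits the HTML it is fed, minus everything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True: text arrives unescaped
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            # Keep the tag exactly as written, attributes included.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit once, never change the depth.
        if tag not in SKIP_TAGS and self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Re-escape, because the parser already decoded entities.
            self.out.append(escape(data, quote=False))


def strip_boilerplate(html_doc: str) -> str:
    """Hypothetical helper: return html_doc without the SKIP_TAGS elements."""
    parser = MainContentFilter()
    parser.feed(html_doc)
    parser.close()  # flush any buffered text
    return "".join(parser.out)


if __name__ == "__main__":
    page = (
        '<body><nav><a href="/">Home</a></nav>'
        "<article><h1>Title</h1><p>Body &amp; text</p></article>"
        "<footer>c</footer></body>"
    )
    print(strip_boilerplate(page))
```

The cleaned HTML would then go through the HTML-to-Markdown step, so the navbar links never reach the top of the Markdown document. A fixed tag list like this only works on sites that use semantic elements, which is exactly why the heuristic rules need adjusting per website.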
[0]: https://github.com/dleeftink/plainmark