by JohannesKauf 592 days ago
There are two parts to it:

1) Convert HTML to Markdown

This is what my library specifically addresses, and I believe it handles this task robustly. There was a lot of testing involved. For example, I used the CommonCrawl Dataset to automatically catch edge cases.

2) Identify article content

This is the more challenging aspect. You need to identify and extract the main content while removing peripheral elements (navigation bars, sidebars, ads, etc.)

Otherwise, the top of the Markdown document ends up cluttered with links from the navbar, for example.

Mozilla's "Readability" project (and its various ports) is the most widely used solution in this space. However, it relies on heuristic rules, and those rules need adjusting before they work on every website.
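
To make the idea concrete, here is a rough sketch of about the simplest heuristic you could run before the conversion step: drop elements whose tag name usually means "not the article". This is not how Readability works and it is not part of html-to-markdown. The names `SKIP_TAGS`, `MainContentFilter` and `strip_peripheral` are made up for illustration, and the tag list itself is an assumption that will be wrong for plenty of sites.

```python
import html
from html.parser import HTMLParser

# Assumption: these tags usually wrap peripheral content, not the article.
SKIP_TAGS = {"nav", "aside", "footer", "script", "style"}


class MainContentFilter(HTMLParser):
    """Re-emits the HTML it is fed, minus anything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # > 0 while inside at least one skipped element
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, with no end tag.
        if self.skip_depth == 0 and tag not in SKIP_TAGS:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # The parser unescaped the text, so escape it again on the way out.
            self.out.append(html.escape(data, quote=False))


def strip_peripheral(html_text: str) -> str:
    parser = MainContentFilter()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return "".join(parser.out)
```

The output is still HTML, so it can be passed to whatever HTML-to-Markdown converter you use. Only the skipped tags are counted for nesting, so unclosed `<li>` or `<p>` tags inside a navbar do not throw the filter off. Comments and the doctype are dropped, and a navbar built from plain `<div>`s passes straight through, which is exactly where the real heuristics have to take over.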

---

The html-to-markdown project in combination with some heuristics would be a great match! There is actually a comment below [1] about this topic. Feel free to contact me if you start this project, I would be happy to help!

[1] https://news.ycombinator.com/item?id=42094012

1 comment

I'm working on a Textify API that collates elements based on the visible/running flow of text elements. It's not quite there yet, but it can already get the running content of HTML pages quite consistently. You can check it out here:

[0]: https://github.com/dleeftink/plainmark