by rty32 592 days ago
Nice! And glad to see it's MIT licensed.

I wonder if it is feasible to use this as a replacement for p2k, Instapaper, etc. for the purpose of reading on Kindle. One annoyance with these services is that the rendering is way off -- h elements not showing up as headers, elements missing randomly, source code incorrectly rendered in 10 different ways. Some are better than others, but generally they are disappointing. (Yet they expect you to pay a subscription fee.) If this is an actively maintained project that welcomes contributions, I could test it out with various articles and report/fix issues. Although I wonder how much work it would take to handle the edge cases of all the websites out there.


There are two parts to it:

1) Convert HTML to Markdown

This is what my library specifically addresses, and I believe it handles this task robustly. There was a lot of testing involved. For example, I used the Common Crawl dataset to automatically catch edge cases.

2) Identify article content

This is the more challenging aspect. You need to identify and extract the main content while removing peripheral elements (navigation bars, sidebars, ads, etc.).

Otherwise, for example, the top of the Markdown document will be cluttered with links from the navbar.

Mozilla's "Readability" project (and its various ports) is the most widely used solution in this space. However, it relies on heuristic rules that need adjusting before they work well on every website.
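
To make the two parts concrete, here is a rough Python sketch using only the standard library. It is not the html-to-markdown library's API and not Readability's algorithm: the `SKIP_TAGS` set, the `MainTextToMarkdown` class, and `extract_markdown` are all made up for illustration, and the "heuristic" is just dropping everything inside a few tags that usually hold page chrome.

```python
from html.parser import HTMLParser

# Assumption for this sketch: these tags usually hold navigation or other page chrome.
SKIP_TAGS = {"nav", "aside", "footer", "script", "style"}
# Markdown prefixes for h1..h6.
HEADINGS = {f"h{n}": "#" * n for n in range(1, 7)}
# Tags that start a new text block (deliberately incomplete).
BLOCK_TAGS = {"p", "li", "div", "article", "section"}


class MainTextToMarkdown(HTMLParser):
    """Toy converter: skips chrome, keeps headings and paragraphs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # > 0 while inside a skipped tag
        self.blocks = []      # finished Markdown blocks
        self.current = []     # text pieces of the block being built
        self.prefix = ""      # e.g. "## " for an h2

    def _flush(self):
        # Collapse whitespace and store the block if it has any text.
        text = " ".join("".join(self.current).split())
        if text:
            self.blocks.append(self.prefix + text)
        self.current = []
        self.prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and (tag in HEADINGS or tag in BLOCK_TAGS):
            self._flush()
            if tag in HEADINGS:
                self.prefix = HEADINGS[tag] + " "

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and (tag in HEADINGS or tag in BLOCK_TAGS):
            self._flush()

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.current.append(data)


def extract_markdown(html):
    parser = MainTextToMarkdown()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text
    return "\n\n".join(parser.blocks)


page = (
    "<html><body><nav><a href='/'>Home</a> <a href='/about'>About</a></nav>"
    "<article><h1>Title</h1><p>Hello <b>world</b>.</p></article>"
    "<footer>Copyright</footer></body></html>"
)
print(extract_markdown(page))
```

On this sample page the navbar links and the footer are dropped and only the heading and paragraph survive. Real pages often put navigation in plain `div` elements, which is exactly why a fixed tag list like this is not enough and rule sets like Readability's keep needing per-site tuning.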

---

The html-to-markdown project in combination with some heuristics would be a great match! There is actually a comment below [1] about this topic. Feel free to contact me if you start this project, I would be happy to help!

[1] https://news.ycombinator.com/item?id=42094012

I'm working on a Textify API that collates elements based on the visible/running flow of text elements. It's not quite there yet, but is able to get the running content of HTML pages quite consistently. You can check it out here:

[0]: https://github.com/dleeftink/plainmark