by rty32
592 days ago
Nice! And glad to see it's MIT licensed. I wonder if it is feasible to use this as a replacement for p2k, Instapaper, etc. for the purpose of reading on Kindle. One annoyance with these services is that the rendering is way off -- h elements not showing up as headers, elements missing randomly, source code rendered incorrectly in 10 different ways. Some are better than others, but generally they are disappointing. (Yet they expect you to pay a subscription fee.) If this is an actively maintained project that welcomes contributions, I could test it out with various articles and report/fix issues. That said, I wonder how much work it would take to handle the edge cases of all the websites out there.
1) Convert HTML to Markdown
This is what my library specifically addresses, and I believe it handles this task robustly. There was a lot of testing involved. For example, I used the CommonCrawl Dataset to automatically catch edge cases.
2) Identify article content
This is the more challenging aspect. You need to identify and extract the main content while removing peripheral elements (navigation bars, sidebars, ads, etc.). Otherwise, the top of the Markdown document will, for example, be filled with links from the navbar.
Mozilla's "Readability" project (and its various ports) is the most widely used solution in this space. However, it relies on heuristic rules that need adjusting before they work well on every website.
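To make the two steps concrete, here is a rough sketch using only Python's standard library. It is neither the html-to-markdown library's API nor Readability's algorithm: the names `SKIP_TAGS`, `ArticleToMarkdown`, and `html_to_markdown` are made up for illustration, and the "heuristic" is simply an assumed list of tags to drop. It only handles headings and paragraphs.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as peripheral. Real pages need far
# richer rules (class names, link density, ...), as Readability shows.
SKIP_TAGS = {"nav", "aside", "header", "footer", "script", "style"}
HEADINGS = {f"h{level}": "#" * level for level in range(1, 7)}


class ArticleToMarkdown(HTMLParser):
    """Toy converter: drops peripheral elements, keeps headings and paragraphs."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.prefix = None   # Markdown prefix of the open block, None outside a block
        self.buffer = []     # text fragments of the open block
        self.blocks = []     # finished Markdown blocks

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and (tag in HEADINGS or tag == "p"):
            self.prefix = HEADINGS[tag] + " " if tag in HEADINGS else ""
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and self.prefix is not None and (tag in HEADINGS or tag == "p"):
            # Collapse whitespace and emit the block if it has any text.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.blocks.append(self.prefix + text)
            self.prefix = None

    def handle_data(self, data):
        if self.skip_depth == 0 and self.prefix is not None:
            self.buffer.append(data)


def html_to_markdown(html):
    """Step 2 (drop peripheral tags) and step 1 (emit Markdown) in one pass."""
    parser = ArticleToMarkdown()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)


sample = """
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<article>
  <h1>Title</h1>
  <p>First <b>paragraph</b>.</p>
</article>
<footer><p>Copyright</p></footer>
"""
print(html_to_markdown(sample))
```

A fixed tag list like this breaks as soon as a site puts its title inside `<header>` or builds its navbar from plain `<div>` elements, which is exactly why the content-identification step needs per-site tuning.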
---
The html-to-markdown project combined with some heuristics would be a great match! There is actually a comment below [1] about this topic. Feel free to contact me if you start this project; I would be happy to help!
[1] https://news.ycombinator.com/item?id=42094012