Hacker News
by franciscop 1376 days ago
Thanks for the feedback! So far I plan on making this a stepping stone for a fully integrated HN reader, where you can read the whole thing in-page; pages that cannot be parsed (paywalls etc.) would just redirect to the original. I prefer not to circumvent any barriers or hide the user agent for that, so in those cases I'd simply send you to the original page.

I should also find a better html-to-markdown parser, thanks for the recommendation there! As for the "example", yes, you guessed "readability" perfectly. And for downloading the page it's just fetch() + jsdom.

Suggestions:

- [JS]: I use fetch + jsdom, so no JS is executed at all! I've found that most content-heavy websites (articles, blog posts, etc.) are server-side rendered; I haven't tested too many, but so far there's been no issue without JS. I might move to puppeteer at some point, either for pages that fail to parse with jsdom or for a domain whitelist, if I end up keeping one.

- [header]: Already mentioned

- [Front matter]: Right now I'm actually returning two custom headers, `title` and `url`, and I might add more in the future. I did consider front matter, but I want to keep the body as "raw" as possible.

- Edit: what I'm considering next is an endpoint to download articles with basic HTML style, or as pdf/epub.
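
To make the response shape above a bit more concrete, here is a rough sketch of how the `title` and `url` headers and the redirect fallback could fit together. The function name `buildResponse` and the `{ title, markdown }` shape of the parsed article are assumptions for illustration, not the actual code behind the service.

```javascript
// Hypothetical helper: turns a parse result into a plain response description.
// `article` is assumed to be { title, markdown }, or null when parsing failed.
function buildResponse(article, originalUrl) {
  const hasBody =
    article && typeof article.markdown === "string" && article.markdown.trim() !== "";

  if (!hasBody) {
    // Nothing readable was extracted (paywall, JS-only page, etc.):
    // redirect to the original instead of working around the barrier.
    return { status: 302, headers: { location: originalUrl }, body: "" };
  }

  return {
    status: 200,
    headers: {
      "content-type": "text/markdown; charset=utf-8",
      // Header values must stay on one line, so collapse any line breaks.
      title: String(article.title || "").replace(/[\r\n]+/g, " ").trim(),
      url: originalUrl,
    },
    // The body stays "raw": no front matter, just the markdown.
    body: article.markdown,
  };
}
```

Keeping the metadata in headers means a client that only wants the markdown can use the body as-is, while one that wants the title can read it without parsing anything.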

1 comment

Could you automate redirecting to an archive link instead when a link points to a commonly paywalled site?
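
Something like this is what I have in mind, as a minimal sketch only. The host list is made up (placeholder domains), `resolveTarget` is a hypothetical name, and I'm assuming the Wayback Machine's `https://web.archive.org/web/<url>` form, which should send you to the most recent capture if one exists.

```javascript
// Hypothetical list of hosts treated as paywalled; these are placeholders.
const PAYWALLED_HOSTS = new Set(["paywalled.example", "news.example"]);

// Returns an archive link for known paywalled hosts, otherwise the link itself.
function resolveTarget(link) {
  // Compare hostnames without a leading "www." so both forms match.
  const host = new URL(link).hostname.replace(/^www\./, "");
  if (PAYWALLED_HOSTS.has(host)) {
    return "https://web.archive.org/web/" + link;
  }
  return link;
}
```

Links to hosts outside the list pass through untouched, so the default behaviour of redirecting to the original stays the same.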