
by saltymimir 1375 days ago

I maintain something similar today, and I'm guessing that the OP uses some combination of the following libraries too (?):

- Readability (https://github.com/mozilla/readability) to strip down the page's HTML to a bare minimum.

- Turndown.js (https://github.com/mixmark-io/turndown) to convert the plain HTML to a markdown format with the GFM plugins enabled.

- Puppeteer (https://github.com/puppeteer/puppeteer) to download the page.

It costs me only several cents to parse an entire page, and I think OP can make some money out of this if they get the pricing right.

Also, some unsolicited feedback on the API:

- An option to enable/disable JavaScript would be great, since not all pages actually need it enabled to be parsable.

- You can probably tweak the headers of the headless browser to bypass the paywalls of some sites. Some are as simple as setting the user agent to a crawler bot (like `googlebot`).

- Maybe an option to fill in the front matter (https://jekyllrb.com/docs/front-matter/) with metadata given in the payload?
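
For illustration, the front matter suggestion could look roughly like the sketch below. It is not the OP's API: `with_front_matter` and the shape of the `metadata` payload are hypothetical, and values are serialized with `json.dumps` on the assumption that the YAML parser reading the front matter accepts JSON-style quoted strings and flow lists.

```python
import json
from typing import Any, Dict


def with_front_matter(markdown: str, metadata: Dict[str, Any]) -> str:
    """Prepend a front matter block built from `metadata` to `markdown`.

    Hypothetical helper: the key names and payload shape are assumptions.
    """
    if not metadata:
        # Nothing to add, so keep the body as "raw" as possible.
        return markdown
    lines = ["---"]
    for key, value in metadata.items():
        # json.dumps quotes strings, so values containing ":" stay unambiguous.
        lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + markdown


document = with_front_matter(
    "# Hello\n",
    {"title": "Hello: world", "tags": ["a", "b"]},
)
```

With an empty payload the body comes back unchanged, so the option could stay off by default.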

2 comments

masukomi 1375 days ago

Can you expand on this statement "It costs me only several cents to parse an entire page"? That sounds like quite a lot to me. We're talking _maybe_ a few seconds of compute time (if things are really slow) + a trivial amount of bandwidth.

Are you dividing the monthly hosting costs for a server by total seconds spent actually running this tool? I'm thinking if you did this with an AWS Lambda it'd be free (maybe some bandwidth cost, but again, trivial) unless you had way, _way_ more use than a single person could reasonably generate. It'd also be free if you used any of the free hosting services and were just doing it for a small number of users.
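
To make the comparison concrete, here is a small sketch of the two ways of accounting for the cost per parse. Every number in it is a made-up illustration (a $5/month server, 100 parses a month, a 2-second run at a hypothetical metered rate), not a quoted price from any provider, and both function names are hypothetical.

```python
def amortized_cost_per_parse(monthly_server_cost: float, parses_per_month: int) -> float:
    """Fixed monthly bill spread over however many parses actually ran."""
    if parses_per_month <= 0:
        raise ValueError("parses_per_month must be positive")
    return monthly_server_cost / parses_per_month


def metered_cost_per_parse(seconds: float, memory_gb: float, price_per_gb_second: float) -> float:
    """Pay-per-use model: billed only for the compute time of one parse."""
    return seconds * memory_gb * price_per_gb_second


# Illustrative inputs only (assumptions, not real pricing).
amortized = amortized_cost_per_parse(monthly_server_cost=5.0, parses_per_month=100)
metered = metered_cost_per_parse(seconds=2.0, memory_gb=1.0, price_per_gb_second=0.00002)
```

Under these assumed inputs the amortized figure is 5 cents per parse, while the metered figure is a small fraction of a cent, which is the gap the question is pointing at.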

franciscop 1375 days ago

OP here. I've added server timing headers to https://content-parser.com/, and the total fetch+parse is taking me around 0.6-1.2s. The local parsing, as a separate step, is synchronous, so I expected it to be negligible, but it actually takes a good chunk, often 500-700ms! That's a lot more than I expected. I haven't seen any backend errors yet, but at some point I might have to move this to a different thread or similar.
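
A minimal sketch of how per-step timings like these could be measured and reported, assuming the "server timing headers" refer to the standard `Server-Timing` response header. The `timed` and `server_timing` helpers are hypothetical names, and the lambdas are stand-ins for the real fetch and parse steps.

```python
import time
from typing import Callable, Dict, Tuple, TypeVar

T = TypeVar("T")


def timed(step: Callable[[], T]) -> Tuple[T, float]:
    """Run `step` and return its result plus the elapsed time in milliseconds."""
    start = time.perf_counter()
    result = step()
    return result, (time.perf_counter() - start) * 1000.0


def server_timing(durations_ms: Dict[str, float]) -> str:
    """Format metrics as a Server-Timing value, e.g. "fetch;dur=1.0, parse;dur=2.0"."""
    return ", ".join(f"{name};dur={ms:.1f}" for name, ms in durations_ms.items())


# Placeholder steps (assumptions): swap in the real download and parse calls.
html, fetch_ms = timed(lambda: "<html>...</html>")
text, parse_ms = timed(lambda: html.upper())
header_value = server_timing({"fetch": fetch_ms, "parse": parse_ms})
```

Timing the fetch and the parse separately is what makes a surprise like a 500-700ms synchronous parse visible in the first place.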

franciscop 1375 days ago

Thanks for the feedback! So far I plan on making this a stepping stone for a fully integrated HN reader, where you can read the whole thing in-page; pages that cannot be parsed (paywalls, etc.) would just redirect to the original. I prefer not to circumvent any barriers or hide the user agent for that, so in my situation I'd rather just redirect to the original.

I should also find a better html-to-markdown parser, thanks for the recommendation there! From the "example", yes, you guessed "readability" perfectly. And for downloading the page I just use fetch() + jsdom.

Suggestions:

- [JS]: I use fetch+jsdom, so no JS is run at all! I've found most content-heavy pages (articles, blog posts, etc.) are server-side-rendered. I haven't checked too many, but so far no issues without JS. I might move to puppeteer at some point, either for parses that fail with jsdom or for a domain whitelist if I end up keeping one.

- [header]: Already mentioned

- [Front matter]: Right now I'm actually returning two custom headers, `title` and `url`, and I might add more in the future. I did consider front-matter, but I want to keep the body as "raw" as possible.

- Edit: what I'm considering next is an endpoint to download articles with basic HTML style, or as pdf/epub.
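
The fallback idea from the [JS] point above could be sketched like this. It is only an outline of the strategy described there, not the OP's code: `parse_with_fallback`, `static_parse`, `rendered_parse`, and `render_hosts` are hypothetical names, and the two parsers are passed in as plain callables standing in for a fetch+jsdom path and a headless-browser path.

```python
from typing import Callable, FrozenSet, Optional
from urllib.parse import urlsplit

Parser = Callable[[str], Optional[str]]


def parse_with_fallback(
    url: str,
    static_parse: Parser,
    rendered_parse: Parser,
    render_hosts: FrozenSet[str] = frozenset(),
) -> Optional[str]:
    """Try the cheap no-JS parse first and fall back to a rendered parse.

    Hosts listed in `render_hosts` (the "domain whitelist") skip the
    static attempt and go straight to the rendered path.
    """
    host = (urlsplit(url).hostname or "").lower()
    if host not in render_hosts:
        article = static_parse(url)
        if article:  # None or "" counts as a failed parse
            return article
    return rendered_parse(url)
```

Keeping the rendered path as a fallback means the slower headless browser only runs for the pages that actually need it.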

neodymiumphish 1375 days ago

Could you automate browsing to an archive link instead when a link points to a commonly paywalled site?
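
One way this could work, as a rough sketch: keep a list of hosts known to be paywalled and rewrite matching links before fetching. The host list and the `reader_target` name are hypothetical, and the `https://web.archive.org/web/<url>` form is used on the assumption that the Wayback Machine resolves it to a recent capture of the page.

```python
from typing import AbstractSet
from urllib.parse import urlsplit

# Hypothetical host list: a real one would have to be curated by hand.
PAYWALLED_HOSTS = frozenset({"paywalled.example"})


def reader_target(url: str, paywalled_hosts: AbstractSet[str] = PAYWALLED_HOSTS) -> str:
    """Return an archive link for listed hosts, or the original URL otherwise."""
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    if host in paywalled_hosts:
        # Assumed URL form for the Wayback Machine's latest capture.
        return "https://web.archive.org/web/" + url
    return url
```

Links to hosts outside the list pass through untouched, so the default behavior of redirecting to the original stays the same.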
