I maintain something similar today, and I'm guessing that the OP uses some combination of the following libraries too (?):

- Readability (https://github.com/mozilla/readability) to strip the page's HTML down to a bare minimum.
- Turndown.js (https://github.com/mixmark-io/turndown) to convert the plain HTML to Markdown, with the GFM plugins enabled.
- Puppeteer (https://github.com/puppeteer/puppeteer) to download the page.

It costs me only several cents to parse an entire page, and I think OP can make some money out of this if they get the pricing right.

Also, some unsolicited feedback on the API:

- An option to enable/disable JavaScript would be great, since not all pages need it enabled to be parsable.
- You can probably tweak the headers of the headless browser to bypass the paywalls of some sites. Some are as simple as setting the user agent to a crawler bot (like `googlebot`).
- Maybe an option to fill in the front matter (https://jekyllrb.com/docs/front-matter/) with metadata given in the payload?
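To make the front matter suggestion a bit more concrete, here's a rough sketch of what I have in mind. Everything in it is hypothetical: `withFrontMatter`, `yamlScalar`, and the shape of the metadata object are names I made up for illustration, not anything from the OP's API. It assumes simple keys and flat values (strings, numbers, booleans, or lists of strings), and it leans on JSON string quoting being valid for YAML double-quoted scalars, which holds for ordinary text as far as I know.

```typescript
// Hypothetical sketch: none of these names come from the OP's API.
type FrontMatterValue = string | number | boolean | string[];

// Quote strings with JSON.stringify so characters like ":" or "#" cannot
// break the YAML; numbers and booleans are written as-is.
function yamlScalar(value: string | number | boolean): string {
  return typeof value === "string" ? JSON.stringify(value) : String(value);
}

// Prepend a Jekyll-style front matter block (between two "---" lines)
// to already-converted Markdown. Assumes keys are plain identifiers.
function withFrontMatter(
  markdown: string,
  meta: Record<string, FrontMatterValue>
): string {
  const lines = Object.entries(meta).map(([key, value]) =>
    Array.isArray(value)
      ? `${key}: [${value.map(yamlScalar).join(", ")}]`
      : `${key}: ${yamlScalar(value)}`
  );
  return ["---", ...lines, "---", "", markdown].join("\n");
}
```

The caller would pass something like `{ title: "Hello: world", tags: ["a", "b"] }` in the request payload, and the service would run the converted Markdown through `withFrontMatter` before returning it.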
Are you dividing the monthly hosting cost of a server by the total seconds spent actually running this tool? I'm thinking that if you did this with AWS Lambda it'd be free (maybe some bandwidth cost, but again, trivial) unless you had way, _way_ more use than a single person could reasonably generate. It would also be free if you used any of the free hosting services and were just doing it for a small number of users.
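Here's the back-of-the-envelope version of that comparison. This is only a sketch under assumptions: the function names are made up, the workload numbers ($5/month server, 100 pages a month, 10 seconds and 1 GB per page) are invented for illustration, and the Lambda figures are what I believe the published free tier and on-demand rates to be, so check the current pricing page before relying on them. Bandwidth is not modeled.

```typescript
// Assumed pricing inputs; verify against the current AWS pricing page.
interface LambdaPricing {
  freeRequests: number;     // free invocations per month
  freeGbSeconds: number;    // free compute per month, in GB-seconds
  pricePerRequest: number;  // USD per invocation beyond the free tier
  pricePerGbSecond: number; // USD per GB-second beyond the free tier
}

const assumedPricing: LambdaPricing = {
  freeRequests: 1_000_000,
  freeGbSeconds: 400_000,
  pricePerRequest: 0.0000002,
  pricePerGbSecond: 0.0000166667,
};

// Server cost per page when the whole monthly bill is divided by the
// seconds the tool was actually busy, then scaled by seconds per page.
function serverCostPerPage(
  monthlyServerCost: number,
  busySecondsPerMonth: number,
  secondsPerPage: number
): number {
  return (monthlyServerCost / busySecondsPerMonth) * secondsPerPage;
}

// Monthly Lambda bill: only usage above the free tier is charged.
function monthlyLambdaCost(
  invocations: number,
  secondsPerInvocation: number,
  memoryGb: number,
  pricing: LambdaPricing
): number {
  const gbSeconds = invocations * secondsPerInvocation * memoryGb;
  const billableGbSeconds = Math.max(0, gbSeconds - pricing.freeGbSeconds);
  const billableRequests = Math.max(0, invocations - pricing.freeRequests);
  return (
    billableGbSeconds * pricing.pricePerGbSecond +
    billableRequests * pricing.pricePerRequest
  );
}

// Illustrative workload: 100 pages a month, 10 s each, on a $5 server.
const perPageOnServer = serverCostPerPage(5, 100 * 10, 10);
const lambdaBill = monthlyLambdaCost(100, 10, 1, assumedPricing);
console.log({ perPageOnServer, lambdaBill });
```

With those made-up numbers the server works out to about five cents per page, while the same workload uses 1,000 GB-seconds on Lambda, far below the assumed 400,000 GB-second free allowance, so the compute bill is zero.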