Hacker News
by mrskitch 2059 days ago
Thanks for the mention! I'm the founder of browserless.io, and I agree with pretty much everything you're saying.

Our infrastructure actually follows this procedure for some of our own scraping needs: we scrape puppeteer's GitHub documentation page to build out our debugger's autocomplete tool. To do this, we "goto" the page, extract the page's content, and then hand it off to Node.js libraries for parsing. This has two benefits: it cuts down the time you have the browser open and running, and it lets you "offload" some of that work to your back-end, where more sophisticated libraries are available. You get the best of both worlds with this approach, and it's one we generally recommend to folks. It's also a great way for us to "dogfood" our own product :)
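
For anyone who wants to see the shape of that flow, here is a rough TypeScript sketch. It is not browserless's actual code. The names `BrowserLike`, `PageLike`, `fetchRenderedHtml`, and `extractHeadings` are made up for illustration, the choice of `<h3>`/`<h4>` headings is an assumption about what an autocomplete tool might want, and the regex is only a stand-in for whatever real HTML parsing library you would use on the back end. The browser is passed in as a parameter so the sketch only relies on three calls that Puppeteer does provide: `browser.newPage()`, `page.goto(url)`, and `page.content()`.

```typescript
// Minimal structural types covering only the calls used below.
// (Hypothetical names; Puppeteer's Browser and Page expose these methods.)
interface PageLike {
  goto(url: string): Promise<unknown>;
  content(): Promise<string>;
}

interface BrowserLike {
  newPage(): Promise<PageLike>;
  close(): Promise<void>;
}

// Step 1: keep the browser open only long enough to grab the rendered HTML.
async function fetchRenderedHtml(browser: BrowserLike, url: string): Promise<string> {
  try {
    const page = await browser.newPage();
    await page.goto(url);        // the "goto" step
    return await page.content(); // serialized HTML of the loaded page
  } finally {
    await browser.close();       // release the browser before any parsing happens
  }
}

// Step 2: parse on the back end, after the browser is gone.
// Stand-in parser (assumption): collect the text of <h3>/<h4> headings.
function extractHeadings(html: string): string[] {
  const headings: string[] = [];
  const pattern = /<h[34][^>]*>([\s\S]*?)<\/h[34]>/gi;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    // Drop any nested tags such as <code> or <a>, then trim whitespace.
    const text = (match[1] ?? "").replace(/<[^>]+>/g, "").trim();
    if (text.length > 0) headings.push(text);
  }
  return headings;
}
```

With Puppeteer installed, something like `extractHeadings(await fetchRenderedHtml(await puppeteer.launch(), url))` should fit these types, and `puppeteer.connect({ browserWSEndpoint })` would be the equivalent for a remote browser. The point of the split is that everything in `extractHeadings` runs after `browser.close()`.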

Why not just fetch the page content directly with an HTTP request? Does a headless browser provide some benefit in your case?
Yes: often JS does some kind of data-fetching, API calls, or whatever else to render a full page (single-page apps, for instance). With GitHub being mostly just HTML markup and not needing a JS runtime, we could definitely have gone that route. The rationale was that we wanted to use our product ourselves, to gain better insight into what our users do and become more empathetic to their cause.

In short: we wanted to dogfood the product, at the cost of some time and machine resources.