Hacker News
by vmatouch 2059 days ago
For more generic web indexing you need to use a browser. You no longer index pages served by a server; you index pages rendered by JavaScript apps in the browser. So as part of the "fetch" stage I usually leave parsing of the title and other page metadata to a JavaScript script running inside the browser (using https://www.browserless.io/), and then as part of the "parse" phase I use cheerio to extract links and such. It is very tempting to do everything in the browser, but architecturally it does not belong there. So you need to find the balance that works best for you.
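
To make the "parse" phase concrete, here is a minimal sketch of link extraction running in Node.js on HTML that the fetch stage has already rendered. It is an illustration under assumptions, not the setup described above: `extractLinks` is a hypothetical helper, and a simple regex stands in for cheerio so the snippet has no dependencies. A real HTML parser would be more robust.

```javascript
// Parse stage: runs in Node.js on HTML that the fetch stage already rendered.
// The regex is a dependency-free stand-in for an HTML parser such as cheerio;
// it only handles quoted href attributes on <a> tags.
function extractLinks(html, baseUrl) {
  const links = new Set(); // de-duplicates while keeping first-seen order
  const anchorHref = /<a\b[^>]*?\shref\s*=\s*(["'])(.*?)\1/gi;
  for (const match of html.matchAll(anchorHref)) {
    try {
      const url = new URL(match[2], baseUrl); // resolve relative links
      if (url.protocol === 'http:' || url.protocol === 'https:') {
        url.hash = ''; // fragments point at the same document
        links.add(url.href);
      }
    } catch {
      // Skip hrefs that cannot be parsed as URLs.
    }
  }
  return [...links];
}
```

Calling `extractLinks(renderedHtml, pageUrl)` returns absolute http(s) URLs that can be fed back into the crawl queue, which keeps that work out of the browser.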
3 comments

Thanks for the mention! I'm the founder of browserless.io, and agree with pretty much everything you're saying.

Our infrastructure actually follows this procedure for some of our scraping needs: we scrape puppeteer's GH documentation page to build out our debugger's autocomplete tool. To do this, we "goto" the page, extract the page's content, and then hand it off to nodejs libraries for parsing. This has two benefits: it cuts down the time you have the browser open and running, and lets you "offload" some of that work to your back-end with more sophisticated libraries. You get the best of both worlds with this approach, and it's one we generally recommend to folks everywhere. It's also a great way for us to "dogfood" our own product :)
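
A rough sketch of that goto / extract / hand-off flow, for illustration only (this is not browserless's actual code). It assumes a Puppeteer-style browser object with `newPage()`, `page.goto()`, `page.content()` and `close()`. `launchBrowser`, `fetchRenderedHtml`, `parseTitle` and `scrape` are hypothetical names, and the title regex is just a placeholder for whatever Node.js parsing library you prefer.

```javascript
// Fetch stage: keep the browser open only long enough to get the rendered HTML.
// `launchBrowser` is a hypothetical injected function; with Puppeteer it could
// be something like () => puppeteer.launch().
async function fetchRenderedHtml(launchBrowser, url) {
  const browser = await launchBrowser();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle0' });
    return await page.content(); // serialized DOM after scripts have run
  } finally {
    await browser.close(); // release the browser before any parsing happens
  }
}

// Parse stage: plain Node.js work on a string, no browser involved.
// Placeholder parser; a real pipeline would hand off to a proper library.
function parseTitle(html) {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].trim() : null;
}

async function scrape(launchBrowser, url) {
  const html = await fetchRenderedHtml(launchBrowser, url);
  return { url, title: parseTitle(html) };
}
```

Injecting `launchBrowser` keeps the sketch independent of how the browser is obtained (local launch or a remote endpoint), and the `finally` block means the browser is closed even if navigation fails.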

What is the reason you are not just getting the page content directly with an HTTP request? Does a headless browser provide some benefit in your case?
Yes: it's often the case that JS does some kind of data-fetching, API calls, or whatever else to render a full page (single-page apps, for instance). With GitHub being mostly just HTML markup and not needing a JS runtime, we could definitely have gone that route. The rationale was that we wanted to use our product ourselves, to gain better insight into what our users do and become more empathetic to their cause.

In short: we wanted to dogfood the product at the cost of some time and machine resources.

Maintainer of jsdom here. jsdom will run the JavaScript on a page, so it can get you pretty far in this regard without a proper browser. It has some definite limitations, most notably that it doesn't do any layout or handling of client-side redirects, but it allows scraping of most single-page client-side-rendered apps.
Not necessarily. It is true that most websites today are JavaScript-heavy. However, they are server-side rendered more often than not, mostly for performance reasons. Also, not all search engines are as good as Google at indexing dynamic JS websites, so it's better to serve pre-rendered HTML for that reason as well.
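
One way a crawler could take advantage of this is to try a plain HTTP fetch first and only fall back to a browser when the response looks like an empty app shell. The sketch below is a rough heuristic under assumptions, not an established method: `looksServerRendered` is a hypothetical helper and the 200-character threshold is arbitrary.

```javascript
// Heuristic only: guess whether HTML fetched over plain HTTP already contains
// the page content, or whether it is an empty shell that needs a JS runtime.
// The default threshold of 200 characters is an arbitrary assumption.
function looksServerRendered(html, minTextLength = 200) {
  const bodyMatch = /<body[^>]*>([\s\S]*)<\/body>/i.exec(html);
  const body = bodyMatch ? bodyMatch[1] : html;
  const visibleText = body
    .replace(/<(script|style|noscript)\b[\s\S]*?<\/\1>/gi, ' ') // drop non-content blocks
    .replace(/<[^>]+>/g, ' ') // drop remaining tags
    .replace(/\s+/g, ' ') // collapse whitespace
    .trim();
  return visibleText.length >= minTextLength;
}
```

If this returns `false` for a page, that is a hint (not proof) that the content is rendered client-side and a headless browser or jsdom may be needed.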