by rdoherty 3907 days ago
Wow. I built a project that rendered JS-built webpages for search engines via NodeJS and PhantomJS. Rendering webpages is extremely CPU intensive; I'm amazed at the amount of processing power Google must have to do this at Internet scale.

I really hope this works, lots of JS libraries expect things like viewport and window size information, I wonder how Google is achieving that.

8 comments

I wonder if they're cutting out a lot of the rendering that PhantomJS is doing. Not to say that any type of rendering is cheap, but I'm guessing they have a limited version of a JS rendering engine that does just enough to index the page.

I bet they'd also skip on all the FB like buttons and other common social media elements that don't impact the content.

This makes sense. Would it be sufficient to just see how the content (e.g. new <ul> elements or something along those lines) on the page changes when JS is executed, without actually rendering anything?
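A minimal sketch of that idea, assuming you already have the raw HTML and a serialized DOM taken after the scripts ran (how you get the second one is left out; any headless JS engine would do). The names `TextCollector`, `text_chunks` and `added_by_js` are made up for illustration, and nothing here claims to be how Google does it.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects visible text chunks, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def text_chunks(html):
    """Return the non-empty text chunks found in an HTML string."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


def added_by_js(static_html, executed_html):
    """Return text chunks present after JS ran but absent from the raw HTML."""
    before = set(text_chunks(static_html))
    return [chunk for chunk in text_chunks(executed_html) if chunk not in before]


# Hypothetical page: the list is empty until a script fills it in.
static_html = "<html><body><h1>Shop</h1><ul id='items'></ul><script>load()</script></body></html>"
executed_html = "<html><body><h1>Shop</h1><ul id='items'><li>Red shoes</li><li>Blue hat</li></ul></body></html>"

print(added_by_js(static_html, executed_html))  # ['Red shoes', 'Blue hat']
```

This only compares text, so it needs no layout or painting at all, but it would miss content that depends on scrolling or user interaction.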
They're starting to consider page load speed as a factor in rankings, which would lead me to believe that they're letting all the social buttons / trackers / media load.
How do you know a page has loaded? A complex page with ads, AJAX, and WebSockets may be constantly busy. Most social buttons, ads, etc. are now loaded by callbacks that usually finish after the page has rendered.
Most of that data flow, barring user interaction, is much more limited compared to the initial load of controls, iframes, images, etc. You can visibly see the drop-off.

If you look at the Network tab in Chrome dev tools, you can see when the DOM ready event fires, when the window load event fires, and when it really feels like the content is done loading. That final load time is when the data flow lulls for a bit.
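The "wait for the lull" heuristic could be sketched roughly like this, assuming you have a list of timestamps (seconds since navigation start) at which network requests started or finished. The function name, the 0.5 s quiet window and the 10 s deadline are all arbitrary choices for illustration; the deadline is there for the pages that never go quiet.

```python
def settle_time(event_times, quiet_window=0.5, deadline=10.0):
    """Return the time of the last event before the first network lull.

    A lull is a gap of at least `quiet_window` seconds with no events.
    If no lull happens before `deadline`, give up and return `deadline`.
    """
    times = sorted(t for t in event_times if t <= deadline)
    if not times:
        return 0.0
    # Pair each event with the next one; the deadline acts as a final sentinel.
    for current, following in zip(times, times[1:] + [deadline]):
        if following - current >= quiet_window:
            return current
    return deadline


# Hypothetical trace: a burst of initial requests, then sparse late callbacks.
events = [0.02, 0.10, 0.15, 0.40, 0.55, 0.70, 1.90, 3.20, 5.00]
print(settle_time(events))  # 0.7

# A page that polls every 100 ms never settles, so the deadline wins.
busy = [i * 0.1 for i in range(200)]
print(settle_time(busy))  # 10.0
```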

Can confirm. Launched a project recently with over 500 concurrent PhantomJS workers. Let's just say my hosting bill is significantly more expensive than it was.
> lots of JS libraries expect things like viewport and window size information, I wonder how Google is achieving that.

Just plug in common screen parameters (e.g. 1920x1080, 1366x768, ...) and analyze it as if it were the result you'd get by default with Chrome on such a screen, I would imagine.

Same goes for user agent spoofing (to some extent). You can imagine most of what you do in the Chrome dev tools being done without actual user interaction.
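As a rough sketch of "plugging in" parameters: build one render profile per combination of screen size and user agent, then hand each profile to whatever renderer you use. The two viewport sizes are the ones mentioned above; the user agent strings and the `RenderProfile` type are placeholders invented for this example.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class RenderProfile:
    width: int
    height: int
    user_agent: str


# Screen sizes from the comment above.
VIEWPORTS = [(1920, 1080), (1366, 768)]
# Placeholder strings, not real browser user agents.
USER_AGENTS = ["ExampleDesktopBrowser/1.0", "ExampleCrawler/1.0"]


def render_profiles():
    """Return every viewport/user-agent combination as a RenderProfile."""
    return [
        RenderProfile(width, height, agent)
        for (width, height), agent in product(VIEWPORTS, USER_AGENTS)
    ]


for profile in render_profiles():
    print(profile)
```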
Chrome is much lighter than PhantomJS. I use Awesomium, which is a Chromium-based browser engine for .NET, and it loads pages in half the time PhantomJS does with much less CPU load. My guess is that Google can refine it even further.
I'm wondering if Google is somehow, in some way, using the rendering data generated by the Chrome clients and/or Android to help with the processing power it takes to index everything.
More likely they're already getting lots of data from analytics users for a great number of sites, and only really need to do custom renders for load time analysis for some sites, and not necessarily all pages. To a larger extent, I'm pretty sure they could have an optimized rendering pipeline for a headless Chrome that actually works better than PhantomJS.
I think they might avoid the need to crawl _every_ page of every web site in that fashion. They must be doing some sort of analysis so they can "old-school-crawl" pages that don't need JavaScript interpretation.
What if they don't actually "render" the DOM as part of the "load" analysis? That would mean they don't necessarily need to handle certain UI/UX aspects that can be bypassed. They could then output the "rendered" content for passthrough to the same system that does their general crawl analysis for additional details.
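One crude way such a triage could work, purely as a guess: if the raw HTML already carries plenty of text, crawl it the old way; if it is mostly an empty shell plus script tags, queue it for JS rendering. The 200-character threshold and the names `PageStats` and `needs_js_render` are assumptions made up for this sketch.

```python
from html.parser import HTMLParser


class PageStats(HTMLParser):
    """Counts script tags and visible text characters in raw HTML."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0
        self.script_tags = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_chars += len(data.strip())


def needs_js_render(html, min_text_chars=200):
    """Guess whether a page is a script-driven shell with little static text."""
    stats = PageStats()
    stats.feed(html)
    stats.close()
    return stats.script_tags > 0 and stats.text_chars < min_text_chars


# Hypothetical pages: a static article and a single-page-app shell.
article = "<html><body><article>" + "Lorem ipsum " * 40 + "</article><script src='analytics.js'></script></body></html>"
spa_shell = "<html><body><div id='app'></div><script src='app.js'></script></body></html>"

print(needs_js_render(article))    # False
print(needs_js_render(spa_shell))  # True
```

A real system would need more signals than this, since a page can have lots of static text and still load its main content via JS.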

The work could be broken up in any number of ways. From my own testing, and from others' testing, content crawls/recrawls from JS data tend to lag a couple of days behind the initial scan. Having an updating sitemap XML resource is a good idea for "new" content if you're doing JS-based content. Also, rescans will still lag well behind the general non-JS content scans.
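For the sitemap part, a small sketch of regenerating the XML whenever content changes, using the standard sitemap namespace with `loc` and `lastmod` entries. The URLs and dates are placeholders, and `build_sitemap` is just a name picked for this example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, last_modified_date) pairs.

    Dates are expected as ISO 8601 strings such as "2014-05-23".
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Placeholder URLs for JS-driven pages that were recently added or updated.
pages = [
    ("https://example.com/#!/products/1", "2014-05-23"),
    ("https://example.com/#!/products/2", "2014-05-24"),
]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```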

The viewport and window size is probably just for browser fingerprinting. Google probably just grabs the text, and has a pretty efficient fingerprinting system.
The XHTML+XSLT+XSL-FO stack produced pages that took 3x-10x less CPU to render. But that's dead of course.
> The XHTML+XSLT+XSL-FO stack produced pages that took 3x-10x less CPU to render. But that's dead of course.

Was it ever alive? I never found a decent browser XSL-FO renderer. There were some that seemed kind of proof-of-concept-ish, but the only decent XSL-FO rendering I ever encountered was intended for print-like media, mostly PDF, rather than for browsing.