Hacker News new | ask | show | jobs
by thoop 3118 days ago
Todd from Prerender.io here. We always knew this day would come eventually :) We are currently serving around 60 million prerendered pages to crawlers every day, with Google being about half of those requests. We are recaching around 1 billion pages every month in PhantomJS/Headless Chrome. Google is the only crawler executing a meaningful amount of JavaScript so Bing, Baidu, Yandex, Facebook, Twitter, and other SEO crawlers still need prerendering.

For anyone that will need to update their own crawlers to match Google’s new javascript crawling, we’ve opened up our Prerendering engine that uses Headless Chrome at https://prerender.com. You can capture HTML, Screenshots, PDFs, or even HAR files from any web page with just an http request to our service. So it’s super easy to add javascript crawling to any crawler with Prerender.com (and it’s open source https://github.com/prerender/prerender).

For our Prerender.io customers, this announcement just means that Google will stop crawling ?_escaped_fragment_= URLs so they won’t request prerendered pages anymore. Instead, Google will just execute the javascript directly and index the result.

We’ve always recommended that our customers use the escaped fragment protocol, so it will be a smooth transition as Google slowly stops crawling the ?_escaped_fragment_= URLs. No changes need to be made if you are currently using Prerender.io. Keep an eye on our twitter (@prerender) and we’ll give updates on Google’s transition.

The one thing to look for when Google starts executing your javascript is keep an eye on your Google Webmaster Tools for your number of pages crawled by Google. In the past, we’ve seen that Google crawls much slower when executing the javascript themselves. Hopefully javascript websites don’t take a hit in number of pages crawled daily since that can affect large sites having all of their pages up to date in Google's index.

5 comments

This isn't so much a service question as a technical bogglement, but I'm very curious what sort of heavy lifting you need to do 60 million renders a day.

Assuming renders take 7-10 seconds at worst, that means (if I've got my math right!) that you need to do between (60m/(86400/7)=4861) and (60m/(86400/10)=6944) renders per second in order to keep up. (86400 = seconds in a day)

...Ahahahaha :)

Given that a single Chrome instance on my new-but-not-particularly-amazing i3 box can be sluggish at the best of times... I have no idea what sort of tolerances Xeon(?)-class hardware (possibly running Xen? :P) have to running multiple entire copies of Chromium... I initially wondered if you needed 1000 compute instance, then I realized maybe you only needed 400, now I honestly don't know at all.

--

I'm also curious how using Headless Chrome and PhantomJS is working out. As in, genuinely interested. IIUC my understanding is that PhantomJS has pretty much wound down, while Headless Chrome is fractionally different enough from Chrome that it's possible to tell which one you're running on (https://news.ycombinator.com/item?id=14936025). I've been idly curious about "perfectly sandboxing" webpages so they honestly can't tell they're not in a "normal" PC/laptop/mobile environment, and my impression is that I'd have to start with a _very_ carefully configured copy of normal Chromium in order to do it.

--

I must admit that I got curious at what 60m monthly renders looked like against the pricing structure... but couldn't really figure it out, it's not a simple enough exponential curve (and I can't math for nuts). Single-stepping through the pricing algorithm was very interesting though ($1522 for enterprise, huh cool).

--

PS. The view-source link at the bottom is unfortunately broken; Chrome blocked opening such URLs recentlyish. Fixing it will likely require, ironically, a little server-side renderer :)

--

EDIT: One last thing, note https://news.ycombinator.com/item?id=15882066 from this thread

Yep, we have LOTS of servers :) We pretty heavily cache pages too.

Headless Chrome is great and we're super thankful that the Chromium team put the work in! PhantomJS is good... it just doesn't have all of the latest features, like ES6. So it was really helpful that headless Chrome came along right as people started using more ES6.

Yeah, Chrome did break the opening of view-source URLs a while back for our https://prerender.io/ buttons on the bottom of the homepage.

Todd,

prerender is almost what I have been looking for a while! I am now having a great weekend!

I want to scrape a JSON object within a script tag in an AJAX heavy webpage - if prerender removes all script tags, does that mean I can't use prerender for my project?

Is there a way to tell prerender not to remove certain scripts?

PS: I intend to use prerender to replace scrapy-splash middleware. Does your team have any plans to build a scrapy middleware?

So do you have to pivot now? Seems like this will hugely impact your US/Europe customers
We'll keep working on improving Prerender.io and we'll continue serving our customers exactly as we are now. So no pivot for Prerender.io.

Prerender.com is sort of a pivot, but more like just opening up our infrastructure to other crawlers that would like to execute javascript too. So we're expanding our offerings :)

I'd be really interested in how you planned for this sort of thing? I'd imagine it would be quite scary to have a business model that you knew would probably disappear in an instant.
Interesting pivot! I built a similar set of services (page.rest, screen.rip & pdf.cool) on top of Chrome Headless too.

Do you think Google indexing also uses Chrome Headless now to index the pages?