by thephyber 3903 days ago
The hints were littered everywhere that they did this.

Google does malware detection. Not on every crawl, but a certain percentage of crawls. At my old social network site, they detected malware that must have come from ad/tracking networks because those pages had no UGC. This suggests they were using Windows virtual machines (among others) and very likely using browsers other than a heavily modified curl / wget and a headless Chrome.

They started crawling the JavaScript-rendered version of the web, including AJAX schemes that use URL hashbangs (#!). This was an explicit acknowledgement that they were running JavaScript and doing advanced DOM parsing.
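For anyone who never ran into the "#!" AJAX crawling scheme, here is a minimal sketch of the URL rewrite the crawler performed, as I understand the published scheme: a "#!" fragment is moved into an `_escaped_fragment_` query parameter so the server can return a pre-rendered snapshot. The function names are my own, and the set of escaped bytes is my recollection of the scheme, so treat both as assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def escape_fragment(value):
    """Percent-escape the bytes the scheme (as I recall it) reserved:
    control characters and space, '#', '%', '&', '+', and non-ASCII."""
    out = []
    for byte in value.encode("utf-8"):
        if byte <= 0x20 or byte in (0x23, 0x25, 0x26, 0x2B) or byte >= 0x7F:
            out.append("%{:02X}".format(byte))
        else:
            out.append(chr(byte))
    return "".join(out)


def to_escaped_fragment(url):
    """Rewrite a '#!' URL into its '_escaped_fragment_' form.

    URLs without a '#!' fragment are returned unchanged.
    """
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return url
    param = "_escaped_fragment_=" + escape_fragment(parts.fragment[1:])
    # Append to an existing query string if there is one.
    query = parts.query + "&" + param if parts.query else param
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


if __name__ == "__main__":
    print(to_escaped_fragment("http://example.com/page#!key=value"))
```

The point is that the site owner had to opt in and serve HTML snapshots for those rewritten URLs, which only makes sense if the crawler treats the JavaScript-rendered state as the real page.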

They have always told people that cloaking content (whether keyed on Google crawler IP blocks, user-agent, or other means) is a violation, and they actively punished it. This suggests they do content detection and likely execute JavaScript to detect whether extra scripts change the content of the page for clients that don't appear to be Googlebot.
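To make the cloaking check concrete, here is a rough sketch of how one might compare two snapshots of the same URL, one fetched while identifying as the crawler and one fetched as an ordinary browser. This is not Google's method, which isn't public. It assumes both snapshots have already been fetched (and rendered, if scripts matter), and the names `visible_text`, `looks_cloaked`, and the 0.8 threshold are invented for illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text content of an HTML snapshot as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def looks_cloaked(html_for_bot, html_for_browser, threshold=0.8):
    """Flag a page whose text differs a lot between the two snapshots.

    The threshold is an arbitrary illustrative value, not a known cutoff.
    """
    bot_words = visible_text(html_for_bot).split()
    browser_words = visible_text(html_for_browser).split()
    similarity = SequenceMatcher(None, bot_words, browser_words).ratio()
    return similarity < threshold


if __name__ == "__main__":
    bot = "<p>cheap pills buy now best price</p>"
    human = "<p>Welcome to my garden blog</p>"
    print(looks_cloaked(bot, human))
```

A real system would have to tolerate legitimate differences such as rotating ads, timestamps, and personalization, which is why a similarity score is used here rather than an exact match.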

They have long had measures in place to detect invisible text (e.g. white text on white background) or hidden text (where HTML elements are styled over other HTML elements). This suggests both CSS rendering and JS rendering.

2 comments

> At my old social network site, they detected malware that must have come from ad/tracking networks because those pages had no UGC. This suggests they were using Windows virtual machines (among others) and very likely using browsers other than a heavily modified curl / wget and a headless Chrome.

I think you're making a number of wild assumptions there. You can scan for and detect malware without running Windows, and there's a whole gulf of different technologies between running desktop browsers and running a modified version of curl.

With regard to your browser point, normally I'd suggest that Google would be running node and making use of their own V8 JavaScript engine to render the pages headlessly. However, Google have the resources to build something much more bespoke, so I think it would be foolish of me to make blind assumptions given how little I actually know about their internal technology.

> They have long had measures in place to detect invisible text (e.g. white text on white background) or hidden text (where HTML elements are styled over other HTML elements). This suggests both CSS rendering and JS rendering.

No, this actually suggests it's not doing either. Both invisible and hidden text the way you've described it would be implemented with a CSS style. Not using that style would mean the text would appear as normal. I understand you probably meant that the JS was injecting the text in, which is fully possible, but that's neither hidden nor invisible text.

The parent is talking about them penalizing sites that use such hidden text that would normally show to the crawler but be invisible to an actual human looking at the page.
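To illustrate why spotting that kind of text needs the computed style rather than just the raw markup, here is a small sketch that flags text whose foreground and background colors are nearly identical. It assumes something else (a CSS engine) has already resolved the two colors to hex values. The contrast formula is the standard WCAG relative-luminance one, while `is_effectively_invisible` and its 1.1 cutoff are invented for the example.

```python
def parse_hex(color):
    """Convert '#fff' or '#ffffff' into an (r, g, b) tuple of ints."""
    color = color.lstrip("#")
    if len(color) == 3:
        color = "".join(ch * 2 for ch in color)
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def relative_luminance(rgb):
    """WCAG relative luminance of an sRGB color."""
    def linear(channel):
        c = channel / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linear(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground, background):
    """WCAG contrast ratio, from 1.0 (identical) up to 21.0 (black on white)."""
    l1 = relative_luminance(parse_hex(foreground))
    l2 = relative_luminance(parse_hex(background))
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def is_effectively_invisible(foreground, background, min_ratio=1.1):
    """Flag text a human could not realistically read.

    min_ratio is an arbitrary illustrative cutoff.
    """
    return contrast_ratio(foreground, background) < min_ratio


if __name__ == "__main__":
    print(is_effectively_invisible("#fff", "#ffffff"))
    print(is_effectively_invisible("#000", "#fff"))
```

A crawler that only read the raw HTML would see the text either way. It takes the resolved colors (and, for overlapping elements, layout) to know that a human would not.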