by cotillion 3903 days ago
So they're actually evaluating all js and css Googlebot is consuming. That's insane.

Can we forget about any new competitors in search engine land now? Not only do you have to match Google in relevance you'll actually have to implement your own BrowserBot just to download the pages.

7 comments

The hints were littered everywhere that they did this.

Google does malware detection. Not on every crawl, but a certain percentage of crawls. At my old social network site, they detected malware that must have come from ad/tracking networks because those pages had no UGC. This suggests they were using Windows virtual machines (among others) and very likely using browsers other than a heavily modified curl / wget and a headless Chrome.

They started crawling the JavaScript-rendered version of the web and AJAX schemes that use URL shebangs. This was explicit acknowledgement that they were running JavaScript and did advanced DOM parsing.
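For reference, the crawlable-AJAX scheme mapped a `#!` URL to one carrying an `_escaped_fragment_` query parameter, which the server could answer with a pre-rendered snapshot. Here is a minimal Python sketch of that mapping as I understand the published scheme. The name `to_crawler_url` is made up for illustration, and the set of escaped characters is my reading of the spec, not something checked against Googlebot:

```python
from urllib.parse import quote

# Assumption: everything in printable ASCII except '#', '%', '&' and '+'
# may stay unescaped in the fragment value; the rest is percent-encoded.
_SAFE = "".join(chr(c) for c in range(0x21, 0x7F) if chr(c) not in "#%&+")


def to_crawler_url(url: str) -> str:
    """Rewrite a '#!' URL into its '_escaped_fragment_' form."""
    base, sep, fragment = url.partition("#!")
    if not sep:
        # No hashbang, nothing to rewrite.
        return url
    # Append to an existing query string if there is one.
    joiner = "&" if "?" in base else "?"
    return f"{base}{joiner}_escaped_fragment_={quote(fragment, safe=_SAFE)}"


if __name__ == "__main__":
    print(to_crawler_url("http://example.com/page#!key=value"))
```

The point is only that the site, not the crawler, did the rendering under this scheme. A crawler that executes JavaScript itself no longer needs it.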

They have always told people that cloaking (either to Google crawler IP blocks, user-agent, or by other means) content is a violation and they actively punished it. This suggests they do content detection and likely execute JavaScript to detect if extra scripts change the content of the page for clients that don't appear to be Googlebot.
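To make the cloaking check concrete, here is a rough sketch of the idea: fetch the same URL once as the crawler and once as an ordinary browser, then compare the visible text. Nothing here is known to be how Google does it. The function names, the 0.8 threshold, and the use of `difflib` are all assumptions for illustration, and the fetching step is left out:

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def looks_cloaked(html_for_bot: str, html_for_browser: str,
                  threshold: float = 0.8) -> bool:
    """True when the two responses share too little text (hypothetical cutoff)."""
    ratio = SequenceMatcher(
        None, visible_text(html_for_bot), visible_text(html_for_browser)
    ).ratio()
    return ratio < threshold
```

A real check would compare the rendered DOM after scripts have run, which is exactly why this points toward JavaScript execution on the crawler side.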

They have long had measures in place to detect invisible text (eg. white text on white background) or hidden text (where HTML elements are styled over other HTML elements). This suggests both CSS rendering and JS rendering.
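As a toy illustration of the white-on-white case, the sketch below flags elements whose inline `color` equals their inline `background-color`. It is deliberately naive and the names are invented here. It only sees inline `style` attributes, whereas a real detector would need the computed style after the full CSS cascade (and any script changes), which is the reason this hints at actual rendering:

```python
from html.parser import HTMLParser


def parse_inline_style(style: str) -> dict:
    """Split 'a: b; c: d' into a lowercase property dict."""
    props = {}
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        if sep:
            props[name.strip().lower()] = value.strip().lower()
    return props


class HiddenTextFinder(HTMLParser):
    """Records tags whose inline text colour matches the inline background."""

    def __init__(self):
        super().__init__()
        self.suspicious = []

    def handle_starttag(self, tag, attrs):
        props = parse_inline_style(dict(attrs).get("style") or "")
        color = props.get("color")
        if color and color == props.get("background-color"):
            self.suspicious.append(tag)


def find_hidden(html: str) -> list:
    finder = HiddenTextFinder()
    finder.feed(html)
    finder.close()
    return finder.suspicious
```

For example, `find_hidden('<div style="color:#fff; background-color:#fff">spam</div>')` flags the `div`, while the same text styled from an external stylesheet would slip straight past it.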

> At my old social network site, they detected malware that must have come from ad/tracking networks because those pages had no UGC. This suggests they were using Windows virtual machines (among others) and very likely using browsers other than a heavily modified curl / wget and a headless Chrome.

I think you're making a number of wild assumptions there. You can scan and detect malware without running Windows; and there's a whole gulf of different technologies between running desktop browsers and running a modified version of curl.

With regard to your browser point, normally I'd probably suggest that Google would be running node and making use of their own V8 JavaScript engine to headlessly render the pages. However, Google have the resources to build something much more bespoke, so I think it would be foolish of me to make blind assumptions given how little I actually know about their internal technology.

> They have long had measures in place to detect invisible text (eg. white text on white background) or hidden text (where HTML elements are styled over other HTML elements). This suggests both CSS rendering and JS rendering.

No, this actually suggests it's not doing either. Both invisible and hidden text the way you've described it would be implemented with a CSS style. Not using that style would mean the text would appear as normal. I understand you probably meant that the JS was injecting the text in, which is fully possible, but that's neither hidden nor invisible text.

The parent is talking about them penalizing sites that use such hidden text that would normally show to the crawler but be invisible to an actual human looking at the page.

I don't know that I would go head-to-head with Google in crawling the entire web. However, I do see a lot of opportunities for "vertical search." That is -- search engines focused on specific, niche verticals (travel, healthcare, etc.)

I'm working on a couple of projects in vertical search, and it is quite exciting. Sure, I'm building tech that Google had in 2005, but we are surprised with the results. We achieve search relevance simply by curating the sites we crawl (still in the thousands in some cases).

Do you have any links to share? I'm working on a side project for vertical search for programmers. Curating sites to crawl with source code, docs, mailing lists, QA, IRC and tutorials.

Trying to get away from the "W3Schools effect" [0], where outdated, terribly presented information or downright spammy pages are locked in the top results of Google by virtue of being around for so long, or by gaming search keywords [1].

[0] https://github.com/nathancahill/fuck-w3schools

[1] http://www.bigresource.com/

I don't have anything public, but I have been exploring strategies for gluing together different tech in order to accomplish our goals. Latest stack has been:

- wget / wpull / Heritrix to produce .warcs across a single domain
- have a filewatcher on a folder to process .warc into text and then push it into Elasticsearch with relevant metadata
- Flask search frontend for querying / results
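The middle step of that stack could look roughly like the sketch below. It assumes the text has already been extracted from the .warc files and only shows how the documents might be packed into the newline-delimited body that Elasticsearch's `_bulk` endpoint accepts. The index name, field names, and `bulk_payload` itself are placeholders, not anything from the actual project:

```python
import json


def bulk_payload(records, index="pages"):
    """Build an NDJSON body for the Elasticsearch _bulk API.

    `records` is an iterable of (url, text) pairs; the URL doubles as the
    document id so that re-crawls overwrite instead of duplicating.
    """
    lines = []
    for url, text in records:
        # Action line, then the document source line.
        lines.append(json.dumps({"index": {"_index": index, "_id": url}}))
        lines.append(json.dumps({"url": url, "body": text}))
    # The bulk body has to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(bulk_payload([("http://a.example/", "hello world")], index="docs"))
```

The payload would then be POSTed to `/_bulk` with an `application/x-ndjson` content type. The filewatcher and the WARC parsing are left out here.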

Happy to share my learnings elsewhere. (I pinged you on email)

That was my first reaction as well. "We've engineered a competitive advantage so why don't you throw out that hard work that helps our competitors?"

I'm not sure where I sit on this. Developers who want to be noticed by other engines will continue to focus on SEO, but how many engineers care about SEO for any engine that isn't Google?

Honestly, optimizing sites for search is just wrong. It's happening now because search is not perfect and developers have to work around its imperfections. But in the ideal future, webmasters should design websites for users first and only, not for search engines. That's the direction things are moving in now, and it's a good sign.

Of course Google competitors must work hard. I don't see why that's a bad thing. It's not like Bing or Yandex are going to disappear in the foreseeable future.

Depends what locales you are targeting. I am doing some strategy proposals for a client to help with their move into Asia.

Baidu is one SE that doesn't crawl JS well, from my research.

You can use one of the many headless browsers available. Selenium, phantomjs, phantomjs+casper, webkit, chromium, awesomium, name your poison. All are quite competent in rendering modern web pages. You don’t need to reinvent the wheel.

Also, if you want a headless browser that uses solely a JRE, my project is https://github.com/machinepublishers/jbrowserdriver

Any idea how good Java's Nashorn is for this?

Nashorn isn't actually used in Java's WebView (which my project leverages). Nashorn is used elsewhere in the JRE and replaces Rhino from prior releases, but WebView has used something else entirely: JavaScriptCore. Details: http://stackoverflow.com/questions/30104124/what-javascript-...

But essentially on performance, it's comparable to a desktop browser but still slower than I'd like. Java 9 should support HTTP 2 and async HTTP by default, which might help. And I've been looking into short-cutting some of the in-memory rendering but haven't had any breakthroughs yet.

As for the JavaScriptCore engine specifically, it's the default in WebKit, so there should be good performance data out there on it.

Thanks
PhantomJS allows you to render a page and fully manipulate or search it. It's a headless WebKit browser you can use from the command line and it works pretty well. Google is obviously doing the same thing. They even used to show images of what a URL looks like in the search results. They stopped doing that, I suspect because it used up a lot of resources on many sites.

I can say that Bing definitely does JS interpretation as part of some of its rendering. I switched the URL routing on a relatively large site (about 300k routes, including navigable search URLs) so that the routes were all consistent, with the old ones pointing to the new routes via permanent redirects. Previously the project supported all of its older routing schemes accumulated over time, which was troublesome for SEO (duplicated content on many pages, the same results reachable under differently structured URLs because the search parameters were the same, and likewise for individual content pages). When the change happened, we saw a huge uptick in Google Analytics hits (one page, no clickthroughs) coming from two locations; both turned out to be MS data centers. It was a relatively common problem.

It was always just a little white noise in the past, but when suddenly a couple hundred thousand pages permanently redirect... it was interesting.
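For anyone picturing the routing change, this is the general shape of it: several legacy URL schemes collapse onto one canonical route through a permanent (301) redirect. The sketch is purely illustrative. The legacy patterns (`/item.php?id=N`, `/view/N/`, `/article/N`) and the canonical `/items/N` form are invented, since the real site's routes aren't given:

```python
import re
from urllib.parse import parse_qs, urlsplit

# Hypothetical legacy path schemes that should all land on /items/<id>.
_LEGACY_PATH = re.compile(r"^/(?:article|view)/(\d+)/?$")


def resolve(request_target: str):
    """Return (status, location) for a request path plus query string."""
    parts = urlsplit(request_target)

    match = _LEGACY_PATH.match(parts.path)
    if match:
        return 301, f"/items/{match.group(1)}"

    if parts.path == "/item.php":
        ids = parse_qs(parts.query).get("id")
        if ids and ids[0].isdigit():
            return 301, f"/items/{ids[0]}"

    # Already canonical (or unknown): serve as-is, no redirect.
    return 200, None
```

A crawler that executes the page's analytics script after following each redirect would show up as exactly the kind of one-page, no-clickthrough hits described above.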

> your own BrowserBot

http://phantomjs.org/

So they're using headless browsers. Why can't anyone do that?

Scale

And perhaps security. I wouldn't be surprised if Google avoided standard C++/JIT browser engines in favor of something custom entirely written in a safe language - but if they don't, it wouldn't be that hard to get code execution on (a sandboxed portion of) Googlebot. Same goes for competitors - I don't think the state of public safe-language browsers is that good, though I'm not sure.

They are probably using virtual machines anyway, so it's not hard to set things up to simply load a saved RAM state for each new page they crawl. This sidesteps the security issue (as long as there are no sandbox escapes).

It's possible they are using components from Google Chrome as others mentioned, like V8.