Hacker News
by pierrefar 5326 days ago
Hi Paul,

A site can be crawled from any number of Googlebot IP addresses, and so blocking all except one doesn't help in throttling crawling.

If you verify the site in Webmaster Tools, we have a tool you can use to set a slower crawl rate for Googlebot, regardless of which specific IP address ends up crawling the site.

Let me know if you need more help.

Edit: Detailed instructions to set a custom crawl rate:

1. Verify the site in Webmaster Tools.

2. On the site's dashboard, the left hand side menu has an entry called Site Settings. Expand that and choose the Settings submenu.

3. The page there has a crawl rate setting (the last one). It defaults to "Let Google determine my crawl rate (recommended)". Select "Set custom crawl rate" instead.

4. That opens a form where you can choose the desired crawl rate in crawls per second.

If there is a specific problem with Googlebot, you can reach the team as follows:

1. To the right hand side of the Crawl Rate setting is a link called "Learn More". Click that to open a yellow box.

2. In the box is a link called "Report a problem with Googlebot", which will take you to a form you can fill out with full details.

Thanks!

Pierre


I would like to set that crawl rate but do not see why I must register at Google to do so. Why can't Google support the Crawl-Delay directive in robots.txt for this?
My plane is about to take off, but very briefly: people sometimes shoot themselves in the foot and get it way, way wrong. Like "crawl a single page from my website every five years" wrong.

Crawl-Delay is (in my opinion) not the best measure. We tend to talk about "hostload," which is the inverse: the number of simultaneous connections that are allowed.
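To make the distinction concrete, here is a minimal sketch of a hostload-style limit: a cap on simultaneous connections to one host, as opposed to a Crawl-Delay-style pause between sequential requests. The names `HostLoadLimiter`, `fake_fetch`, and `crawl` are made up for illustration, no real network requests are made, and nothing here is meant to describe how Googlebot actually works.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class HostLoadLimiter:
    """Caps the number of simultaneous connections to one host."""

    def __init__(self, max_connections: int) -> None:
        self._slots = threading.BoundedSemaphore(max_connections)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest concurrency observed, kept for inspection

    def fetch(self, url, do_fetch):
        # Blocks while max_connections fetches are already in flight.
        with self._slots:
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            try:
                return do_fetch(url)
            finally:
                with self._lock:
                    self._active -= 1


def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for an HTTP request."""
    time.sleep(0.01)  # simulated network latency
    return f"fetched {url}"


def crawl(urls, max_connections: int, workers: int = 8):
    """Fetch all urls with a worker pool, never exceeding max_connections."""
    limiter = HostLoadLimiter(max_connections)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda u: limiter.fetch(u, fake_fetch), urls))
    return results, limiter.peak
```

Calling `crawl(urls, max_connections=2)` runs eight worker threads but never has more than two fetches in flight at once. A Crawl-Delay-style limiter would instead sleep a fixed interval between one request and the next.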

Another great way to shoot yourself in the foot (and get it way, way wrong) is to block all Googlebot IPs except for one.
Instead of completely disregarding Crawl-Delay, why not support it up to a maximum value that is deemed sensible? This would prevent people from completely shooting themselves in the foot, and it would surely be better than completely disregarding it.
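A rough sketch of what such a cap could look like, using Python's standard `urllib.robotparser` to read the directive. The 30-second ceiling and the function name `effective_crawl_delay` are assumptions picked purely for illustration, not values any crawler is known to use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical ceiling, chosen only for this example.
MAX_CRAWL_DELAY_SECONDS = 30.0


def effective_crawl_delay(robots_txt: str, user_agent: str,
                          max_delay: float = MAX_CRAWL_DELAY_SECONDS) -> float:
    """Return the Crawl-delay for user_agent, clamped to max_delay seconds."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    requested = parser.crawl_delay(user_agent)  # None if absent or invalid
    if requested is None:
        return 0.0  # no usable directive: apply no extra delay
    return min(float(requested), max_delay)


# A "five years between requests" mistake gets clamped to the ceiling.
FIVE_YEARS = "User-agent: *\nCrawl-delay: 157680000\n"
print(effective_crawl_delay(FIVE_YEARS, "Googlebot"))
```

With a cap like this, a typo in robots.txt slows the crawler down to the ceiling but cannot make the site effectively disappear.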
I would think that the number of people who (a) know how to create a valid robots.txt file, (b) have some idea of how to use the "crawl-delay" directive and (c) write a "shoot-themselves-in-the-foot" worthy error is vanishingly small.
As opposed to --for illustrative purposes-- the vanishingly small number of people who know how to block IP addresses and manage to get their site to disappear from Google's listings?

::cough::

A few years ago, I did pretty much the same thing myself. Thankfully the late summer was our slow season and the site recovered pretty quickly from my bone-headed move, but the split second after I realized what I'd done was bone-chilling.

I think just about everyone has thought at some point that they understood how something worked, only to have had things go pear-shaped on them.

The lesson: people are not fully knowledgeable about everything, even the smart and talented ones.

Perhaps I'm making an error assuming that a website as influential as Hacker News has a "real live Webmaster" to do things like write robots.txt files.
I alluded to some of the ways that I've seen people shoot themselves in the foot in a blog post a few years ago: http://www.mattcutts.com/blog/the-web-is-a-fuzz-test-patch-y...

"You would not believe the sort of weird, random, ill-formed stuff that some people put up on the web: everything from tables nested to infinity and beyond, to web documents with a filetype of exe, to executables returned as text documents. In a 1996 paper titled 'An Investigation of Documents from the World Wide Web,' Inktomi's Eric Brewer and colleagues discovered that over 40% of web pages had at least one syntax error."

We can often figure out the intent of the site owner, but mistakes do happen.

The number of webpages with HTML that's just plain wrong (and renders fine!) is staggering. I often wonder what the web would be like if web browsers threw an error upon encountering a syntax error rather than making a best effort to render.

If you're writing HTML, you should be validating it: http://validator.w3.org/

Google.com has 39 errors and 2 warnings. Among other things, they don't close their body or html tags.

Is there any real downside to having syntax errors?

> I often wonder what the web would be like if web browsers threw an error upon encountering a syntax error rather than making a best effort to render.

The web would have died in stillbirth and it would never have grown to where it is now.

"Be generous in what you accept" (part of Postel's Law) is a cornerstone of what made the internet great.

XHTML had a "die upon failure" mode, and it has died. Why do you think XHTML was abandoned and lots of people are using HTML5 now?
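For anyone who wants to see the two philosophies side by side, here is a small sketch using only the Python standard library. `html.parser.HTMLParser` stands in for the lenient, best-effort approach and `xml.etree.ElementTree` for the strict XML-style one. The helper names and the sample markup are invented for the example, and neither parser behaves exactly like a real browser.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Lenient parser: collects text and ignores unclosed tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.chunks.append(data.strip())


def lenient_text(markup: str):
    """Best effort: return whatever text can be recovered."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return collector.chunks


def strict_text(markup: str):
    """Die upon failure: return None if the markup is not well-formed."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return None
    return [t.strip() for t in root.itertext() if t.strip()]


# Sample with unclosed <b>, <body> and <html> tags.
BROKEN = "<html><body><p>Hello <b>world</p>"

print(lenient_text(BROKEN))  # the text is still recovered
print(strict_text(BROKEN))   # the strict parser rejects the whole document
```

The lenient parser still yields the text of the broken page, while the strict one gives up on the entire document because of a single mismatched tag.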

"...everything from tables nested to infinity..."

The irony of that statement on Hacker News is pretty amazing. Have you looked at how the threads are rendered on this page? It is tables all the way down.

Considering it's Google and we're talking about almost the whole population of the Earth, even a vanishingly small percentage of the planet's population would still be at least hundreds of thousands of people.
I do understand the thinking, but I don't think it is a good gesture. You could always cap crawl-delay at a reasonable maximum and additionally allow people to fix mistakes through the webmaster tools (e.g. if they told your bots to stay away for a long time but in the meantime want to revert that).

Maybe instead that hostload could be parsed from robots.txt? It sure seems like the better mechanic to tweak for load issues (while traffic/bandwidth issues are still unresolved).

Matt, how does a sitemap fit into this? If I'm not mistaken, you can suggest some refresh rate there, too. Do you take that into account?
Could google vary the crawling rate on each site and see what effects that has on response times, and develop an algorithm to adjust crawl speed so as not to affect site performance too much? If google starts crawling a site and notices sequential crawl requests are answered in .5s w/ .1s stddev and it starts crawling with 10 parallel connections and the answers are 2s w/ 1s stddev, clearly that's a problem because user experience for real people will be impacted. Maybe google could automatically email webmaster@ and notify them of performance issues it sees when crawling.
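As a rough illustration of the kind of feedback loop being suggested, here is a sketch that backs off when response times under load are much worse than the sequential baseline, and otherwise probes upward by one connection. The function name, the 1.5x tolerance, and the halve-or-add-one policy are all assumptions made for the example, not a description of anything Google does.

```python
from statistics import mean


def next_connection_count(current: int,
                          baseline_latencies,
                          observed_latencies,
                          slowdown_tolerance: float = 1.5,
                          max_connections: int = 10) -> int:
    """Pick the number of parallel connections for the next crawl round.

    baseline_latencies: response times (seconds) from sequential requests.
    observed_latencies: response times at the current connection count.
    """
    baseline = mean(baseline_latencies)
    observed = mean(observed_latencies)
    if observed > baseline * slowdown_tolerance:
        # The site is noticeably slower under load: halve the connections.
        return max(1, current // 2)
    # The site is coping: try one more connection, up to the ceiling.
    return min(max_connections, current + 1)


# The numbers from the comment: about 0.5s sequentially, about 2s with
# 10 parallel connections.
print(next_connection_count(10, [0.5, 0.4, 0.6], [2.0, 1.0, 3.0]))
```

A fuller version would probably also look at the spread of response times (the stddev mentioned above) and at error rates, but the mean alone is enough to show the shape of the idea.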

Another thing that might help google is for them to announce and support some meta tag that would allow site owners (or web app devs) to declare how likely a page is to change in the future. Google could store that with the page metadata and when crawling a site for updates, particularly when rate limited via webmaster tools, it could first crawl those pages most likely to have changed. Forum/discussion sites could add the meta tags to older threads (particularly once they're no longer open for comments) announcing to google that those thread pages are unlikely to change in the future. For sites with lots of old threads (or lots of pages generated from data stored in a DB and not all of which can be cached), that sort of feature would help the site during google crawls and would help google keep more recent pages up to date without crawling entire sites.

> declare how likely a page is to change in the future

I believe you can do that using a sitemap.xml
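For reference, the sitemap protocol has an optional `<changefreq>` element per URL, with the values always, hourly, daily, weekly, monthly, yearly and never. It is a hint to crawlers rather than a command. Below is a small sketch that generates such a file; the function name and the example.com URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
VALID_CHANGEFREQ = {
    "always", "hourly", "daily", "weekly", "monthly", "yearly", "never",
}


def build_sitemap(pages) -> str:
    """Build sitemap XML from an iterable of (url, changefreq) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, changefreq in pages:
        if changefreq not in VALID_CHANGEFREQ:
            raise ValueError(f"invalid changefreq: {changefreq!r}")
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "changefreq").text = changefreq
    return ET.tostring(urlset, encoding="unicode")


# A closed forum thread marked "never", the front page marked "hourly".
print(build_sitemap([
    ("https://example.com/thread/1", "never"),
    ("https://example.com/", "hourly"),
]))
```

A forum could mark threads that are closed to comments as "never" and its index pages as "hourly", which is roughly the use case described above.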