

	by pg 5326 days ago
	I sent you an email about this. (A couple weeks ago I banned all Google crawler IPs except one. Crawlers are disproportionately bad for HN's performance because HN is optimized to serve recent stuff, which is usually in memory.)

5 comments

pierrefar 5326 days ago

Hi Paul,

A site can be crawled from any number of Googlebot IP addresses, and so blocking all except one doesn't help in throttling crawling.

If you verify the site in Webmaster Tools, we have a tool you can use to set a slower crawl rate for Googlebot, regardless of which specific IP address ends up crawling the site.

Let me know if you need more help.

Edit: Detailed instructions to set a custom crawl rate:

1. Verify the site in Webmaster Tools.

2. On the site's dashboard, the left hand side menu has an entry called Site Settings. Expand that and choose the Settings submenu.

3. The page there has a crawl rate setting (last one). It defaults to "Let Google determine my crawl rate (recommended)". Select "Set custom crawl rate" instead.

4. That opens up a form where you can choose your desired crawl rate in crawls per second.

If there is a specific problem with Googlebot, you can reach the team as follows:

1. To the right hand side of the Crawl Rate setting is a link called "Learn More". Click that to open a yellow box.

2. In the box is a link called "Report a problem with Googlebot", which will take you to a form you can fill out with full details.

Thanks!

Pierre

aw3c2 5326 days ago

I would like to set that crawl rate but do not see why I must register at Google to do so. Why can't Google support the Crawl-Delay directive in robots.txt for this?

Matt_Cutts 5326 days ago

My plane is about to take off, but very briefly: people sometimes shoot themselves in the foot and get it way, way wrong. Like "crawl a single page from my website every five years" wrong.

Crawl-Delay is (in my opinion) not the best measure. We tend to talk about "hostload," which is the inverse: the number of simultaneous connections that are allowed.
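
To make the hostload idea concrete, here is a minimal Python sketch that caps the number of simultaneous connections to one host. The names `crawl_with_hostload` and `fetch` are hypothetical and chosen for illustration; nothing here describes how Googlebot is actually implemented.

```python
import asyncio


async def crawl_with_hostload(urls, fetch, hostload=2):
    """Fetch every URL, keeping at most `hostload` requests in flight at once.

    `fetch` is any coroutine function that takes a URL and returns a result
    (a hypothetical stand-in for a real HTTP client call).
    """
    limit = asyncio.Semaphore(hostload)

    async def fetch_one(url):
        # The semaphore blocks here once `hostload` fetches are already running.
        async with limit:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(fetch_one(url) for url in urls))
```

With a delay-based setting the crawler spaces requests out in time; with a hostload-style setting it bounds how many run in parallel, which is closer to what actually loads the server.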

anonfoobar1 5326 days ago

Another great way to shoot yourself in the foot (and get it way, way wrong) is to block all Googlebot IPs except for one.

jfoster 5326 days ago

Instead of completely disregarding Crawl-Delay, why not support it up to a maximum value that is deemed sensible? This would prevent people from completely shooting themselves in the foot, and it would surely be better than completely disregarding it.
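
A capped Crawl-Delay could look roughly like the Python sketch below. It uses the standard library's `urllib.robotparser`; the 60-second cap, the 1-second default and the function name `effective_crawl_delay` are assumptions made for illustration, not values any crawler is known to use.

```python
from urllib.robotparser import RobotFileParser

MAX_CRAWL_DELAY_SECONDS = 60.0  # hypothetical "sensible" maximum


def effective_crawl_delay(robots_txt, user_agent, default=1.0,
                          cap=MAX_CRAWL_DELAY_SECONDS):
    """Return the delay to honor: the site's Crawl-delay, but never above `cap`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    requested = parser.crawl_delay(user_agent)  # None if absent or malformed
    if requested is None:
        return default
    return min(float(requested), cap)
```

A site that asks for a five-year delay would be crawled at the cap instead, while a modest request such as 5 seconds would be honored as written.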

officemonkey 5326 days ago

I would think that the number of people who (a) know how to create a valid robots.txt file, (b) have some idea of how to use the "crawl-delay" directive and (c) write a "shoot-themselves-in-the-foot" worthy error is vanishingly small.

huxley 5325 days ago

As opposed to --for illustrative purposes-- the vanishingly small number of people who know how to block IP addresses and manage to get their site to disappear from Google's listings?

::cough::

A few years ago, I did pretty much the same thing myself. Thankfully the late summer was our slow season and the site recovered pretty quickly from my bone-headed move, but the split second after I realized what I'd done was bone-chilling.

I think just about everyone has thought at some point that they understood how something worked, only to have had things go pear-shaped on them.

The lesson: people are not fully knowledgeable about everything, even the smart and talented ones.

officemonkey 5325 days ago

Perhaps I'm making an error assuming that a website as influential as Hacker News has a "real live Webmaster" to do things like write robots.txt files.

Matt_Cutts 5325 days ago

I alluded to some of the ways that I've seen people shoot themselves in the foot in a blog post a few years ago: http://www.mattcutts.com/blog/the-web-is-a-fuzz-test-patch-y...

"You would not believe the sort of weird, random, ill-formed stuff that some people put up on the web: everything from tables nested to infinity and beyond, to web documents with a filetype of exe, to executables returned as text documents. In a 1996 paper titled 'An Investigation of Documents from the World Wide Web,' Inktomi's Eric Brewer and colleagues discovered that over 40% of web pages had at least one syntax error."

We can often figure out the intent of the site owner, but mistakes do happen.

adbge 5325 days ago

The number of webpages with HTML that's just plain wrong (and renders fine!) is staggering. I often wonder what the web would be like if web browsers threw an error upon encountering a syntax error rather than making a best effort to render.

If you're writing HTML, you should be validating it: http://validator.w3.org/

RobertKohr 5323 days ago

"...everything from tables nested to infinity..."

The irony of that statement on Hacker News is pretty amazing. Have you looked at how the threads are rendered on this page? It is tables all the way down.

pilsetnieks 5325 days ago

Considering it's Google and we're talking about almost the whole population of the Earth, the vanishingly small percentage of the entire population of the planet would still be at least hundreds of thousands of people.

aw3c2 5325 days ago

I do understand the thought but I think it is not a good gesture to do. You could always cap crawl-delay at a reasonable maximum and additionally allow people to fix mistakes through the webmaster tools (eg if they told your bots to stay away for a long time but in the meantime want to revert that).

Maybe instead that hostload could be parsed from robots.txt? It sure seems like the better mechanic to tweak for load issues (while traffic/bandwidth issues are still unresolved).

yesbabyyes 5325 days ago

Matt, how does a sitemap fit into this? If I'm not mistaken, you can suggest some refresh rate there, too. Do you take that into account?

runningdogx 5325 days ago

Could google vary the crawling rate on each site and see what effects that has on response times, and develop an algorithm to adjust crawl speed so as not to affect site performance too much? If google starts crawling a site and notices sequential crawl requests are answered in .5s w/ .1s stddev and it starts crawling with 10 parallel connections and the answers are 2s w/ 1s stddev, clearly that's a problem because user experience for real people will be impacted. Maybe google could automatically email webmaster@ and notify them of performance issues it sees when crawling.
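
The feedback loop described above could be sketched in Python as follows. It borrows the additive-increase/multiplicative-decrease idea from network congestion control, which is a substitution made here for illustration; the function name, the 1.5x slowdown threshold and the limit of 10 connections are all assumptions.

```python
from statistics import mean


def adjust_parallelism(current, baseline_latencies, recent_latencies,
                       slowdown_threshold=1.5, max_parallel=10):
    """Pick the next number of parallel connections from observed latencies.

    If recent responses are noticeably slower than the baseline, halve the
    parallelism (never below 1); otherwise probe upward by one connection.
    """
    baseline = mean(baseline_latencies)
    recent = mean(recent_latencies)
    if recent > baseline * slowdown_threshold:
        return max(1, current // 2)
    return min(max_parallel, current + 1)
```

In the scenario from the comment, a baseline near 0.5s and recent answers near 2s at 10 connections would cut the crawler back to 5 connections on the next round.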

Another thing that might help google is for them to announce and support some meta tag that would allow site owners (or web app devs) to declare how likely a page is to change in the future. Google could store that with the page metadata and when crawling a site for updates, particularly when rate limited via webmaster tools, it could first crawl those pages most likely to have changed. Forum/discussion sites could add the meta tags to older threads (particularly once they're no longer open for comments) announcing to google that those thread pages are unlikely to change in the future. For sites with lots of old threads (or lots of pages generated from data stored in a DB and not all of which can be cached), that sort of feature would help the site during google crawls and would help google keep more recent pages up to date without crawling entire sites.

ricardobeat 5325 days ago

> declare how likely a page is to change in the future

I believe you can do that using a sitemap.xml
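
For reference, the sitemap protocol has an optional `<changefreq>` element per URL (values such as `hourly`, `daily` or `never`), which the protocol describes as a hint to crawlers rather than a command. A minimal Python sketch of generating one follows; the `build_sitemap` helper and the example.com URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, changefreq) pairs and return it as a string."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, changefreq in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "changefreq").text = changefreq
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical forum: the front page changes constantly, a closed thread never does.
print(build_sitemap([
    ("http://example.com/", "always"),
    ("http://example.com/thread/123", "never"),
]))
```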

wheels 5325 days ago

Have you considered putting a caching reverse proxy in front of the arc app to keep the backend from having to render all of the old pages?

It seems like the only dynamic element of old articles is the "$x days ago" bit and that'd be pretty easy to turn into something static by instead just putting in timestamps in the actual HTML and using Javascript to transform them into how many hours / days ago they were. Then the crawlers would just be pulling out cached, pre-rendered HTML.

There's an example of doing such with nginx here:

http://serverfault.com/questions/30705/how-to-set-up-nginx-a...

With that you'd just have to send out the HTTP header from the arc app saying that current articles expire immediately, and old ones don't.
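
The header logic on the application side might look something like this Python sketch (HN itself is written in Arc, so this is only an illustration of the idea). The 14-day cutoff for treating an article as "old", the 30-day cache lifetime and the function name are assumptions.

```python
from datetime import datetime, timedelta, timezone


def cache_control_for(posted_at, now=None,
                      archive_after=timedelta(days=14),
                      archive_max_age=timedelta(days=30)):
    """Return a Cache-Control value: long-lived for old articles, none for current ones."""
    now = now or datetime.now(timezone.utc)
    if now - posted_at >= archive_after:
        # Old article: let the reverse proxy serve its cached copy for a long time.
        return f"public, max-age={int(archive_max_age.total_seconds())}"
    # Current article: the proxy must check with the backend on every request.
    return "no-cache"
```

The reverse proxy then only needs to respect the header; it does not have to know anything about article ages itself.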

pg 5325 days ago

I believe Rtm has already set one up.

wheels 5325 days ago

The conspicuous lack of a "Server:" header inclines me to believe that that's probably not the case (most web servers set one indicating the server software and version). Here are the headers that HN sends out from an old post (20 days ago):

  HTTP/1.1 200 OK
  Content-Type: text/html; charset=utf-8
  Cache-Control: private
  Connection: close
  Cache-Control: max-age=0

bascule 5325 days ago

My favorite part of HN's headers: the lines are separated by naked LFs instead of CRLF, in violation of the HTTP spec
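
If you want to check this for any server, a small Python sketch like the one below can flag bare-LF lines in the raw header bytes (captured, for example, from a plain socket read). The function name `bare_lf_lines` is hypothetical.

```python
from typing import List


def bare_lf_lines(raw_head: bytes) -> List[bytes]:
    """Return the header lines that end in LF without a preceding CR."""
    offenders = []
    lines = raw_head.split(b"\n")
    # The last element is whatever follows the final LF, so it is not a
    # terminated line and is skipped.
    for line in lines[:-1]:
        if not line.endswith(b"\r"):
            offenders.append(line)
    return offenders
```

An empty result means every line was terminated with CRLF as the spec requires.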

divtxt 5325 days ago

This is a common violation that everyone accepts. It's definitely done by 'bad' clients - not sure how often servers send bare LF.

(I used to telnet to port 80 for testing, and type GET / HTTP/1.0 <enter> <enter>, and that should be LF on Linux & Mac)

bascule 5325 days ago

You don't have a problem with one of the most trafficked sites for programming/web startup-related news implementing HTTP incorrectly?

Do you ignore whether your HTML is valid just because the browser rendered it correctly?

marshray 5325 days ago

You don't know that everyone accepts it. Even if they did, it doesn't make it right.

slyall 5325 days ago

We use varnish for caching and check the useragent for requests.

If the cache has a copy of an article that is a few hours old it will just give that version to Googlebot while if it thinks a human is requesting the page then it will go to the backend and fetch the latest version.

https://www.varnish-cache.org/lists/pipermail/varnish-misc/2...
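
The decision itself is simple. Restated in Python rather than Varnish's own configuration language, and with an assumed six-hour staleness limit and a hypothetical function name, it is roughly:

```python
def choose_response(user_agent, cached_age_seconds, max_stale_for_bots=6 * 3600):
    """Decide whether to answer from cache or go to the backend.

    `cached_age_seconds` is None when nothing is cached for the page.
    Returns "cache" or "backend".
    """
    is_crawler = "googlebot" in user_agent.lower()
    has_usable_copy = (cached_age_seconds is not None
                       and cached_age_seconds <= max_stale_for_bots)
    if is_crawler and has_usable_copy:
        return "cache"  # a slightly stale page is fine for a crawler
    return "backend"    # people get the latest version
```

Matching on the User-Agent string alone is easy to spoof, so treat this as a load-shedding trick rather than any kind of access control.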

moe 5325 days ago

+1 for varnish. It's stupidly[1] fast and there shouldn't be much trickery required to deflect most of HN's traffic (e.g. ~10 sec expiry for "live" pages, infinite expiry for archived pages).

[1] 15k reqs/sec on a moderate box

Matt_Cutts 5326 days ago

Gotcha--thanks, Paul. I'm about to get on a plane, but we'll get this figured out where we're not sending as much hostload toward HN.

wgx 5326 days ago

If only we could all get our Google woes fixed in such a manner.

metaprinter 5326 days ago

seriously. i can't get a hospital to show up in google maps... no human for me to talk to. HN is number 4 instead of 1 on google page one, and they're right on it.

petercooper 5326 days ago

This is why I don't understand the anti-promotion crowd who think promoting oneself and building an audience is a bad thing. Having the implicit threat of an audience you can address is a major lever to getting decent service nowadays.

davidw 5325 days ago

I'm certainly not against promoting myself where I think some attention is merited, but ultimately that kind of thing is close to a zero sum game in that the number of 'famous' people is fairly limited.

officemonkey 5326 days ago

Which hospital? Seems like this thread has at least one pair of google-eyeballs looking at it.

metaprinter 5324 days ago

i'd rather not drop the name here. but it's a verified listing and i've used the "report a problem" 3 times now. If they're not checking that... :(

Mamady 5326 days ago

Bad PR always seems to get instant action - it's more damage control than anything else.

If you have an audience, you have PR power.

kenjackson 5325 days ago

You can fix this yourself.

In the lower-right corner of Google Maps there is a tiny link that says, "Edit in Google Map Maker". Click this link and you can edit Google Maps. Your edits get sent to Google and they'll approve/deny it in typically a few days.

metaprinter 5324 days ago

it's a verified listing (do you know how hard it is to convince the IT dept to take their automated phone system offline so i could verify a google maps listing? It was nuts, and no, google didn't offer the postcard method) so i don't see why i have to enter the same info again, but i did.

the listing only shows up if you type the exact name of the hospital into the search bar, which is useless.

ashishgandhi 5326 days ago

Does that mean at Google you can manually set the system to treat different sites differently?

I don't mean the ranking but other aspects - like you guys blacklisted some domains which produce low quality content in wholesale. (I don't know if the algorithm was tweaked to detect and filter such sources or if it was a manual thing.)

seabee 5326 days ago

> Does that mean at Google you can manually set the system to treat different sites differently?

Webmaster Tools has a crawl rate slider which operates on a site-by-site basis, and that's existed for quite a while now.

If you're asking if they can manually boost a site's ranking, hopefully that isn't what's being suggested.

tibbon 5326 days ago

Could you use some sort of sitemap or other way to provide the data to Google that isn't so damaging to site performance? Or in Google Webmaster tools turn down the rate of crawling?

Just realized that this could be a problem for lots of sites, and I'm curious as to what the best solution is, since not everyone has Matt Cutts reading their site and helping out.

Matt_Cutts 5326 days ago

We do have a self-service tool in webmaster tools for people who prefer to be crawled slower.

arctangent 5326 days ago

Do you support/respect the "crawl-delay" directive?

http://en.wikipedia.org/wiki/Robots_exclusion_standard#Crawl...

storborg 5326 days ago

Nope. As mentioned above, apparently Google thinks people will "shoot themselves in the foot" with the crawl-delay directive, while they won't with Google's special interface (which requires registering and logging in).

wanorris 5325 days ago

I can't imagine that they are just guessing about this. I'm sure someone tried implementing it and was horrified at the actual results before they gave up on it.

einhverfr 5325 days ago

What more can you expect from the World's Largest Adware Company?

Seriously, one thing about Google is that they seem to really like ensuring people are logged on, preferably at all times. Fortunately recent changes to Google Apps (promoting apps user accounts to full Google accounts) have made this more complex on my side and probably degraded the level of actionable info they can get out of it.

Thieum22 5326 days ago

or faster : http://news.ycombinator.com/item?id=2382728

jedberg 5325 days ago

reddit had the same problem. We set up a separate server just for the google crawler with its own copy of the database, so that the queries for old pages didn't slow down everyone else.