

	by aw3c2 5325 days ago
	I would like to set that crawl rate but do not see why I must register at Google to do so. Why can't Google support the Crawl-Delay directive in robots.txt for this?


Matt_Cutts 5325 days ago

My plane is about to take off, but very briefly: people sometimes shoot themselves in the foot and get it way, way wrong. Like "crawl a single page from my website every five years" wrong.

Crawl-Delay is (in my opinion) not the best measure. We tend to talk about "hostload," which is the inverse: the number of simultaneous connections that are allowed.
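To make the "hostload" idea concrete, here is a minimal Python sketch of a per-host cap on simultaneous connections. It is only an illustration of the concept described above, not how Googlebot works: `HostLoadLimiter`, `fake_fetch`, and the cap of 2 are all hypothetical names and values, and the "fetch" is a short sleep rather than a real HTTP request.

```python
import asyncio


class HostLoadLimiter:
    """Caps the number of simultaneous in-flight fetches per host (hypothetical sketch)."""

    def __init__(self, max_connections: int):
        self.max_connections = max_connections
        self._semaphores = {}  # host -> asyncio.Semaphore
        self._active = {}      # host -> fetches currently in flight
        self.peak = {}         # host -> highest concurrency observed

    async def fetch(self, host, work):
        # One semaphore per host; created lazily inside the running event loop.
        sem = self._semaphores.setdefault(host, asyncio.Semaphore(self.max_connections))
        async with sem:
            self._active[host] = self._active.get(host, 0) + 1
            self.peak[host] = max(self.peak.get(host, 0), self._active[host])
            try:
                return await work()
            finally:
                self._active[host] -= 1


async def demo():
    limiter = HostLoadLimiter(max_connections=2)  # assumed hostload of 2

    async def fake_fetch():
        await asyncio.sleep(0.01)  # stands in for a real page download
        return "ok"

    results = await asyncio.gather(
        *(limiter.fetch("example.com", fake_fetch) for _ in range(10))
    )
    return limiter.peak["example.com"], results
```

A delay-based limit spaces requests out in time; a hostload-style limit bounds how many run at once, so a fast server is crawled quickly and a slow one is automatically crawled more gently.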

anonfoobar1 5325 days ago

Another great way to shoot yourself in the foot (and get it way, way wrong) is to block all Googlebot IPs except for one.

jfoster 5325 days ago

Instead of completely disregarding Crawl-Delay, why not support it up to a maximum value that is deemed sensible? This would prevent people from completely shooting themselves in the foot, and it would surely be better than completely disregarding it.
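The capping idea can be sketched in a few lines of Python using the standard library's `urllib.robotparser`. This is a sketch under stated assumptions: the 30-second cap, the `effective_crawl_delay` helper, and the `ExampleBot` user agent are made up for illustration, and nothing here reflects what any real crawler does.

```python
from urllib.robotparser import RobotFileParser

# Assumed "sensible maximum"; the real value would be a policy decision.
MAX_CRAWL_DELAY_SECONDS = 30.0


def effective_crawl_delay(robots_txt: str, user_agent: str, default: float = 0.0) -> float:
    """Return the Crawl-delay for user_agent, clamped to MAX_CRAWL_DELAY_SECONDS."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    requested = parser.crawl_delay(user_agent)  # None if absent or not a valid integer
    if requested is None:
        return default
    return min(float(requested), MAX_CRAWL_DELAY_SECONDS)


# A site owner who accidentally asks for one request every five years:
five_years = "User-agent: *\nCrawl-delay: 157680000\n"
clamped = effective_crawl_delay(five_years, "ExampleBot")
```

With the cap in place, the five-year mistake degrades to a 30-second delay instead of effectively removing the site from the crawl.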

officemonkey 5325 days ago

I would think that the number of people who (a) know how to create a valid robots.txt file, (b) have some idea of how to use the "crawl-delay" directive and (c) write a "shoot-themselves-in-the-foot" worthy error is vanishingly small.

huxley 5325 days ago

As opposed to --for illustrative purposes-- the vanishingly small number of people who know how to block IP addresses and manage to get their site to disappear from Google's listings?

::cough::

A few years ago, I did pretty much the same thing myself. Thankfully the late summer was our slow season and the site recovered pretty quickly from my bone-headed move, but the split second after I realized what I'd done was bone-chilling.

I think just about everyone has thought at some point that they understood how something worked, only to have things go pear-shaped on them.

The lesson: people are not fully knowledgeable about everything, even the smart and talented ones.

officemonkey 5325 days ago

Perhaps I'm making an error assuming that a website as influential as Hacker News has a "real live Webmaster" to do things like write robots.txt files.

Matt_Cutts 5325 days ago

I alluded to some of the ways that I've seen people shoot themselves in the foot in a blog post a few years ago: http://www.mattcutts.com/blog/the-web-is-a-fuzz-test-patch-y...

"You would not believe the sort of weird, random, ill-formed stuff that some people put up on the web: everything from tables nested to infinity and beyond, to web documents with a filetype of exe, to executables returned as text documents. In a 1996 paper titled 'An Investigation of Documents from the World Wide Web,' Inktomi's Eric Brewer and colleagues discovered that over 40% of web pages had at least one syntax error".

We can often figure out the intent of the site owner, but mistakes do happen.

adbge 5325 days ago

The number of webpages with HTML that's just plain wrong (and renders fine!) is staggering. I often wonder what the web would be like if web browsers threw an error upon encountering a syntax error rather than making a best effort to render.

If you're writing HTML, you should be validating it: http://validator.w3.org/

MatthewPhillips 5325 days ago

Google.com has 39 errors and 2 warnings. Among other things, they don't close their body or html tags.

Is there any real downside to having syntax errors?

thristian 5325 days ago

The downside is maintainability. If your website follows the rules, you can be pretty confident that any weird behaviour you see is a problem with the browser (which is additional context you can use when googling for a solution). If your website requires browsers to quietly patch it into a working state, you have no guarantees that they'll all do it the same way and you'll probably spend a bunch of time working around the differing behaviour.

Obviously, that's not a problem if you already know exactly how different browsers will treat your code, or you're using parsing errors so elemental that they must be patched up identically for the page to work. For example, on the Google homepage, they don't escape ampersands that appear in URLs (like href="http://example.com/?foo=bar&baz=qux", where the & should be &amp;). That's a syntax error, but one that maybe 80% of the web commits, so any browser that couldn't handle it wouldn't be very useful.
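For anyone generating links programmatically, the ampersand case is easy to get right. A minimal Python sketch follows; `link_tag` is a hypothetical helper written for this example, while `html.escape` and `urllib.parse.urlencode` are standard-library functions.

```python
from html import escape
from urllib.parse import urlencode


def link_tag(base_url: str, params: dict, text: str) -> str:
    """Build an <a> tag whose href has its ampersands escaped as &amp;."""
    url = f"{base_url}?{urlencode(params)}"  # raw URL: foo=bar&baz=qux
    # escape() turns & into &amp; (and quotes into entities) for use in an attribute.
    return f'<a href="{escape(url, quote=True)}">{escape(text)}</a>'


tag = link_tag("http://example.com/", {"foo": "bar", "baz": "qux"}, "example")
# tag == '<a href="http://example.com/?foo=bar&amp;baz=qux">example</a>'
```

The browser decodes `&amp;` back to `&` when it reads the attribute, so the request that goes out is identical; only the markup is now valid.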

lambda 5325 days ago

It's interesting that you apparently actually checked a validator to get the error count, and yet the two things you cite as errors are not errors, have never been errors, and are not listed in the errors returned by the validator. Both opening and closing tags for the html, body, and head elements are optional, in all versions of HTML that I am aware of (outside of XHTML, which has never been seriously used on the open web as it isn't supported by IE pre-9). There is a tag reported unclosed by the validator, but that's the center tag.

Anyhow, one downside to having syntax errors might be that parsers which aren't as clever as those in web browsers, and which haven't caught up with the HTML5 parser standard, might choke on your page. This means that crawlers and other software that might try to extract semantic information (like microformat/microdata parsers) might not be able to parse your page. Google probably doesn't need to worry about this too much; there's no real benefit they get from having anyone crawl or extract information from their home page, and there is significant benefit from reducing the number of bytes as much as possible while still remaining compatible with all common web browsers.

I really wish that HTML5 would stop calling many of these problems "errors." They are really more like warnings in any other compiler. There is well-defined, sensible behavior for them specified in the standard. There is no real guesswork being made on the part of the parser, in which the user's intentions are unclear and the parser just needs to make an arbitrary choice and keep going (except for the unclosed center tag, because unclosed tags for anything but the few valid ones can indicate that someone made a mistake in authoring). Many of the "errors" are stylistic warnings, saying that you should use CSS instead of the older presentational attributes, but all of the presentational attributes are still defined and still will be indefinitely, as no one can remove support for them without breaking the web.

JangoSteve 5325 days ago

In Google's case, neglecting to close the tags is intentional, for performance. See http://code.google.com/intl/fr-FR/speed/articles/optimizing-...

gcr 5325 days ago

For the Google homepage, every byte counts. I'm not surprised.

ricardobeat 5325 days ago

Not really, but there are many benefits to keeping your HTML clean. Google seems to just use whatever works, in order to support all kinds of ancient browsers.

dennisgorelik 5325 days ago

It looks like Google Front page developers simply don't care about HTML compliance.

There is no reason to allow most of these errors other than coding sloppiness.

http://validator.w3.org/check?uri=http%3A%2F%2Fwww.google.co...

rmc 5325 days ago

"I often wonder what the web would be like if web browsers threw an error upon encountering a syntax error rather than making a best effort to render."

The web would have died in stillbirth and it would never have grown to where it is now.

"Be generous in what you accept" (part of Postel's Law) is a cornerstone of what made the internet great.

XHTML had a "die upon failure" mode, and it has died. Why do you think XHTML was abandoned and lots of people are using HTML5 now?
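The difference between the two philosophies is easy to see with Python's standard library, as a rough sketch. `xml.etree.ElementTree` behaves like the strict XML mode and rejects malformed input, while `html.parser.HTMLParser` keeps going. Note that `HTMLParser` is only a tolerant tokenizer, not the HTML5 tree-building algorithm browsers use; the `SLOPPY` sample and the helper names are made up for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Unclosed <p>, <br>, <center>, and <html>: typical real-world markup.
SLOPPY = "<html><body><p>Hello<br><center>world</body>"


class TagCollector(HTMLParser):
    """Records every start tag seen, without caring whether it is ever closed."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def strict_parse_ok(markup: str) -> bool:
    """'Die upon failure': True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def lenient_tags(markup: str) -> list:
    """'Be generous in what you accept': return the start tags that were recognized."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags
```

Here `strict_parse_ok(SLOPPY)` is false because the strict parser gives up at the mismatched closing tag, whereas `lenient_tags(SLOPPY)` still recovers all five elements.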

RobertKohr 5323 days ago

"...everything from tables nested to infinity..."

The irony of that statement on Hacker News is pretty amazing. Have you looked at how the threads are rendered on this page? It is tables all the way down.

pilsetnieks 5325 days ago

Considering it's Google and we're talking about almost the whole population of the Earth, even a vanishingly small percentage would still be at least hundreds of thousands of people.

aw3c2 5325 days ago

I do understand the thought, but I don't think it is a good gesture. You could always cap crawl-delay at a reasonable maximum and additionally allow people to fix mistakes through the webmaster tools (e.g. if they told your bots to stay away for a long time but in the meantime want to revert that).

Maybe that hostload could be parsed from robots.txt instead? It sure seems like the better mechanism to tweak for load issues (while traffic/bandwidth issues are still unresolved).

yesbabyyes 5325 days ago

Matt, how does a sitemap fit into this? If I'm not mistaken, you can suggest some refresh rate there, too. Do you take that into account?
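For reference, the refresh rate in question is the optional `<changefreq>` element of the sitemaps.org protocol, which the protocol itself describes as a hint rather than a command. A minimal Python sketch of reading it is below; the sample sitemap, its URLs, and the `change_frequencies` helper are invented for illustration, and it says nothing about how any search engine actually weighs the value.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/</loc>
    <changefreq>daily</changefreq>
  </url>
  <url>
    <loc>http://example.com/about</loc>
  </url>
</urlset>
"""


def change_frequencies(sitemap_xml: str) -> dict:
    """Map each <loc> to its <changefreq> hint, or None when the hint is absent."""
    root = ET.fromstring(sitemap_xml.strip())
    hints = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        freq = url.findtext("sm:changefreq", namespaces=NS)
        if loc:
            hints[loc.strip()] = freq.strip() if freq else None
    return hints


hints = change_frequencies(SAMPLE_SITEMAP)
```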