Hacker News new | ask | show | jobs
by crazygringo 1041 days ago
But as that linked guide explains, that's only relevant for sites with e.g. over a million pages changing once a week.

That's for stuff like large e-commerce sites with constantly changing product info.

Google is clear that if your content doesn't change often (in the way that news articles don't), then crawl budget is irrelevant.

3 comments

Google crawls the entire page, not just the subset of text that you, a human, recognize as the unchanged article.

It’s easy to change millions of pages once a week with on-load CMS features like content recommendations. Visit an old article and look at the related articles, most read, read this next, etc widgets around the page. They’ll be showing current content, which changes frequently even if the old article text itself does not.

I'm pretty sure Google is smart enough to recognize the main content of a page, and ignore things like widgets and navigation. That's Search Engine 101.
Yes, of course, but that analysis happens after the content has been visited by the bot. It’s still a visit, and still hits the “crawl budget.”
So they should stop doing this on pages that they are deleting now.
It’s possible they examined the server logs for requests from GoogleBot and found it wasting time on old content (this was not mentioned in the article but would be a very telling data point beyond just “engagement metrics”).

There’s some methodology to trying to direct Google crawls to certain sections of the site first - but typically Google already has a lot of your URLs indexed and it’s just refreshing from that list.

To determine whether content changes Google has to spend budget as well, hasn't it? So it has to fetch that 20-years old article.
> So it has to fetch that 20-years old article.

It doesn't have to fetch every article (statical sampling can give confidence intervals), and it doesn't have to fetch the full article: doing a "HEAD /" instead of a "GET /" will save on bandwidth, and throwing in ETag / If-Modified-Since / whatever headers can get the status of an article (200 versus 304 response) without bother with the full fetch.

There’s an obvious way this can be exploited. Bait and switch.
If the content is literally the same, the crawler should be able to use If-Modified-Since, right? It still has to make a HTTP request, but not parse or index anything.
If the content is dynamic (e.g. a list of popular articles in a sidebar has changed), then the page will be considered "updated".
This is not correct. It’s up to the server, controlled by the application to send that or other headers. Similar to sending a <title> tag. The headers take priority and similar to what another person said they will do a HEAD request first and not bother with a GET request for the content.