by crazygringo 1041 days ago

But as that linked guide explains, that's only relevant for sites with e.g. over a million pages changing once a week.

That's for stuff like large e-commerce sites with constantly changing product info.

Google is clear that if your content doesn't change often (in the way that news articles don't), then crawl budget is irrelevant.

3 comments

snowwrestler 1041 days ago

Google crawls the entire page, not just the subset of text that you, a human, recognize as the unchanged article.

It’s easy to change millions of pages once a week with on-load CMS features like content recommendations. Visit an old article and look at the related articles, most read, read this next, etc widgets around the page. They’ll be showing current content, which changes frequently even if the old article text itself does not.

crazygringo 1040 days ago

I'm pretty sure Google is smart enough to recognize the main content of a page, and ignore things like widgets and navigation. That's Search Engine 101.

snowwrestler 1040 days ago

Yes, of course, but that analysis happens after the content has been visited by the bot. It’s still a visit, and still hits the “crawl budget.”

lostmsu 1040 days ago

So they should stop doing this on pages that they are deleting now.

linkjuice4all 1041 days ago

It’s possible they examined the server logs for requests from GoogleBot and found it wasting time on old content (this was not mentioned in the article but would be a very telling data point beyond just “engagement metrics”).
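As a rough sketch of that kind of log check, assuming the server writes the common "combined" access log format: the sample lines, paths, and the `googlebot_hits_by_path` helper below are all made up for illustration. Matching on the user agent string alone is also an assumption, since that string can be spoofed.

```python
import re
from collections import Counter

# Pulls the request path and the user agent out of a combined-format log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_hits_by_path(lines):
    """Count requests per path whose user agent claims to be Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("agent"):
            hits[match.group("path")] += 1
    return hits

# Hypothetical log lines, not taken from any real site.
GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample = [
    f'66.249.66.1 - - [10/Aug/2023:10:00:00 +0000] "GET /2003/old-article HTTP/1.1" 200 5120 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [10/Aug/2023:10:05:00 +0000] "GET /2003/old-article HTTP/1.1" 200 5120 "-" "{GOOGLEBOT}"',
    '203.0.113.7 - - [10/Aug/2023:10:06:00 +0000] "GET /2023/new-article HTTP/1.1" 200 7300 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
    f'66.249.66.1 - - [10/Aug/2023:10:07:00 +0000] "GET /2023/new-article HTTP/1.1" 200 7300 "-" "{GOOGLEBOT}"',
]

hits = googlebot_hits_by_path(sample)
for path, count in hits.most_common():
    print(count, path)
```

Grouping the counts by publication year or site section would show whether the bot is spending most of its visits on old content.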

There’s some methodology to trying to direct Google crawls to certain sections of the site first - but typically Google already has a lot of your URLs indexed and it’s just refreshing from that list.

codedokode 1041 days ago

To determine whether content has changed, Google has to spend budget as well, doesn't it? So it has to fetch that 20-year-old article.

throw0101a 1041 days ago

> So it has to fetch that 20-year-old article.

It doesn't have to fetch every article (statistical sampling can give confidence intervals), and it doesn't have to fetch the full article: doing a "HEAD /" instead of a "GET /" will save on bandwidth, and throwing in ETag / If-Modified-Since / whatever headers can get the status of an article (200 versus 304 response) without bothering with the full fetch.
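A minimal sketch of the revalidation flow, to make the 200 versus 304 distinction concrete. Nothing here touches the network: `fake_origin` is a hypothetical stand-in for a web server holding one unchanged article, and `conditional_headers` and `revalidate` are illustrative names, not how any real crawler is known to work.

```python
def conditional_headers(etag=None, last_modified=None):
    """Build the request headers that ask 'has this changed since I last saw it?'."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers

def fake_origin(request_headers):
    """Hypothetical server: answers 304 with no body when the validator still matches."""
    current_etag = '"article-v1"'
    if request_headers.get("If-None-Match") == current_etag:
        return 304, {"ETag": current_etag}, b""
    return 200, {"ETag": current_etag}, b"<html>20-year-old article</html>"

def revalidate(cache_entry):
    """Return (cache_entry, body_was_transferred) after one request to the origin."""
    request_headers = conditional_headers(etag=cache_entry.get("etag"))
    status, headers, body = fake_origin(request_headers)
    if status == 304:
        return cache_entry, False  # cached copy is still good
    return {"etag": headers.get("ETag"), "body": body}, True

entry, fetched_first = revalidate({})      # nothing cached yet: full 200 response
entry, fetched_again = revalidate(entry)   # validator matches: 304, no body
print(fetched_first, fetched_again)
```

Either way the server still sees one request per check, so this saves bandwidth and parsing rather than the visit itself.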

voramok 1041 days ago

There’s an obvious way this can be exploited. Bait and switch.

strken 1041 days ago

If the content is literally the same, the crawler should be able to use If-Modified-Since, right? It still has to make an HTTP request, but not parse or index anything.

codedokode 1041 days ago

If the content is dynamic (e.g. a list of popular articles in a sidebar has changed), then the page will be considered "updated".

wise_young_man 1041 days ago

This is not correct. It’s up to the server, as controlled by the application, to send that or other headers, similar to sending a <title> tag. The headers take priority and, as another commenter said, the crawler will do a HEAD request first and not bother with a GET request for the content.
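To illustrate the point that the application decides what the validator covers, here is a sketch assuming a hypothetical CMS that can separate the article body from its sidebar widgets. `etag_for`, `render`, and `respond` are made-up names, and the weak `W/` prefix is used because an article-only hash signals "equivalent content" rather than byte-identical pages.

```python
import hashlib

def etag_for(content: str) -> str:
    """Weak ETag derived from a hash of whatever content the app chooses to cover."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
    return f'W/"{digest}"'

def render(article: str, sidebar: str) -> str:
    return f"<main>{article}</main><aside>{sidebar}</aside>"

def respond(if_none_match, article, sidebar, validator="article"):
    """Return (status, etag, body). validator picks what the ETag is computed from."""
    page = render(article, sidebar)
    etag = etag_for(article if validator == "article" else page)
    if if_none_match == etag:
        return 304, etag, ""
    return 200, etag, page

article = "Text of an old article."

# ETag over the whole page: a changed sidebar looks like an updated page.
_, page_etag, _ = respond(None, article, "most read: A, B", validator="page")
status_page, _, _ = respond(page_etag, article, "most read: C, D", validator="page")

# ETag over the article only: the same sidebar change still revalidates as 304.
_, article_etag, _ = respond(None, article, "most read: A, B", validator="article")
status_article, _, _ = respond(article_etag, article, "most read: C, D", validator="article")

print(status_page, status_article)
```

Whether a given CMS actually scopes its validators this way is a separate question; many simply hash or timestamp the full rendered page.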
