


	by 0cf8612b2e1e 1084 days ago
	Even if that rule were true, why wouldn’t everything in, say, the top NNN internet sites get an exemption? It is the Internet’s most-hit content, so why would it not be exhaustively indexed? Alternatively, other than ads, what is changing on a CNN article from 10 years ago? Why would that still be getting daily scans?

4 comments

progmetaldev 1084 days ago

Probably bad technology detecting a change. Things like current news showing up beneath the article, which changes whenever a new article is added. I've seen this happen on quite a few large websites. It might be technologically easier to drop old articles than to spend the time fixing whatever they use to determine if a page has changed. You would think a site like CNET wouldn't have to deal with something like that, but sometimes sites that have been around for a long time have some seriously outdated tech.
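To illustrate the kind of false positive described above, here is a minimal sketch of change detection that fingerprints only the article body instead of the whole page. It assumes the article text lives inside an `<article>` element, which is a guess about the markup and not how any particular site or crawler works; the names `ArticleText` and `article_fingerprint` and the sample pages are hypothetical.

```python
import hashlib
from html.parser import HTMLParser


class ArticleText(HTMLParser):
    """Collects text found inside <article> elements and ignores the rest."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <article> tags are currently open
        self.chunks = []  # text fragments seen inside <article>

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def article_fingerprint(html: str) -> str:
    """Hash only the article text (assumed to sit inside <article>)."""
    parser = ArticleText()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so purely cosmetic reformatting does not count as a change.
    text = " ".join("".join(parser.chunks).split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical pages: same article, different "latest news" sidebar.
old_page = "<html><article><h1>Title</h1><p>Body text.</p></article><aside>Latest: story A</aside></html>"
new_page = "<html><article><h1>Title</h1><p>Body text.</p></article><aside>Latest: story B</aside></html>"
# Hypothetical page where the article itself was edited.
edited_page = "<html><article><h1>Title</h1><p>Body text, updated.</p></article><aside>Latest: story B</aside></html>"

# Whole-page hashing flags the sidebar change; article-only hashing does not.
whole_page_changed = hashlib.sha256(old_page.encode()).hexdigest() != hashlib.sha256(new_page.encode()).hexdigest()
article_changed = article_fingerprint(old_page) != article_fingerprint(new_page)
```

A naive whole-page hash reports a change whenever the sidebar updates, while the article-only fingerprint changes only when the article text does. Real pages are messier than this (ads or related links injected inside the article element, for instance), so treat it as an illustration of the idea only.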


kenjackson 1084 days ago

That's a good point about the static nature of some pages. Is there any way to tell a crawler to crawl this page but, after a given date, not crawl it again while keeping anything it previously crawled?


em-bee 1084 days ago

the ads are different.

i am tracking rss feeds of many sites, and on some i get notifications for old articles because something irrelevant in the page changed.


bhandziuk 1084 days ago

CNET* not CNN. But everything you say is still true.
