Hacker News
by 0cf8612b2e1e 1038 days ago
Even if that rule were true, why wouldn’t everything in, say, the top NNN internet sites get an exemption? It is the Internet’s most-hit content, so why would it not be exhaustively indexed?

Alternatively, other than ads, what is changing on a CNN article from 10 years ago? Why would that still be getting daily scans?

4 comments

Probably bad technology detecting a change. Things like the current-news list showing up beneath the article, which changes whenever a new article is added, can trigger it. I've seen this happen on quite a few large websites. It might be technologically easier to drop old articles than to spend the time fixing whatever they use to determine whether a page has changed. You would think a site like CNET wouldn't have to deal with something like that, but sometimes sites that have been around for a long time have some seriously outdated tech.
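To make the false-positive problem concrete, here is a rough sketch of a change detector that fingerprints only the article body instead of the whole page. It assumes the article text sits inside an `<article>` element, which is my assumption and not something known about CNET's markup; `ArticleText` and `article_fingerprint` are made-up names for illustration.

```python
import hashlib
from html.parser import HTMLParser


class ArticleText(HTMLParser):
    """Collect only the text found inside <article> elements (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <article> elements we are currently inside
        self.chunks = []    # text fragments seen inside an <article>

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def article_fingerprint(html):
    """Hash the whitespace-normalized article text, ignoring the rest of the page."""
    parser = ArticleText()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Two fetches of the same old article; only the "latest news" box differs.
page_a = ("<html><body><article><h1>Title</h1><p>Body text.</p></article>"
          "<aside>Latest: story one</aside></body></html>")
page_b = ("<html><body><article><h1>Title</h1><p>Body text.</p></article>"
          "<aside>Latest: story two</aside></body></html>")

whole_page_changed = (hashlib.sha256(page_a.encode()).hexdigest()
                      != hashlib.sha256(page_b.encode()).hexdigest())
article_changed = article_fingerprint(page_a) != article_fingerprint(page_b)
```

Hashing the whole page reports a change here, while the article-only fingerprint does not. Real pages are messier (ads or related links injected inside the article container), so treat this as an illustration of the idea and not a fix for any particular site.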
That's a good point about the static nature of some pages. Is there any way to tell a crawler: crawl this page, but after this date don't crawl it again, and keep anything you previously crawled?
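I don't know of a directive that says exactly that, but HTTP conditional requests get part of the way: if the server sends `Last-Modified` and the crawler comes back with `If-Modified-Since`, the server can answer `304 Not Modified` with no body. A minimal sketch of the server-side decision follows; `conditional_status` is a hypothetical helper, not part of any framework, and whether a crawler keeps its old copy or how often it revisits is still up to the crawler.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_status(last_modified, if_modified_since=None):
    """Return (status, headers) for a GET on a page last changed at last_modified.

    last_modified must be an aware datetime in UTC; if_modified_since is the raw
    request header value, or None when the client did not send one.
    """
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: fall through to a full response
        if since is not None and since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)  # treat naive dates as UTC
        # HTTP dates have one-second resolution, so drop microseconds first.
        if since is not None and last_modified.replace(microsecond=0) <= since:
            return 304, headers
    return 200, headers


# An article last edited at the start of 2013, revisited by a crawler in 2020.
edited = datetime(2013, 1, 1, tzinfo=timezone.utc)
status, headers = conditional_status(edited, "Wed, 01 Jan 2020 00:00:00 GMT")
```

Of course this only helps if the `Last-Modified` value reflects the article itself and not the ever-changing widgets around it, which is the same problem again.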
the ads are different.

i am tracking rss feeds of many sites, and on some i get notifications for old articles because something irrelevant in the page changed.

CNET* not CNN. But everything you say is still true.