by swvjeff 3077 days ago
That's really not relevant to this article. The author is not talking about crawling and indexing the entire web (although he mentions the "whole web" once, that's clearly not what he means). He is wondering why old pages -- pages that used to be in Google's index -- are no longer showing up in SERPs even when using appropriately-targeted long-tail queries.

For the same reason 'Joe Infinity page 1234567' would not be found anymore. Google thinks it's not relevant enough to keep it indexed. Yes, it is debatable what is relevant enough and what isn't. But everyone who indexes 'the web' has to decide what to keep and what to drop. Nobody can store 'everything'.

Also, it's not as easy as just keeping everything that was ever in the index. Search engines would then link to nonexistent URLs most of the time. Most URLs have a short lifespan. Links rot pretty fast.

I completely agree with you, but your initial argument was that "Joe Infinity page ∞" wouldn't be indexed because Google cannot index every viable page on the internet. That is true, and Google will certainly set limits on which pages it crawls and which pages it indexes. However, in this instance the articles were crawled, they were indexed, and they were relevant at one point in time. But Google decided to remove them from SERPs for some reason or another (age, lack of traffic, etc.).
On the contrary, it is very relevant.

I run a search engine. What I save and think matters can be expressed in a very definite dollar value.

In practical reality, keeping old pages equals the "whole web": if the index never gets trimmed, it keeps growing, and the cost grows exponentially.