Hacker News
by tylerl 2613 days ago
You'll see this effect from every search engine. They have no choice: there are a lot of sites with an infinite number of pages. So instead, the number of pages they store per site depends on how important your site is, and they try to store your top N pages by relative importance.
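To make the "top N pages" idea concrete, here is a rough sketch. Everything in it is an assumption for illustration: the per-page scores, the `site_importance` multiplier, and the `base_budget` default are hypothetical, not anything a real search engine has documented.

```python
import heapq

def pages_to_keep(page_scores, site_importance, base_budget=1000):
    """Return the URLs of the top-N pages, where N scales with site importance.

    page_scores: dict mapping URL -> importance score (hypothetical).
    site_importance: non-negative multiplier for the site (hypothetical).
    base_budget: pages kept for a site of importance 1 (hypothetical).
    """
    budget = int(base_budget * site_importance)
    # nlargest returns the highest-scoring URLs first and never more
    # than the site actually has.
    return heapq.nlargest(budget, page_scores, key=page_scores.get)
```

A more important site gets a larger budget, so fewer of its pages fall off the end of the ranking.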

I'm not sure I buy that they have no choice. For websites that literally have an infinite number of (dynamically generated) pages, sure, they could detect that and exclude them. But we're talking about unique, static pages here. And they don't even have to store the whole page, just the indexed info. I read this as, they could, but it's cheaper not to, and most people won't notice anyway.
I'm not sure that's true. How can one automatically determine whether a page is unique or static? As a trivial example, a URL path that accepts arbitrary strings and hashes them generates unique, immutable pages, but obviously cannot be crawled in its totality.
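A minimal sketch of that trivial example, assuming a hypothetical handler named `render_page` that is called with whatever path was requested:

```python
import hashlib

def render_page(path: str) -> str:
    """Return the page body for an arbitrary URL path (hypothetical handler)."""
    # The same path always hashes to the same digest, so every page is
    # immutable; different paths give different digests, so pages are
    # unique in practice.
    digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
    return f"<html><body>{digest}</body></html>"
```

Every string is a valid path here, so the set of pages is unbounded even though each individual page never changes.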
> How can one automatically determine whether a page is unique or static?

They crawled it for years and it never changed? It is a blog post.

The person you are replying to said "unique, immutable pages"; by definition, you would be able to crawl these for years and they would never change. As an example, [1] is a site that contains all possible pages of 3200 characters and can index that content consistently.

[1] http://libraryofbabel.info/About.html

So, this issue isn't about sites that Google can't crawl in totality; it's about sites where they discard pages that they have crawled. If a site has fewer than [large number] pages, there would be no need to worry about it; they could just index them all. But it's not like their indexing algorithm is operating naively either: for sites with a lot of pages, there's plenty of analysis they can do, such as checking whether the pages contain coherent text, to determine whether the information is worth indexing.

In the case described here though, these pages were actually indexed at one point; Google just decided that once they reached a given age, they were no longer necessary to remember. They could have simply decided to keep them instead.