| All of this is a long series of solvable problems. I should know, I've dabbled in solving most of them. This is why I suggest actually taking a stab at it before you dismiss it as impossible.
Some problems aren't as big as they seem. Parts of an SPA can't be reliably linked to anyway, even if you find interesting text there, so you can just leave them out of the index.
Likewise, there isn't as great a need to keep a fresh index as it may seem. The odds of a document having changed since you last crawled it are proportional to how frequently it changes. This is a bit of a paradox: even if you crawl really aggressively, the most frequently changing documents will still always be out of date. Most documents are relatively stable over time. You can actually use how often you see changes to a document or website to modulate how often you crawl it.
The bad HTML is quite manageable. You really just need to flatten the document to get at the visible text. Even with really broken formatting, that's doable.
The storage demands are also not as bad as you might think (most documents are tiny, under 10 KB), and there are ways to lessen the blow on top of that. Both text and indexes compress extremely well. Since you're paying for disk access by the block, you might as well cram more stuff into a block.
Most of the crawling concerns, in general, can be gotten around by starting off with Common Crawl (even if I do my own crawling, which is also finicky but manageable).
> This is relatively straightforward for a limited search and document space up to a few million entries in your DB.
A few million documents should be doable with off-the-shelf parts. Right, so shouldn't the question be how to find the documents that are even candidates for being search results? Most documents are not ever going to be relevant to any query ever. Get rid of that noise and your hardware goes a lot further.
I'm running a search engine on consumer hardware out of my living room that can index 100 million documents. Go a bit higher budget than a consumer PC, and you've got 5 billion. That goes a long way. |
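For anyone wondering what "use how often you see changes to modulate how often you crawl" might look like in practice, here is a minimal sketch. It is not the commenter's actual scheduler. The names (`CrawlState`, `update_interval`), the halve/grow factors, and the interval bounds are all assumptions picked for illustration: the recrawl interval shrinks when the content hash changed since the last visit and grows when it did not.

```python
from dataclasses import dataclass

# Hypothetical bounds, in hours: never recrawl more often than every
# 6 hours, never less often than every 90 days.
MIN_INTERVAL_H = 6.0
MAX_INTERVAL_H = 24.0 * 90


@dataclass(frozen=True)
class CrawlState:
    interval_h: float = 24.0 * 7  # assumed starting point: one week
    content_hash: str = ""        # empty means "never crawled before"


def update_interval(state: CrawlState, new_hash: str) -> CrawlState:
    """Return the next crawl state after fetching a document.

    Changed documents get their interval halved, unchanged ones get it
    multiplied by 1.5. The result is clamped to the bounds above.
    """
    if not state.content_hash:
        # First visit: nothing to compare against, keep the interval.
        return CrawlState(state.interval_h, new_hash)

    factor = 0.5 if new_hash != state.content_hash else 1.5
    interval = state.interval_h * factor
    interval = max(MIN_INTERVAL_H, min(MAX_INTERVAL_H, interval))
    return CrawlState(interval, new_hash)
```

With a scheme like this, stable documents drift toward the maximum interval and cost almost nothing to keep current, while the lower bound caps how much crawl budget the constantly changing ones can absorb.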
What qualifies? What defines signal, and what defines noise? I agree that a lot of pages (probably nearly all) will receive very, very little traffic or search requests. But does that make them irrelevant?
> I'm running a search engine on consumer hardware out of my living room that can index 100 million documents.
That's extremely cool. I would love to know more. To me that's already an impressive feat.