| Really? Or are you the one I should have refrained from feeding? But if you must know: First you need to collect a lot of content from the internet. From many different sites. With very different types of code structure. Broken HTML. More often than not behind some SPA JS code. Behind robots.txt files and bot protection efforts. So the first problem to solve would be building a crawler at scale. One that is able to crawl anything your users might want to visit but don't know of yet. Then storage and retrieval. You need to store and update all this content your crawler collected. You need to enrich it with metadata and organize it for efficient retrieval. So that you can surface it to your users when they use your search engine. Indexing, structure, building connections between content pieces. A lot of interesting things to think about. Then there is the front end. Make it easy to search, to refine. Surface relevant content for search queries. Oh, maybe I forgot, but you probably need to do a bit of engineering to make your system understand the users' search intent. This is relatively straightforward for a limited search and document space up to a few million entries in your DB. A few million documents should be doable with off the shelf parts. Bigger than that, I would applaud you if done with orders of magnitude less than Google. Anyone would. |
There are some problems that aren't as big as they seem. Parts of an SPA can't be reliably linked to anyway even if you find interesting text there, so you can just leave them out of the index.
Likewise, there isn't as great a need to keep a fresh index as it may seem. The odds of a document having changed since you last crawled it are proportional to how frequently it changes. This is a bit of a paradox: even if you crawl really aggressively, the most frequently changing documents will still always be out of date. Most documents are relatively stable over time. You can actually use how often you see changes to a document or website to modulate how often you crawl it.
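To make the "modulate how often you crawl" idea concrete, here is a minimal sketch of one way it could work. It is an assumption for illustration, not a description of my actual crawler: the function name, the halve/grow factors, and the one-hour and 90-day bounds are all made up.

```python
def next_crawl_interval(current_interval_s, changed,
                        min_s=3600.0, max_s=90 * 86400.0):
    """Return the number of seconds to wait before the next crawl.

    Hypothetical policy: if the document changed since the last visit,
    halve the interval; if it did not, grow the interval by 50%.
    The result is clamped to [min_s, max_s].
    """
    factor = 0.5 if changed else 1.5
    return max(min_s, min(max_s, current_interval_s * factor))


# A page that keeps changing drifts toward the minimum interval,
# while a stable page drifts toward the maximum.
interval = 86400.0  # start at one day
for changed in (False, False, True):
    interval = next_crawl_interval(interval, changed)
```

Stable documents end up being visited rarely, which is where most of the crawl budget savings would come from.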
The bad HTML is quite manageable. You really just need to flatten the document to get at the visible text. Even with really broken formatting, that's manageable.
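As a rough illustration of flattening, the sketch below uses Python's standard `html.parser`, which does not require well-formed markup. The class and function names are hypothetical, and a real crawler would need more care (encodings, hidden elements, and so on).

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks with spaces and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def flatten_html(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Unclosed `<p>` and `<b>` tags don't matter here, because nothing depends on the tree structure. One known limitation of this sketch: joining chunks with spaces will split a word that has inline markup in the middle of it.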
The storage demands are also not as bad as you might think (most documents are tiny, under 10 KB), and there are ways to lessen the blow on top of that. Both text and indexes compress extremely well. Since you're paying for disk access by the block, you might as well cram more stuff into a block.
Most of the crawling concerns, in general, can be gotten around by starting off with Common Crawl (even though I do my own crawling, which is also finicky but manageable).
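One common way index data gets small is delta plus variable-byte encoding of sorted document IDs. The sketch below is an illustration of that general technique, not necessarily the scheme I use; the function names are hypothetical, and it assumes the IDs are non-negative and sorted in ascending order.

```python
def encode_postings(doc_ids):
    """Encode ascending doc IDs as gaps, 7 bits per byte, low bits first."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # high bit set: more bytes follow
            gap >>= 7
        out.append(gap)  # final byte of this gap has the high bit clear
    return bytes(out)


def decode_postings(data):
    """Inverse of encode_postings."""
    doc_ids, current, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += gap
            doc_ids.append(current)
            gap, shift = 0, 0
    return doc_ids


ids = [3, 7, 130, 131, 100000]
encoded = encode_postings(ids)
```

Here the five IDs take 7 bytes instead of the 20 they would need as fixed 32-bit integers, and the denser the posting list, the smaller the gaps get.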
> This is relatively straightforward for a limited search and document space up to a few million entries in your DB. A few million documents should be doable with off the shelf parts.
Right, so shouldn't the question be how to find the documents that are even candidates for being search results? Most documents are never going to be relevant to any query. Get rid of that noise and your hardware goes a lot further.
I'm running a search engine on consumer hardware out of my living room that can index 100 million documents. Go a bit higher budget than a consumer PC, and you've got 5 billion. That goes a long way.