by marginalia_nu
1543 days ago
Seems the first of these can be solved by reducing the scope. Do you really need a data center to run a search engine? Overall, it seems very rare that anyone considers this an engineering problem. Really, what's stopping you from running a search engine?
But if you must know:
First you need to collect a lot of content from the internet, from many different sites with very different code structures. Broken HTML. More often than not, content hidden behind some SPA JS code. Pages behind robots.txt files and bot protection efforts.
So the first problem to solve would be building a crawler at scale, one that is able to crawl anything your users might want to visit but don't know of yet.
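To make the crawler part concrete, here is a minimal single-site sketch using only the Python standard library. It is nowhere near "at scale": there is no JS rendering, politeness delay, or distribution. The `fetch` callable, the `example-bot` user agent, and the `limit` parameter are assumptions made up for illustration; `fetch` is expected to return the page's HTML as a string, or None on failure.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags; tolerant of broken HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, robots_txt, user_agent="example-bot", limit=100):
    """Breadth-first crawl of one host, skipping URLs robots.txt disallows.

    fetch(url) is a hypothetical callable returning HTML text or None.
    robots_txt is the already-downloaded robots.txt body for the host.
    """
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    host = urlparse(seed).netloc
    frontier = deque([seed])  # URLs waiting to be fetched
    seen = {seed}             # URLs already queued, to avoid loops
    pages = {}                # url -> raw HTML

    while frontier and len(pages) < limit:
        url = frontier.popleft()
        if not rules.can_fetch(user_agent, url):
            continue
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]  # resolve and drop fragment
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

Passing `fetch` in keeps the sketch testable offline; in practice it would wrap an HTTP client with timeouts and rate limiting.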
Then storage and retrieval. You need to store and update all the content your crawler collected. You need to enrich it with metadata and organize it for efficient retrieval, so that you can surface it to your users when they use your search engine. Indexing, structure, building connections between content pieces. A lot of interesting things to think about.
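As a rough illustration of the "organize it for efficient retrieval" part, the usual starting point is an inverted index: a map from each term to the documents containing it. The sketch below is a toy in-memory version; the function names and the simple lowercase alphanumeric tokenizer are assumptions for illustration, and a real engine would add ranking, on-disk storage, and incremental updates.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index: term -> set of ids of documents containing it.

    docs maps a document id to its plain text.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term of the query (AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

A lookup then touches only the posting sets of the query terms instead of scanning every stored document, which is what makes retrieval cheap compared to crawling and storage.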
Then there is the front end. Make it easy to search, to refine. Surface relevant content for search queries.
Oh, maybe I forgot, but you probably need to do a bit of engineering to make your system understand the users' search intent.
This is relatively straightforward for a limited search and document space, up to a few million entries in your DB. A few million documents should be doable with off-the-shelf parts.
Bigger than that, I would applaud you if you got it done with orders of magnitude fewer resources than Google. Anyone would.