|
|
|
|
|
by boyter
4695 days ago
|
|
Very and probably not very good (Compare Gigablast to Google as an example of why its hard). Not to take anything away from Common Crawl but crawling is often one of the easier things to build when creating a search engine. A crawler can be as simple as for(listofurls) {
geturl;
add urls to listofurls; } Doing it on a large scale over and over is a harder problem (which common crawl does for you) but its not too difficult until you hit scale or want realtime crawling. Building an index on 210 TB of data however... Assuming you use Sphinx/Solr/Gigablast you are going to need about 50 machines to deal with this amount of data with any sort of redundancy. That's just to hold a basic index which is not including "pagerank" or anything (Gigablast is a web engine so it might have that in there not sure). You aren't factoring in adding rankers to make it a webs search engine, spam/porn detection and all of the other stuff that goes with it. Then you get into serving results. Unless your indexes are in RAM you are going to have a pretty slow search engine. So add a lot more machines to hold the index for common terms in memory. If someone is keen to do this however here are a list of articles/blogs which should get you started (wrote this originally as a HN comment which got a lot of attention so made it into a blog post) http://www.boyter.org/2013/01/want-to-write-a-search-engine-... |
|
What I heard about a smaller search engine was that web crawling is usually augmented with some manually added rules for various sites to prevent spoiling the database. Not a trivial task at all.
Doing queries is IMHO algorithmically much better understood, because it's a constrained problem. But getting information extracted out from the real world, with all the PHP and HTML "hackers", not so easy.