That's only the first part of the problem, and in my opinion not the hardest. Anyone can write a web crawler, since it's essentially:
while (links) {
    get(link)
}
With a little xargs or parallel magic + wget you can suck down billions of pages with little effort.
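To make the loop above a bit more concrete, here is a rough Python sketch of the same idea with a frontier queue and a "seen" set so pages aren't fetched twice. The names `crawl`, `fetch_links`, and `TOY_WEB` are made up for illustration, and `fetch_links` stands in for whatever actually downloads a page and pulls out its links (wget, a real HTTP client, and so on). It runs against a tiny in-memory "web" so it works without network access.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(seeds: Iterable[str],
          fetch_links: Callable[[str], Iterable[str]],
          max_pages: int = 1000) -> list[str]:
    """Breadth-first crawl; returns URLs in the order they were fetched."""
    frontier = deque()   # URLs waiting to be fetched
    seen = set()         # every URL ever queued, so nothing is fetched twice
    for url in seeds:
        if url not in seen:
            seen.add(url)
            frontier.append(url)

    fetched = []
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        fetched.append(url)
        # fetch_links is a hypothetical callable: download the page, return its links.
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched


# Toy in-memory "web" (an assumption for the demo): page -> outgoing links.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c", "a"],
    "c": ["d"],
    "d": [],
}

if __name__ == "__main__":
    print(crawl(["a"], lambda url: TOY_WEB.get(url, [])))
```

A real crawler would also need politeness delays, robots.txt handling, and retries, none of which are shown here.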
There is more to it than that if you want to keep things fresh, but it's not that hard to just kick it off now and then. The difficult part is taking that data and turning it into an index that can span multiple machines.
If it were that easy, DuckDuckGo or someone else would have taken that data and done something with it by now.
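For a feel of what "an index that spans multiple machines" means, here is a toy sketch of a term-partitioned inverted index in Python. Everything in it is an assumption for illustration: the `ShardedIndex` class, the whitespace tokenizer, and the choice to partition by term rather than by document. Each in-process shard stands in for a separate machine. Real search engines also deal with ranking, replication, rebalancing, and incremental updates, which is where most of the difficulty lives.

```python
import hashlib
from collections import defaultdict


class ShardedIndex:
    """Toy inverted index split across shards by a stable hash of the term."""

    def __init__(self, num_shards: int) -> None:
        # Each shard maps term -> set of document ids (a posting list).
        self.shards = [defaultdict(set) for _ in range(num_shards)]

    def _shard_for(self, term: str) -> int:
        # Use a stable hash; Python's built-in hash() of str varies per process.
        digest = hashlib.md5(term.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % len(self.shards)

    def add(self, doc_id: int, text: str) -> None:
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        for term in text.lower().split():
            self.shards[self._shard_for(term)][term].add(doc_id)

    def search(self, query: str) -> set:
        """Return ids of documents containing every term in the query."""
        terms = query.lower().split()
        if not terms:
            return set()
        # One lookup per term, each going to the shard that owns that term.
        postings = [self.shards[self._shard_for(t)].get(t, set()) for t in terms]
        return set.intersection(*postings)


if __name__ == "__main__":
    index = ShardedIndex(num_shards=4)
    index.add(1, "web crawler index")
    index.add(2, "distributed index")
    print(index.search("web index"))
```

Partitioning by term keeps a single-term lookup on one shard, but multi-term queries have to gather posting lists from several shards. Partitioning by document is the other common choice and trades that for fanning every query out to all shards.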