by worried_citizen
3822 days ago
Web crawling is just like most things: 80% of the results for 20% of the work. It's always the last mile that takes the most cost and engineering effort. As your index and scale grow, you bump into the really difficult problems:

1. How do you handle so many DNS requests/sec without overloading upstream servers?
2. How do you discover and determine the quality of new links? It's only a matter of time until your crawler hits a treasure trove of spam domains.
3. How do you store, update, and access an index that's exponentially growing?

Just some ideas.
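To make point 1 a bit more concrete, here is a rough sketch of one common mitigation: caching DNS answers inside the crawler so repeated URLs on the same host don't each trigger an upstream lookup. The `DnsCache` class, its fixed `ttl_seconds`, and the `upstream_lookups` counter are all made up for illustration. A real crawler would more likely honor each record's own TTL and run a local caching resolver, so treat this as a toy, not a recommendation.

```python
import socket
import time


class DnsCache:
    """Toy in-process DNS cache with one fixed TTL (hypothetical helper)."""

    def __init__(self, ttl_seconds=300.0, resolver=socket.gethostbyname,
                 clock=time.monotonic):
        self.ttl_seconds = ttl_seconds  # assumption: one TTL for every host
        self.resolver = resolver        # injectable, so it can be swapped out
        self.clock = clock
        self._entries = {}              # hostname -> (ip, expiry time)
        self.upstream_lookups = 0       # how many lookups actually went upstream

    def resolve(self, hostname):
        now = self.clock()
        cached = self._entries.get(hostname)
        if cached is not None and cached[1] > now:
            return cached[0]            # still fresh: no upstream traffic
        ip = self.resolver(hostname)    # cache miss or expired entry
        self.upstream_lookups += 1
        self._entries[hostname] = (ip, now + self.ttl_seconds)
        return ip
```

With something like this, a crawler fetching thousands of pages from one host pays for roughly one upstream lookup per TTL window instead of one per page. It does nothing for the long tail of hosts you only visit once, which is where the real pain is.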
So crawling a billion links in X hours is not trivial, but it's not that hard either, especially with cloud infrastructure like AWS. It's just a matter of a good enough implementation and how much money one wants to spend on it.
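For a feel of the numbers, here is some back-of-envelope arithmetic. The 24-hour window and the 500 pages/sec per machine are assumptions picked purely for illustration (the comment above only says "X hours"), and the function names are made up.

```python
import math


def required_fetch_rate(total_urls, hours):
    """Average pages per second needed to finish within the time window."""
    return total_urls / (hours * 3600)


def machines_needed(total_urls, hours, pages_per_sec_per_machine):
    """Machines required, rounding up, at an assumed per-machine rate."""
    rate = required_fetch_rate(total_urls, hours)
    return math.ceil(rate / pages_per_sec_per_machine)


# Assumed inputs: 1 billion URLs, 24 hours, 500 pages/sec per machine.
rate = required_fetch_rate(1_000_000_000, 24)
print(f"{rate:,.0f} pages/sec")                   # about 11,574 pages/sec
print(machines_needed(1_000_000_000, 24, 500))    # 24 machines
```

So under those assumptions it is a couple dozen machines for a day, which is where the "how much money one wants to spend" part comes in. The per-machine rate is the number that depends on the quality of the implementation.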