1) Build their own crawlers.
2) Use an Apache Nutch/Heritrix cluster in a colocation facility.
3) Use third-party services like Mixnode.