by gojomo
5057 days ago
Nothing's wrong with Java. Commercial and research-quality crawlers of tens of billions of web resources have been written in Java for over a decade. Its threading/concurrency support and extensive, well-optimized libraries make it easier to make your code fast over large datasets... if you're good at Java. (If you're not, there are plenty of ways to sabotage yourself.)

But Java's a bit verbose, has gaps in concise support for higher-level constructs, and sometimes the static typing gets in the way. So if you don't find those parts helpful -- some do -- and think your performance targets can be met with other, later optimizations/design choices/selective reimplementations, stick with whatever more concise language you're good at. Or use any of the more concise languages available on the JVM that allow intermixing the occasional Java facility, like Jython, JRuby, Groovy, JavaScript, Scala, Clojure, and others.

(If efficiently handling massive numbers of concurrent net/IO streams is a priority, the recent JVM-based project vert.x may be of interest. I haven't used it for anything but toy tests, but it seems to combine some of the best practices for maximum JVM IO throughput with a somewhat higher-level, language-agnostic top layer well suited to servers/proxies/crawlers.)
Although our use of Java as an implementation language was somewhat controversial when we began the project, we have not regretted the choice. Java’s combination of features — including threads, garbage collection, objects, and exceptions — made our implementation easier and more elegant. Moreover, when run under a high-quality Java runtime, Mercator’s performance compares well to other web crawlers for which performance numbers have been published.
source: [Mercator: A scalable, extensible web crawler (1999)](http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.151.5...)