by fosstrack 911 days ago
Second that. Anyone interested in studying web crawler tech should definitely take a look at Heritrix. I used it extensively back when it was still on 2.x. They got so many things right about writing well-behaved, fault-tolerant crawlers. Plus the code is very modular and extensible, if you know some Java. The other popular option back then was Apache Nutch, but it carried too much Hadoop baggage.

Hadoop is a bit of a nuisance in this general corner of Java. It has a propensity for integrating deeply with cluster-adjacent technology in a way that is very difficult to root out.

Kind of a pity, since it makes things that could be very easy, such as reading and writing Parquet files, much harder than they need to be in Java.