by fosstrack
911 days ago
Second that. Anyone interested in studying web crawler tech should definitely take a look at Heritrix. I used it extensively when it was still in 2.x. They got so many things right about writing well-behaved, fault-tolerant crawlers. Plus, the code is very modular and extensible, if you know some Java. The other popular option back then was Apache Nutch, but it carried too much Hadoop baggage.

Kind of a pity, since it has the effect of making things that could be very easy, such as reading and writing Parquet files, much harder than they need to be in Java.