by bpchaps
3402 days ago
On the Clojure side of things, I recently used itsy [0] to scrape and parse ~4M pages in a few hours. It's very plug-and-play but still reasonably extensible. Parsing with Tika turned out to be extremely useful.

While we're on the topic, does anyone have other recommendations for web crawlers? I'm particularly interested in finding unique identifiers (phone numbers, emails) and their surrounding context on government-owned websites for a project.

[0] https://github.com/junjiemars/itsy
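
To make the "identifiers and their contexts" part concrete, here is a rough sketch in Python of what I mean, run on text that has already been extracted from a page. It is not tied to itsy or Tika. The function name `find_identifiers`, the `window` parameter, and both regexes are my own assumptions for illustration; the phone pattern only covers common US-style formats and the email pattern is deliberately loose.

```python
import re

# Illustrative patterns only (assumptions, not validated against real crawl data).
# Loose email pattern: local part, "@", domain, dot, TLD of 2+ letters.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# US-style phone numbers such as 555-123-4567 or (555) 123-4567.
PHONE_RE = re.compile(r"(?<!\d)(?:\(\d{3}\)\s*|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)")


def find_identifiers(text, window=40):
    """Return (kind, value, context) tuples ordered by position in text.

    context is the matched value plus up to `window` characters on each side.
    """
    hits = []
    for kind, pattern in (("email", EMAIL_RE), ("phone", PHONE_RE)):
        for m in pattern.finditer(text):
            start = max(0, m.start() - window)
            end = min(len(text), m.end() + window)
            hits.append((m.start(), kind, m.group(), text[start:end]))
    hits.sort()  # order by where each match starts
    return [(kind, value, context) for _, kind, value, context in hits]


if __name__ == "__main__":
    sample = "Contact the clerk at clerk@example.gov or call (555) 123-4567 for records."
    for kind, value, context in find_identifiers(sample, window=15):
        print(kind, value, repr(context))
```

Keeping a fixed character window around each match is the simplest way to capture context. Splitting on sentences or DOM elements would probably give cleaner results, but it depends on how the page text was extracted.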