by bpchaps 3402 days ago
On the Clojure side of things, I recently used this [0] to scrape and parse ~4m pages in a few hours. It's very plug-and-play, but still pretty extensible. Parsing with Tika turned out to be extremely useful.

While we're on the topic, does anyone have any other recommendations for web crawlers? I'm particularly interested in finding unique identifiers (phone numbers, emails) and their contexts on gov-owned websites for a project.

[0] https://github.com/junjiemars/itsy

1 comments

Great crawler, thanks for sharing.
Agreed. It's probably saved me over 100 hours of work in the past two months.