| HN Mirror

by bpchaps 3402 days ago

On the Clojure side of things, I recently used this [0] to scrape/parse ~4m pages in a few hours. It's very plug-and-play, but maintains a pretty decent amount of extensibility. Parsing using Tika turned out to be extremely useful.

While we're on the topic, does anyone have any other recommendations for web crawlers? I'm particularly interested in finding unique identifiers (phone numbers, emails) and their contexts on gov-owned websites for a project.

[0] https://github.com/junjiemars/itsy
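For the "identifiers and their contexts" part, here is a rough sketch of what that could look like once the page text has already been extracted. It is in Python rather than Clojure, and it is not anything itsy or Tika provides. The regexes, the `find_identifiers` name, and the 40-character context window are all assumptions for illustration. The phone pattern only covers North American style numbers.

```python
import re

# Deliberately loose patterns (assumptions, not a standard):
# real-world email and phone formats vary far more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Matches forms like (312) 555-0100, 312-555-0100, 312.555.0100.
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-.]\d{4}\b")


def find_identifiers(text, window=40):
    """Return (kind, value, context) tuples in order of appearance.

    `context` is the match plus up to `window` characters on each side,
    with runs of whitespace collapsed to single spaces.
    """
    hits = []
    for kind, pattern in (("email", EMAIL_RE), ("phone", PHONE_RE)):
        for m in pattern.finditer(text):
            start = max(0, m.start() - window)
            end = min(len(text), m.end() + window)
            context = " ".join(text[start:end].split())
            hits.append((m.start(), kind, m.group(), context))
    hits.sort()  # order by position in the text
    return [(kind, value, context) for _, kind, value, context in hits]


if __name__ == "__main__":
    sample = "Contact the clerk at clerk@example.gov or call (312) 555-0100 for records."
    for kind, value, context in find_identifiers(sample):
        print(kind, value, "|", context)
```

Keeping the surrounding text next to each match makes it easier to tell afterwards whether a number is an office line, a fax, or just a false positive.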

1 comments

plantpark 3402 days ago

Great crawler, thanks for sharing.

bpchaps 3401 days ago

Agreed. It's probably saved me over 100 hours of work in the past two months.
