by ikinsey 1595 days ago
I've been developing web crawlers for the better part of a decade. I've used them for various purposes, such as cataloging sentiment/bias in news media, finding new TV shows to watch, and mapping out the Tor hidden service directory.

Currently, I am writing a web crawler application framework in Go. I'm always looking for help or new ideas on what to crawl next!

Emails welcome, check my profile.

3 comments

I posted this on Ask HN (https://news.ycombinator.com/item?id=30096235#30096410) a few days ago.

Perhaps your crawler could be customized to crawl and publish an index of all available Progressive Web Apps. A naive way would be to check for sites that have a PWA manifest file in their root folder.
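
To make the naive check concrete, here is a rough Go sketch. The file names `/manifest.json` and `/manifest.webmanifest` and the helper names (`manifestURLs`, `looksLikeManifest`, `hasPWAManifest`) are my own assumptions for illustration; many sites put the manifest elsewhere and point to it with a `<link rel="manifest">` tag, which this sketch would miss.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"time"
)

// candidatePaths are common root-level manifest file names. This list is an
// assumption; a site may serve its manifest from any path.
var candidatePaths = []string{"/manifest.json", "/manifest.webmanifest"}

// manifestURLs builds the root-level URLs to probe for a given site URL.
func manifestURLs(site string) ([]string, error) {
	base, err := url.Parse(site)
	if err != nil {
		return nil, err
	}
	if base.Scheme == "" || base.Host == "" {
		return nil, fmt.Errorf("site %q must include a scheme and host", site)
	}
	urls := make([]string, 0, len(candidatePaths))
	for _, p := range candidatePaths {
		urls = append(urls, base.Scheme+"://"+base.Host+p)
	}
	return urls, nil
}

// looksLikeManifest is a heuristic: the body must be a JSON object with a
// "name" or "short_name" member and a "start_url" member.
func looksLikeManifest(body []byte) bool {
	var m map[string]json.RawMessage
	if err := json.Unmarshal(body, &m); err != nil {
		return false
	}
	_, hasName := m["name"]
	_, hasShort := m["short_name"]
	_, hasStart := m["start_url"]
	return (hasName || hasShort) && hasStart
}

// hasPWAManifest probes each candidate URL and reports whether any of them
// returns something that looks like a manifest.
func hasPWAManifest(client *http.Client, site string) (bool, error) {
	urls, err := manifestURLs(site)
	if err != nil {
		return false, err
	}
	for _, u := range urls {
		resp, err := client.Get(u)
		if err != nil {
			continue // unreachable candidate; try the next one
		}
		// Cap the read at 1 MiB so a huge response cannot exhaust memory.
		body, readErr := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
		resp.Body.Close()
		if readErr == nil && resp.StatusCode == http.StatusOK && looksLikeManifest(body) {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: pwacheck <site-url>")
		return
	}
	client := &http.Client{Timeout: 10 * time.Second}
	found, err := hasPWAManifest(client, os.Args[1])
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("manifest found:", found)
}
```

The JSON check is deliberately loose. A stricter index would probably also want to confirm a service worker and HTTPS, but that goes beyond the naive approach.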

Let me know if you are interested in collaborating.

This looks very possible. It would only require two modules, one for analysis and one for frontier management. It would be great to collaborate on something like this!
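
For anyone unfamiliar with the term, the frontier is the queue of URLs still waiting to be crawled. Below is a minimal Go sketch of one, assuming a plain FIFO queue with de-duplication. It is not the design of the framework mentioned above, and the `Frontier` type and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// Frontier is a FIFO queue of URLs that remembers every URL it has accepted,
// so the same page is never queued twice.
type Frontier struct {
	queue []string
	seen  map[string]bool
}

// NewFrontier returns an empty frontier.
func NewFrontier() *Frontier {
	return &Frontier{seen: make(map[string]bool)}
}

// normalize lowercases the scheme and host and drops the fragment, so that
// trivially different spellings of one URL compare equal.
func normalize(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	if u.Scheme == "" || u.Host == "" {
		return "", fmt.Errorf("not an absolute URL: %q", raw)
	}
	u.Scheme = strings.ToLower(u.Scheme)
	u.Host = strings.ToLower(u.Host)
	u.Fragment = ""
	return u.String(), nil
}

// Add queues raw if it is a valid absolute URL that has not been seen before.
// It reports whether the URL was queued.
func (f *Frontier) Add(raw string) bool {
	key, err := normalize(raw)
	if err != nil || f.seen[key] {
		return false
	}
	f.seen[key] = true
	f.queue = append(f.queue, key)
	return true
}

// Next removes and returns the oldest queued URL. The second result is false
// when the frontier is empty.
func (f *Frontier) Next() (string, bool) {
	if len(f.queue) == 0 {
		return "", false
	}
	next := f.queue[0]
	f.queue = f.queue[1:]
	return next, true
}

func main() {
	f := NewFrontier()
	f.Add("https://example.com/a")
	f.Add("HTTPS://Example.com/a#top") // same page after normalization, skipped
	f.Add("https://example.com/b")
	for {
		u, ok := f.Next()
		if !ok {
			break
		}
		fmt.Println("crawl:", u)
	}
}
```

A real frontier would likely add per-host politeness delays, prioritization, and persistence; this only shows the queue-plus-seen-set core.
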
That's awesome. Thank you.

Please do think about it for the next couple of days and I will drop an email to you in the next 48 hours (I live in India). We can then decide how to take it forward.

Just curious: How long does an end-to-end crawl take, and how many resources does the process consume (hardware, memory, etc.)?

What's your opinion about YaCy: https://yacy.net?
YaCy is a great tool! Haven't used it all that extensively since 2012. Very good for setting up simple crawls with minimal configuration or for crawling intranets.
Any tips or resources for getting into web crawling?