|
I wrote a concurrent, super fast web crawler in Go for my job (~300 LOC) to pull data out of customer sites quickly, even when they have 1.5 million pages or more. You can filter pretty much everything, and in the end you get a .csv file with the links found for the given domain that match the filter, plus each link's source page, link number, link depth, timestamp, and HTTP status code (200, 404, etc.). Filters: number of concurrent HTTP(S) requests, max link number, max link depth, must-include path, must-include word(s), must-exclude word(s), local or global search (for links with a path; local means you only look for matching links on that page and on the pages found from it, instead of crawling the whole site), etc. It was my first Go project. I had always wanted to do multithreading, and Go made it so easy. I can't open-source the code because it's company property. But damn, is it fast if you let it run: one site didn't throttle me and I got up to 96 Mb/s (on my 100 Mb/s connection) with it set to 2000 connections per second. I DDoSed our office Wi-Fi a few times before I implemented a token bucket for rate limiting (and sometimes just for fun after that :>). |
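For anyone wondering what a token bucket in front of a pool of concurrent workers can look like in Go, here is a minimal sketch. It is not the crawler's actual code (that stays closed). The names `TokenBucket`, `NewTokenBucket`, `Allow`, `Wait`, and `RunLimited` are made up for illustration, and the job is just a callback where a real crawler would do its HTTP request.

```go
package main

import (
	"sync"
	"time"
)

// TokenBucket is a hypothetical minimal rate limiter. It holds at most
// `capacity` tokens and refills continuously at `rate` tokens per second.
type TokenBucket struct {
	mu       sync.Mutex
	capacity float64
	tokens   float64
	rate     float64
	last     time.Time
}

// NewTokenBucket returns a bucket that starts full.
func NewTokenBucket(rate float64, capacity int) *TokenBucket {
	return &TokenBucket{
		capacity: float64(capacity),
		tokens:   float64(capacity),
		rate:     rate,
		last:     time.Now(),
	}
}

// Allow refills the bucket for the time elapsed since the last call and
// takes one token if available. It never blocks.
func (b *TokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

// Wait blocks until a token could be taken, sleeping roughly one refill
// interval between attempts.
func (b *TokenBucket) Wait() {
	for !b.Allow() {
		time.Sleep(time.Duration(float64(time.Second) / b.rate))
	}
}

// RunLimited runs job(0..n-1) on `workers` goroutines. Each job waits for
// a token first, so job starts are bounded by the bucket's rate no matter
// how many workers there are.
func RunLimited(n, workers int, b *TokenBucket, job func(i int)) {
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				b.Wait()
				job(i)
			}
		}()
	}
	for i := 0; i < n; i++ {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
}
```

Something like `RunLimited(len(urls), 200, NewTokenBucket(2000, 100), fetch)` would then start at most about 2000 requests per second after the initial burst of 100, assuming `urls` and `fetch` exist on your side. For production use, the `golang.org/x/time/rate` package offers a ready-made limiter along the same lines.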
I can't believe that the open source options are still so few and far between in this area. There are TONS of great tools for building crawlers, and there are tons of great crawlers built for mirroring websites. But there are very few polished crawlers built simply for extracting metadata from pages and getting information about a site.