| HN Mirror


by Quatschmann 3544 days ago

I wrote a concurrent, super fast webcrawler for my job with Go (~300 LOC) to get data out of customer sites fast even when they have 1.5 million pages or more.

You can basically filter everything and get a .csv file at the end with the links for the given domain that fit the filter, plus, for each link, its source page, link number, link depth, timestamp, and HTTP status code (200, 404, etc.).

Filters: number of concurrent http(s) requests, max link number, max link depth, must include path, must include word(s), must exclude word(s), local or global search (for links with a path, local means you only search for matching links on that page and the pages found from it instead of crawling the whole website), etc.
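A rough idea of how the per-link filters could be expressed in Go. This is only a sketch under assumptions: the `Filter` type and its field names are made up, "must include path" is treated as a path prefix, and the include/exclude words are matched against the URL path (the post doesn't say whether they apply to the URL or the page content). Crawl-wide settings such as the concurrency limit and max link number are left out.

```go
package main

import (
	"fmt"
	"strings"
)

// Filter is a hypothetical per-link filter. All names are assumptions.
type Filter struct {
	MaxDepth    int      // 0 means no depth limit
	PathPrefix  string   // "must include path"; "" means any path
	MustInclude []string // every word must appear in the path
	MustExclude []string // none of these words may appear in the path
}

// Match reports whether a link with the given path and depth passes the filter.
func (f Filter) Match(path string, depth int) bool {
	if f.MaxDepth > 0 && depth > f.MaxDepth {
		return false
	}
	if f.PathPrefix != "" && !strings.HasPrefix(path, f.PathPrefix) {
		return false
	}
	for _, w := range f.MustInclude {
		if !strings.Contains(path, w) {
			return false
		}
	}
	for _, w := range f.MustExclude {
		if strings.Contains(path, w) {
			return false
		}
	}
	return true
}

func main() {
	f := Filter{
		MaxDepth:    2,
		PathPrefix:  "/blog",
		MustInclude: []string{"go"},
		MustExclude: []string{"draft"},
	}
	for _, p := range []string{"/blog/go-crawler", "/blog/go-draft", "/shop/go-crawler"} {
		fmt.Println(p, f.Match(p, 1))
	}
}
```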

It was my first Go project and I always wanted to do multithreading and Go made it so easy. Can't opensource the code because it's company property.

But damn is it fast if you let it run: one site didn't throttle me and I got up to 96 Mb/s (on my 100 Mb/s connection) with it set to 2000 connections per second.

DDoSed our office wifi a few times before I implemented a token bucket for rate limiting (and sometimes just for fun after that :>).

2 comments

jdc0589 3544 days ago

I did something very similar a few years ago, but in C#. I didn't do real rate limiting, just threading configuration and a configurable random sleep, but it got the job done. It was a super fun project to work on.

I can't believe that the open source options are still so few and far between in this area. There are TONS of great tools for building crawlers, and there are tons of great crawlers built for mirroring a copy of websites. But there are very few polished crawlers built for simply extracting metadata from pages and getting information about a site.


Quatschmann 3544 days ago

Yep, that's why I built my own: the existing ones don't give out a list of the links or are super slow. A co-worker made the first one in Python, but it was so slow that it took hours (6+ sometimes) to finish a site and I thought "you can do that faster".

Some sites are problematic because they don't use <a href="asd.com"> tags, and that's what my crawler looks for.
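To illustrate why such sites slip through, here is a small sketch of link extraction that only looks at the href attribute of <a> tags. It uses a regular expression to stay within the standard library. That is an assumption for the example, not necessarily how the crawler does it, and a real HTML parser would be more robust.

```go
package main

import (
	"fmt"
	"regexp"
)

// hrefRe matches the quoted href value of an <a> tag, case-insensitively.
// It requires whitespace before "href" so that attributes such as
// data-href are not picked up by mistake.
var hrefRe = regexp.MustCompile(`(?i)<a\s+(?:[^>]*?\s)?href\s*=\s*["']([^"']*)["']`)

// ExtractLinks returns the href values of all <a> tags found in html.
func ExtractLinks(html string) []string {
	var links []string
	for _, m := range hrefRe.FindAllStringSubmatch(html, -1) {
		links = append(links, m[1])
	}
	return links
}

func main() {
	page := `<a href="asd.com">one</a>
<a class="nav" href='/two'>two</a>
<span onclick="location='/three'">three</span>`
	// The onclick navigation to /three is not an <a href>, so it is missed.
	for _, l := range ExtractLinks(page) {
		fmt.Println(l)
	}
}
```

Links created by JavaScript, like the onclick handler above, never show up in the result, which is the limitation described here.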

C#, Elixir, and Rust were the other options I thought about, and I want to build the same crawler in these languages (relatively easy to do at ~300 LOC) to compare them for network / server / CLI stuff, but that has to wait till next year.


jdc0589 3542 days ago

The biggest headache with the C# implementation was the threading. A lot of the out-of-the-box threading structures (pools, etc.) have limitations you might not think about checking for; e.g. you can't set the number of threads lower than the CPU count on the machine with some of the official .NET ThreadPool helpers. You can try, but it will just silently ignore you.

There is some super useful stuff too, though, that made it easy to write a generic, extensible crawler. My implementation ended up supporting separately compiled plugins you could just dump in a 'plugins\' directory, which responded to events and had full ability to manipulate the output pipeline. Doable in lots of languages, but C# has some formalized helpers around it that make it super easy.


thro1237 3543 days ago

Is it on GitHub?
