Hacker News new | ask | show | jobs
by jdc0589 3501 days ago
I did something very similar a few years ago, but in c#. I didn't do real rate limiting, just threading configuration and a configurable random sleep; but it got the job done. It was a super fun project to work on.

I can't believe that the open source options are so still so few and far between in this area. There are TONS of great tools for building crawlers, and there are tons of great crawlers built for mirroring a copy of websites. But, there are very few polished crawlers built for simply extracting metadata from pages and getting information about a site.

1 comments

Yep, that's why I've build my own, the existing ones don't give out a list of the links or are super slow. A co-worker made the first one in Python but it was so slow that it took hours (6+ sometimes) to finish a site and I thought "you can do that faster".

Problematic are some sites that don't use <a href="asd.com"> tags because that's what my crawler is looking for.

C# & Elixier & Rust where the the other options I thought about and I want to build the same crawler on these languages (relative easy to do with ~300 LOC) to compare them for network / server / cli stuff but that has to wait till next year.

the biggest headache with the c# implementation was the threading. A lot of the out-of-the-box threading structures (pools, etc...) have limitations you might not think about checking for; e.g. you can't set the number of threads lower than the CPU count on the machine with some of the official .net threadPool helpers; you can try, but it will just silently ignore you.

There is some super useful stuff too though that made it easy to write a generic extensible crawler. My implementation ended up supporting separately compiled plugins you could just dump in a 'plugins\' directory, which responded to events and had full ability to manipulate the output pipeline. Do-able in lots of languages, but c# has some formalized helpers around it that make it super easy.