by jdc0589
3501 days ago
I did something very similar a few years ago, but in C#. I didn't do real rate limiting, just a configurable thread count and a configurable random sleep, but it got the job done. It was a super fun project to work on. I can't believe that the open source options are still so few and far between in this area. There are TONS of great tools for building crawlers, and there are tons of great crawlers built for mirroring a copy of websites. But there are very few polished crawlers built for simply extracting metadata from pages and getting information about a site.
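For anyone curious what "thread count plus random sleep" can look like, here is a rough sketch in Python (not the original C# code, which isn't shown here). The names `crawl_politely`, `fetch`, `workers`, `min_sleep` and `max_sleep` are all made up for illustration, and `fetch` is whatever function you use to download a page.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def crawl_politely(urls, fetch, workers=4, min_sleep=0.5, max_sleep=2.0,
                   sleep=time.sleep):
    """Fetch each URL on a small thread pool, pausing a random interval first.

    This is not real rate limiting: it only caps the number of concurrent
    workers and spreads requests out with a random delay per request.
    `fetch` is a caller-supplied function (hypothetical) that takes a URL.
    """
    def task(url):
        # Random delay in [min_sleep, max_sleep] seconds before each request.
        sleep(random.uniform(min_sleep, max_sleep))
        return url, fetch(url)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order; collect the results as url -> response.
        return dict(pool.map(task, urls))
```

The `sleep` parameter is only there so the delay can be swapped out when testing; in normal use you would leave it at `time.sleep`.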
Some sites are problematic because they don't use <a href="asd.com"> tags for their links, and those tags are what my crawler looks for.
C#, Elixir, and Rust were the other options I thought about, and I want to build the same crawler in those languages (relatively easy to do with ~300 LOC) to compare them for network / server / CLI stuff, but that has to wait until next year.
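To illustrate the <a href> point, here is a minimal sketch of that kind of link extraction using only the Python standard library. It is an assumption about the general approach, not the actual crawler code, and `LinkExtractor` and `extract_links` are hypothetical names. Anything that navigates via JavaScript handlers instead of anchor tags is simply never seen.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags are considered; links built by scripts are missed.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs of all <a href> links found in `html`."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

For example, a page whose navigation is a `<div onclick="...">` yields no links at all from this extractor, which is exactly the problem case described above.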