Hacker News
by mr_ndrsn 1041 days ago
This looks very cool!

Please consider adding a user agent string with a link to the repo or some Google-able name to your curl call, it can help site operators get in touch with you if it starts to misbehave somehow.
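
For anyone wiring this up outside of curl (where the `-A` flag sets the header), here is a minimal sketch of an identifying User-Agent using Python's standard library. The project name, version, and repo URL are hypothetical placeholders, not anything taken from the project under discussion.

```python
import urllib.request

# Hypothetical name and contact URL; replace with your own project and repo.
USER_AGENT = "example-scraper/0.1 (+https://example.com/your/repo)"


def build_request(url: str) -> urllib.request.Request:
    """Return a GET request that identifies the scraper to site operators."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


if __name__ == "__main__":
    req = build_request("https://example.com/data.json")
    # urllib stores header names in capitalized form ("User-agent").
    print(req.get_header("User-agent"))
```

Putting a URL in the string gives an operator somewhere to go if the scraper starts misbehaving.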

2 comments

It's tough when there's a cat-and-mouse game of spoofing your UA so you don't get blocked. I wish webmasters had better relationships with scrapers and could accept the reality that their data will be scraped no matter how hard they try to stop it.
IMO, we should really just get rid of the user agent header altogether.
Yeah, that's a good idea. I need to add that to my suggestions for how to implement this.
If you're scraping any significant amount of data (>500K), and depending on the frequency, you might also want to send conditional requests based on the server's ETag/Cache-Control headers, as well as an Accept-Encoding header, to save server bandwidth.
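
A rough sketch of what that could look like with Python's standard library, assuming the server returns an `ETag` and honors `If-None-Match`. The function names `conditional_headers` and `fetch` are made up for illustration, and only gzip is handled.

```python
import gzip
import urllib.error
import urllib.request
from typing import Dict, Optional, Tuple


def conditional_headers(etag: Optional[str]) -> Dict[str, str]:
    """Build request headers: always ask for gzip, revalidate if we have an ETag."""
    headers = {"Accept-Encoding": "gzip"}
    if etag:
        headers["If-None-Match"] = etag
    return headers


def fetch(url: str, etag: Optional[str] = None) -> Tuple[Optional[bytes], Optional[str]]:
    """Return (body, etag). body is None when the server answers 304 Not Modified."""
    req = urllib.request.Request(url, headers=conditional_headers(etag))
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            body = resp.read()
            # urllib does not decompress automatically, so do it here.
            if resp.headers.get("Content-Encoding") == "gzip":
                body = gzip.decompress(body)
            return body, resp.headers.get("ETag")
    except urllib.error.HTTPError as err:
        # urllib surfaces 304 as an HTTPError; treat it as "nothing changed".
        if err.code == 304:
            return None, etag
        raise
```

Store the returned ETag between runs and pass it back in on the next call; when nothing has changed, the server sends only headers instead of the full body.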

Collecting 1 kB every minute might not be a big deal, but collecting 1 MB every minute would cost an AWS-hosted service >$40/year in additional data transfer costs.
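
The arithmetic behind that estimate, as a quick sketch. The $0.09/GB egress rate is an assumption (a commonly quoted AWS internet data-transfer rate, not a figure from this thread); actual pricing varies by region and volume tier.

```python
MINUTES_PER_YEAR = 60 * 24 * 365  # 525,600 minutes

# Assumed egress price in USD per GB; check current pricing for your region.
ASSUMED_USD_PER_GB = 0.09


def yearly_transfer_cost(bytes_per_minute: float,
                         usd_per_gb: float = ASSUMED_USD_PER_GB) -> float:
    """Estimate the yearly egress cost of serving one scrape per minute."""
    gb_per_year = bytes_per_minute * MINUTES_PER_YEAR / 1e9
    return gb_per_year * usd_per_gb


if __name__ == "__main__":
    print(f"1 kB/min: ${yearly_transfer_cost(1_000):.2f}/year")
    print(f"1 MB/min: ${yearly_transfer_cost(1_000_000):.2f}/year")
```

At the assumed rate, 1 MB per minute is about 525.6 GB per year, or roughly $47, which is consistent with the ">$40/year" figure; 1 kB per minute comes to about five cents.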

It should definitely be optional. I can only imagine some busybody PM insisting they block harmless scrapes.