HN Mirror



	by wenbert 3426 days ago
	"Distributed" as in using proxies? Where do the requests come from when I scrape a page?


Nimsical 3426 days ago

I guess it's not truly distributed in that sense. StdLib uses AWS Lambda, which has widely varying IPs, and I believe it's multi-region.

I haven't hit a wall or been caught while scraping. But then again, I haven't done it at a 10k pages/sec rate or anything like that.


wenbert 3426 days ago

Thank you! This is very interesting.

I have wanted to implement something like this, i.e., Lambda doing the downloading of the page itself.
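A minimal sketch of that idea, assuming a Python function invoked Lambda-style with an `event` dict carrying a `url` key. The `fetch` helper, the `fetch_fn` parameter, the User-Agent string, and the response shape are all hypothetical choices for illustration, not how StdLib or the scraper discussed here actually works:

```python
import json
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Download a page and return (status_code, body_text)."""
    # Hypothetical User-Agent; a real scraper would pick its own.
    req = urllib.request.Request(url, headers={"User-Agent": "example-scraper/0.1"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; report the code, not a crash.
        # Network-level failures (urllib.error.URLError) are left to propagate.
        return err.code, ""


def handler(event, context=None, fetch_fn=fetch):
    """Lambda-style entry point: the function itself downloads the page."""
    url = event.get("url")
    if not url:
        return {"statusCode": 400, "body": json.dumps({"error": "missing 'url'"})}
    status, body = fetch_fn(url)
    # Return only a summary; a real function might parse or store the body.
    return {
        "statusCode": 200,
        "body": json.dumps({"url": url, "status": status, "length": len(body)}),
    }
```

Because the download happens inside the function, the request leaves from whatever IP the function's host happens to have, which is the only sense in which this setup is "distributed".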

I have wondered how it would work with very strict sites like Yelp, whose limits are similar to what you would get from their API (so it doesn't make sense not to use their API).

What are your stats like, if you don't mind? How many people are using it, and how many are getting blocked (404 or 500 after 1000 requests, etc.)?

Edit: Is it possible to use my own credentials for AWS?


Nimsical 3426 days ago

I don't track that information on the function – but I really should!
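A rough sketch of what such tracking could look like, assuming each scrape records the HTTP status it received. The `summarize` function and the default set of "blocked" statuses are assumptions for illustration; which codes actually signal a block varies by site (the question above mentions 404 and 500, for instance):

```python
from collections import Counter

# Assumption: statuses commonly associated with blocking or rate limiting.
DEFAULT_BLOCK_STATUSES = frozenset({403, 429, 503})


def summarize(statuses, block_statuses=DEFAULT_BLOCK_STATUSES):
    """Tally HTTP status codes and estimate the share of blocked requests."""
    counts = Counter(statuses)
    total = sum(counts.values())
    blocked = sum(n for code, n in counts.items() if code in block_statuses)
    return {
        "total": total,
        "blocked": blocked,
        # Avoid dividing by zero when nothing has been scraped yet.
        "block_rate": blocked / total if total else 0.0,
        "by_status": dict(counts),
    }
```

Passing a different `block_statuses` set, e.g. `{404, 500}`, adapts the tally to a site that signals blocks with other codes.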

Based on StdLib's dashboards – a bunch of folks use it each month, at a steady pace of a few hundred scrapes a day.

We've been using it internally for quite a while now.

And as far as I know, StdLib doesn't allow you to use your own AWS credentials. They have their own gateway and a bunch of stuff on top of Lambda that makes the whole experience a lot easier and more powerful (e.g. a 128MB limit on payload vs 5MB for Lambda).


AznHisoka 3426 days ago

Most popular sites just block any IPs in AWS, which is why proxies are the only reasonable choice for scraping.
