by twstdroot 4512 days ago
Speaking from experience (https://github.com/bryanbrannigan/pastebin-parser): if you are just grabbing the pastes from the "latest" box, you are missing a lot. To grab everything, we actually had to create a distributed setup, or else Pastebin would start banning our IPs.
3 comments

I got around this with @dumpmon by simply playing nice with Pastebin. I discovered what limits they liked/didn't like, and adjusted accordingly.

I take my entries from the archive as they are available, and I don't believe I ever miss any.
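For anyone wondering what "playing nice" could look like in code, here is a minimal sketch. It is not how @dumpmon actually does it: the class name, the `min_interval` parameter, and the idea of tracking seen paste IDs are all assumptions for illustration, and the right interval is something you would have to find out yourself. It only covers the pacing and de-duplication; the actual fetch is left to the caller.

```python
import time


class PolitePoller:
    """Hypothetical helper: spaces out requests and remembers seen paste IDs."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        # min_interval is in seconds; the value that keeps the site happy
        # is an assumption you have to discover empirically.
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None
        self._seen = set()

    def wait_turn(self):
        """Block until at least min_interval has passed since the last call."""
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()

    def new_ids(self, listed_ids):
        """Return only the IDs not seen before, preserving listing order."""
        fresh = []
        for paste_id in listed_ids:
            if paste_id not in self._seen:
                self._seen.add(paste_id)
                fresh.append(paste_id)
        return fresh
```

You would call `wait_turn()` before every request and pass each archive listing through `new_ids()` so nothing gets fetched twice.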

I ran into that issue myself. Pastebin throttling is real. I was playing around with the idea of using the SOCKS5 proxies gathered through scraping in order to retain modularity (and eliminate the need for multi-IP setups, which could easily get pricey).

It would be tough because I would have to check the health of each proxy prior to use (so that I don't miss out on request windows), but still an interesting concept to consider.
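A rough sketch of the health check, assuming "healthy" just means the proxy answers the SOCKS5 no-authentication greeting from RFC 1928. The function names and the 3 second timeout are made up for illustration, and a proxy that passes this can still fail on the real request, so treat it as a cheap pre-filter only.

```python
import socket


def socks5_alive(host, port, timeout=3.0):
    """Return True if host:port accepts a SOCKS5 no-auth greeting."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # Version 5, one method offered, method 0x00 (no authentication).
            sock.sendall(b"\x05\x01\x00")
            # A healthy proxy replies with version 5 and the chosen method 0x00.
            # A short or different reply is treated as unhealthy.
            return sock.recv(2) == b"\x05\x00"
    except OSError:
        # Covers refused connections, timeouts, and resets.
        return False


def healthy_proxies(candidates, check=socks5_alive):
    """Filter (host, port) pairs down to the ones that pass the check."""
    return [(host, port) for host, port in candidates if check(host, port)]
```

Checking right before use costs one extra round trip per proxy, which is the trade-off mentioned above: it eats into the request window but avoids wasting it on a dead proxy.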

Cheap VPS boxes from lowendbox.com work well for this purpose. We also had problems just processing the queue on busy days.
Distribution is what I ended up doing. :)
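One way the distribution could be split up, as a sketch only: hash each paste ID to pick a box, so every paste is fetched by exactly one worker and the queue is divided roughly evenly. Nobody in this thread said they did it this way; `worker_for`, `split_queue`, and the worker count are hypothetical.

```python
import hashlib


def worker_for(paste_id, n_workers):
    """Map a paste ID to a worker index in a stable, repeatable way."""
    digest = hashlib.sha256(paste_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then reduce it
    # to a worker index. The same ID always lands on the same worker.
    return int.from_bytes(digest[:8], "big") % n_workers


def split_queue(paste_ids, n_workers):
    """Partition a list of paste IDs into one sub-queue per worker."""
    shards = [[] for _ in range(n_workers)]
    for paste_id in paste_ids:
        shards[worker_for(paste_id, n_workers)].append(paste_id)
    return shards
```

A stable hash is used instead of Python's built-in `hash()` because the latter is randomized per process for strings, so separate boxes would disagree on who owns which paste.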