by mNovak 557 days ago
"The data collection process involved a daily ritual of manually visiting the Tagesschau website to capture links"

I don't know what to say... I'm amazed they kept this up so long, but this really should never have been the game plan.

I also had some data science hobby projects around COVID; I got busy and lost interest after 6 months. But the scrapers keep running in the cloud in case I get motivated again (anyone need structured data on eBay listings for laptops since 2020?). That's the beauty of automation for these sorts of things.
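
For illustration, the "capture links" step from the quote could be automated along these lines. This is only a rough sketch using the Python standard library, not what the original authors or I actually run; the names LinkCollector, extract_links and fetch_links are made up for the example, and any URL passed to fetch_links is up to the caller.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors without an href (value is None or empty).
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return absolute links found in html, de-duplicated, in page order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return list(dict.fromkeys(parser.links))


def fetch_links(url):
    """Download one page and return its links (hypothetical helper)."""
    with urlopen(url, timeout=30) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return extract_links(html, url)
```

Run once a day from cron or a scheduled cloud function and append the result to a file, and the daily ritual goes away. A real scraper would also want retries and a check of the site's terms.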

Do you just pay the bill for the resources indefinitely?
I'm not the person you're asking, but I maintain a number of scraping projects. The bills are negligible for almost everything. A single $3/mo VPS can easily handle 1M QPS (enough for all the small projects put together), and most of these projects only accumulate O(10GB)/yr.

Doing something like grabbing hourly updates of the inventory of every item in every Target store is a bit more involved, and you'll rapidly accumulate proxy/IP/storage/... costs, but 99% of these projects have more valuable data at a lesser scale, and it's absolutely worth continuing them on average.

Inbound data is typically free on cloud VMs. CPU/RAM usage is also small unless you use chromedriver and scrape using an entire browser with graphics rendered on CPU. We're talking $5/mo for most scraping projects.
I'm paying < $0.50 a month, and that's primarily driven by S3. For the scraping itself I'm using Lambda, with maybe minutes of runtime per day.
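
A back-of-the-envelope check of numbers like these, as a sketch only. The unit prices below are assumptions (roughly AWS list prices for S3 Standard and Lambda; check current pricing for your region), free tiers and request charges are ignored, and the function names are made up for the example.

```python
# ASSUMED unit prices, not taken from the thread.
S3_USD_PER_GB_MONTH = 0.023
LAMBDA_USD_PER_GB_SECOND = 0.0000166667


def monthly_storage_cost(gb_per_year, months_elapsed):
    """Monthly S3 bill once data has accumulated for months_elapsed months."""
    stored_gb = gb_per_year * months_elapsed / 12
    return stored_gb * S3_USD_PER_GB_MONTH


def monthly_lambda_cost(minutes_per_day, memory_gb=0.5, days=30):
    """Monthly Lambda compute bill for a job running minutes_per_day."""
    gb_seconds = minutes_per_day * 60 * memory_gb * days
    return gb_seconds * LAMBDA_USD_PER_GB_SECOND


if __name__ == "__main__":
    # 10 GB/yr of scraped data after one year, plus 5 min/day of Lambda.
    total = monthly_storage_cost(10, 12) + monthly_lambda_cost(5)
    print(f"estimated monthly cost: ${total:.2f}")
```

Under those assumed prices, 10 GB stored costs about $0.23 a month, and the storage term keeps growing while the compute term stays flat, which is consistent with S3 being the main driver.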