by fogx 378 days ago
Especially for image data libraries, why not provide the images as a dump instead? There's no need to crawl 3 million images if the download button is right there. Now put the file on a CDN or Google and you're golden.
2 comments

There are two immediate issues I see with that. First, you'll end up with bots downloading the dump over and over again. Second, for non-trivial amounts of data, you'll end up paying the CDN for bandwidth anyway.

I work on the kind of big online scientific database that this article is about.

100% of our data is available from a clearly marked "Download" page.

We still have scraper bots running through the whole site constantly.

We are not "golden".