by httparchive 846 days ago
I used this data when I was a grad student, back when there was no fee for it, so I'm mostly concerned that students will get hit with charges large enough that they can't buy groceries.

The website has the Internet Archive logo on it, and it looks like a public resource for researchers, and it used to be free to use.

The point of this post is to get the HTTP Archive to make it clear that this is a paid product from Google Cloud, not a "public service".

4 comments

That is pretty clearly documented in the setup instructions.

https://github.com/HTTPArchive/httparchive.org/blob/main/doc...

There are multiple notes about cost. In particular, this one stands out.

> Note: The size of the tables you query are important because BigQuery is billed based on the number of processed data. There is 1TB of processed data included in the free tier, so running a full scan query on one of the larger tables can easily eat up your quota. This is where it becomes important to design queries that process only the data you wish to explore
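To put rough numbers on that note, here is a minimal sketch of the arithmetic. It assumes a flat on-demand price of $5 per TB processed (an assumption for illustration; check current pricing) and the 1TB free tier mentioned in the note. The function name and defaults are made up for this example.

```python
TB = 10**12  # bytes; assumes decimal terabytes (actual billing units may differ)


def estimate_query_cost(bytes_processed, free_bytes_left=1 * TB, usd_per_tb=5.0):
    """Estimate the USD charge for one query.

    free_bytes_left: unused free-tier quota (1TB per the note quoted above).
    usd_per_tb: ASSUMED on-demand price, not an official figure.
    """
    # Only bytes beyond the remaining free quota are billable.
    billable = max(0, bytes_processed - free_bytes_left)
    return billable / TB * usd_per_tb


if __name__ == "__main__":
    for scanned_tb in (0.5, 1, 10, 100):
        cost = estimate_query_cost(scanned_tb * TB)
        print(f"{scanned_tb} TB scanned -> about ${cost:,.2f}")
```

Under these assumed numbers, one full scan of a 10TB table already costs money, and a five-figure bill corresponds to a few thousand TB processed in total, which a handful of repeated full scans on the largest tables could plausibly reach.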

Yeah, but "eat up your quota" seems rather tame, whereas "you can get billed $14k with no warning" is the truth.
It can be confusing, since the httparchive itself is provided for free by AWS S3 (where AWS is the one footing the bill).
So, you gave someone your credit card number without understanding how or what they were going to charge you for?
You have to give them a credit card in order to use the free tier, and they refuse to implement any features that would let you add safeguards (beyond setting an alert so you can find out after you've already spent the money).

Edit: I apologize; they did in fact add something beyond alerts: https://cloud.google.com/billing/docs/how-to/notify#cap_disa... ...which is less them implementing a feature and more telling you how to badly implement it yourself. I don't believe this changes the gist of my comment, but it is worth pointing out in the interest of precision.

Edit 2: Per https://news.ycombinator.com/item?id=39447499 , GCP actually does have a way to cap some resources. It still strikes me as the most "how can we technically claim to be supporting that feature request while still making it as easy as possible to spend more money than you intended to" approach, but there it is.
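For what it's worth, a client-side guard is easy to sketch yourself. This assumes you can get a byte estimate before actually running a query (BigQuery's dry-run mode reports one). Everything named here is hypothetical: `estimate_bytes` and `execute` are stand-in callables you would wire to your own client, not real API calls.

```python
class QueryTooExpensive(Exception):
    """Raised when a query's estimated scan size exceeds the cap."""


def run_with_cap(sql, estimate_bytes, execute, max_bytes):
    """Run `sql` only if its estimated bytes processed is within `max_bytes`.

    estimate_bytes: hypothetical callable, sql -> estimated bytes processed
                    (for example, backed by a dry run).
    execute:        hypothetical callable, sql -> query result.
    """
    estimated = estimate_bytes(sql)
    if estimated > max_bytes:
        # Refuse before any billable work happens.
        raise QueryTooExpensive(
            f"estimated {estimated} bytes exceeds cap of {max_bytes} bytes"
        )
    return execute(sql)


if __name__ == "__main__":
    # Fake estimator and executor, for demonstration only.
    fake_estimate = lambda sql: 5 * 10**12 if "SELECT *" in sql else 10**9
    fake_execute = lambda sql: "rows..."

    print(run_with_cap("SELECT url FROM t LIMIT 10", fake_estimate, fake_execute, 10**11))
    try:
        run_with_cap("SELECT * FROM t", fake_estimate, fake_execute, 10**11)
    except QueryTooExpensive as err:
        print("refused:", err)
```

This only protects queries that go through the wrapper, so it is a seatbelt rather than a real billing cap.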

Welcome to the cloud.

There are countless companies that specialise in managing cloud costs because it is so difficult to know when and for what you are going to be charged, especially for things like data transfer.

And by default they don't have a daily spending limit so it's very easy to see a major cost over-run at the beginning.

the data is a public service. the platform allowing you to query it is not.

you can print at a public library. each page costs a small amount. the printer in that case is the service. and if you print out millions of pages, you may owe hundreds of thousands of dollars.

slow down a bit, lest you blow off your other foot.