by darth_avocado 844 days ago
I am sorry, but this seems to be more of a “TLDR; didn’t read;” situation. The HTTP Archive clearly mentions that the data is available for offline processing or for querying online on BQ. And the “Getting started” section of the instructions mentions multiple times how BQ will charge you. And even if it wasn’t mentioned anywhere, it’s a little presumptuous to assume a tool for processing data will not charge you money for literally processing TBs of data again and again.

> Note: BigQuery has a free tier that you can use to get started without enabling billing. At the time of this writing, the free tier allows 10GB of storage and 1TB of data processing per month. Google also provides a $300 credit for new accounts.

> Note: The size of the tables you query are important because BigQuery is billed based on the number of processed data. There is 1TB of processed data included in the free tier, so running a full scan query on one of the larger tables can easily eat up your quota. This is where it becomes important to design queries that process only the data you wish to explore

> When we look at the results of this, you can see how much data was processed during this query. Writing efficient queries limits the number of bytes processed - which is helpful since that's how BigQuery is billed. Note: There is 1TB free per month

https://github.com/HTTPArchive/httparchive.org/blob/main/doc...
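
To put rough numbers on "billed by data processed", here is a minimal arithmetic sketch. Everything in it is an assumption for illustration: the function name `estimate_query_cost` is made up, the $5 per TB price and the 2.5 TB scan size are hypothetical placeholders (check the current BigQuery pricing page and the actual table sizes), and only the 1 TB free allowance comes from the notes quoted above.

```python
TB = 10**12  # decimal terabyte; an assumption, the billing unit may differ


def estimate_query_cost(bytes_processed, price_per_tb_usd, free_bytes_remaining=0):
    """Return (cost_usd, free_bytes_left) for one on-demand query.

    Bytes covered by the remaining free allowance cost nothing;
    everything beyond it is billed at price_per_tb_usd.
    """
    billable = max(0, bytes_processed - free_bytes_remaining)
    free_left = max(0, free_bytes_remaining - bytes_processed)
    return billable / TB * price_per_tb_usd, free_left


# Hypothetical scenario: the same full scan repeated 100 times.
free_left = 1 * TB           # 1 TB free per month, per the quoted docs
scan_bytes = 2500 * 10**9    # hypothetical 2.5 TB scanned per query
total_usd = 0.0
for _ in range(100):
    cost, free_left = estimate_query_cost(scan_bytes, 5.0, free_left)
    total_usd += cost

print(f"estimated total: ${total_usd:,.2f}")
```

With these made-up inputs the free allowance is gone partway through the first query, and every repeat after that is billed in full.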

4 comments

This comment reminds me of unsafe pedestrian crosswalks in car-centric cities.

Sure, a crosswalk may have an extensive system to warn drivers of pedestrians, but that doesn't change the fact a driver hits a pedestrian there at least once a month. It only has to happen once to ruin someone's life.

For cloud providers, the obvious solution is hard budget limits. Ask people to set a hard budget limit before they get the opportunity to drown themselves in debt. Take some workload off the support team in the process.

Hard budget limits change the process so that these charges are avoided almost entirely. Warnings only tell a few people that the provider is aware the process lands people in debt, and ask them to please use the broken process correctly to avoid the severe financial consequences.
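
For what it's worth, the difference between a warning and a hard limit is easy to sketch. This is a toy illustration, not any provider's actual mechanism: `HardBudget` and `BudgetExceeded` are hypothetical names, and the $300 cap and $12.50 per-query cost are made-up numbers.

```python
class BudgetExceeded(Exception):
    """Raised when a charge would push spending past the hard limit."""


class HardBudget:
    """Toy spend tracker that refuses work instead of just warning."""

    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd):
        # Check before spending: the charge is rejected outright,
        # so the total can never exceed the limit.
        if self.spent_usd + cost_usd > self.limit_usd:
            raise BudgetExceeded(
                f"charge of ${cost_usd:.2f} would exceed the ${self.limit_usd:.2f} limit"
            )
        self.spent_usd += cost_usd


budget = HardBudget(limit_usd=300.0)  # hypothetical cap
queries_run = 0
try:
    while True:
        budget.charge(12.5)  # hypothetical cost per query
        queries_run += 1
except BudgetExceeded as err:
    print(f"stopped after {queries_run} queries: {err}")
```

The point of the design is that the check happens before the money is spent, so the worst case is a failed query rather than a surprise invoice.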

Yes, sure, there's stuff I could have done better, like staying up all night reading the fine print. But that's not the point - this is a *warning* to other people who see the Internet Archive logo, the words "public", and for some dumb reason also trust Google. I'm hoping this doesn't happen to others; I learned a costly lesson.
I'm on OP's side - even if I knew I'd be paying to run some queries against this dataset, I never would have thought it could reach 5 figures in such a short time. And you can't argue that the billing is straightforward. The "Getting Started" guide for the HTTP Archive doesn't even describe what indexes are available/commonly used for limiting the scanned rows.
If Google provides a credit limited to $300 for new accounts, then it has the ability to limit spend.

It should make this available.

To be fair: I'm sure the reason they don't provide this limit is not to make money, because this is a rare case, but to avoid the far more common case of an established business going offline because someone forgot to update a limit.