Hacker News
by nothttparchive 845 days ago
This website makes it seem like this “public” dataset is for the community to use, but it is instead a for-profit money maker for Google Cloud and you can lose tens of thousands of dollars.

Last week I ran a script on BigQuery for historical HTTP Archive data and was billed $14,000 by Google Cloud with zero warning whatsoever, and they won’t remove the fee.

This official website should be updated to warn people Google is apparently now hosting this dataset to make money. I don’t think that was the original mission, but that’s what it is today, there’s basically zero customer support, and you can lose $14k in the blink of an eye.

Academics, especially grad students, need to be aware of this before they give a credit card number to Google. In fact, I’d caution against using this dataset whatsoever with this new business model attached.

5 comments

The real issue here is that you didn't quite understand what BigQuery was when you pressed the button.

What it is, roughly, is a publicly-accessible data supercomputer. If you lost $14k in the blink of an eye, then I would think you consumed at least $4k of Google's actual resources -- maybe $7k. Maybe more. That thing can move some serious data, and you apparently moved around over 2 PB.

Google bears some significant responsibility for not making the cost transparent to you, it's true. But on the other hand, don't they deserve some significant credit for making such an awesome power available to a lowly peon with a credit card?

This happens because Google hides the query cost behind its abstracted "TBs scanned" metric (measured in their own data format, which isn't even open source, so it's hard to estimate in advance) or, even worse, its "slots" mechanism. Only a fraction of people try to understand how much these slots cost, and most of them are people who got an unexpected bill after using BigQuery and then became more aware of how the product works.

If GCP returned the query cost in the API and showed it directly in the console when you run a query, it would be much easier for their users, but unfortunately that's not in Google's interest, for obvious reasons.

Exactly, even after seeing the issue I can't make heads or tails of what the hell a "TBs scanned" is relative to row counts, etc. Likewise, it seems to place a lot of assumptions on knowing what tables include - and on a dataset you didn't build yourself how can you know the tables are optimized to lower your costs? Hell, how can you even know what the costs are?

"TBs scanned" is the number of tebibytes of stored data that the system had to scan to serve your query. This is how BQ is billed, in the on-demand model.

The console shows you this number (in very small letters) after you have entered the query but before you press go. In the on-demand billing model, which is what you were using, you can multiply this number by $6.25 to understand your query cost, exactly.

It's a design that's hostile to new customers, I agree. But it is comprehensible.
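
For anyone who wants the multiplication above spelled out, here is a minimal sketch. It assumes the $6.25 per TiB on-demand rate quoted in this comment (rates vary by region and over time, so check current pricing), and the function names are made up for illustration. It also runs the arithmetic backwards for the $14,000 bill in the post.

```python
TIB = 2**40  # bytes in one tebibyte

# Assumption: the on-demand rate quoted in this thread; verify against current pricing.
ON_DEMAND_USD_PER_TIB = 6.25


def on_demand_cost_usd(bytes_scanned: int, usd_per_tib: float = ON_DEMAND_USD_PER_TIB) -> float:
    """Cost of one on-demand query, given the bytes-scanned estimate from the console."""
    return bytes_scanned / TIB * usd_per_tib


def tib_for_cost(cost_usd: float, usd_per_tib: float = ON_DEMAND_USD_PER_TIB) -> float:
    """Inverse: how many TiB must have been scanned to produce a given bill."""
    return cost_usd / usd_per_tib


if __name__ == "__main__":
    # A query the console estimates at 300 TiB:
    print(on_demand_cost_usd(300 * TIB))  # 1875.0
    # The $14,000 bill from the original post, at this rate:
    print(tib_for_cost(14_000))  # 2240.0 TiB, i.e. a bit over 2 PiB
```

At that rate, the bill in the post works out to roughly 2,240 TiB scanned, which lines up with the "over 2 PB" figure mentioned upthread.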

There should be a cost estimate displayed prominently by default, and an option to turn it off for power users who know what they're doing (but keep the current less-prominently displayed amount of data estimate).

Do you run httparchive, or did you make your username "httparchive" just because it's the subject of your post?
+1

If the latter... I'm not sure that it's explicitly against the rules, but co-opting the name of something as your handle just to complain about it is in poor taste and probably should be.

> The worst part is you posting this to hackernews under the username ‘httparchive’ to make it look like it was the httparchive posting this themselves.

This was the last comment in TFA, so it seems like they just used it because it was the topic...

Dang changed the username to "nothttparchive"

https://news.ycombinator.com/item?id=39451976

Did the cost estimate calculator provide an inaccurate estimate?

https://cloud.google.com/bigquery/docs/best-practices-costs

Estimate query costs

BigQuery provides various methods to estimate cost:

- Use the query dry run option to estimate costs before running a query using the on-demand pricing model.
- Calculate the number of bytes processed by various types of query.
- Get the monthly cost based on projected usage by using the Google Cloud Pricing Calculator.
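
A rough sketch of what the dry run option mentioned above looks like from the `google-cloud-bigquery` Python client, in case it helps. The helper names, the placeholder table in the usage comment, and the $6.25 per TiB rate (taken from elsewhere in this thread) are assumptions, not anything from the quoted docs. `dry_run` and `maximum_bytes_billed` are settings on `QueryJobConfig`; the second one is an extra safeguard that, as I understand it, makes a query fail instead of billing past the cap.

```python
TIB = 2**40  # bytes in one tebibyte
USD_PER_TIB = 6.25  # assumption: on-demand rate quoted elsewhere in this thread


def bytes_to_usd(total_bytes: int, usd_per_tib: float = USD_PER_TIB) -> float:
    """Convert a bytes-processed estimate into an approximate on-demand cost."""
    return total_bytes / TIB * usd_per_tib


def usd_to_bytes(max_usd: float, usd_per_tib: float = USD_PER_TIB) -> int:
    """Convert a spending limit into a byte limit for maximum_bytes_billed."""
    return int(max_usd / usd_per_tib * TIB)


def dry_run_estimate(client, sql: str):
    """Ask BigQuery how many bytes the query would process, without running it."""
    from google.cloud import bigquery  # imported lazily; needs google-cloud-bigquery

    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)  # a dry run returns immediately
    return job.total_bytes_processed, bytes_to_usd(job.total_bytes_processed)


def run_with_cap(client, sql: str, max_usd: float):
    """Run the query, but have BigQuery reject it if it would bill past the cap."""
    from google.cloud import bigquery

    config = bigquery.QueryJobConfig(maximum_bytes_billed=usd_to_bytes(max_usd))
    return client.query(sql, job_config=config).result()


# Usage (hypothetical table name; requires credentials and a billing project):
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   sql = "SELECT COUNT(*) FROM `your-project.your_dataset.your_table`"
#   scanned, usd = dry_run_estimate(client, sql)
#   print(f"{scanned / TIB:.2f} TiB, about ${usd:,.2f}")
#   rows = run_with_cap(client, sql, max_usd=10)
```

The dry run itself is not billed, so checking the estimate first and then running with a byte cap gives two separate chances to catch an expensive query.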

When I use the BQ interface, it estimates the bytes for each query in real time before I run it. Does that turn off if the query is too big? I guess that isn't directly a cost estimate, but if I saw hundreds of TB I'd think twice before hitting "Run"...

> Google is apparently now hosting this dataset to make money

Public datasets are hosted for free by Google (Amazon has a similar program) to take the burden off public projects.

You didn't pay for the data, you paid for the query you ran against it.

Well sure, but how do you query the data they're hosting for free without using Google services?

Well, sure. But it is convenient to have lots of sample data. Also you get the first TiB per month free in BQ.

Also note that anyone can make a dataset available for public use, where they pay the storage and the consumer pays the compute. The official Google datasets are just curated and maintained by Google itself.
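
To make the free TiB concrete, here is a small sketch of how a month of on-demand queries might be billed. It assumes a 1 TiB monthly free allowance as mentioned above and the $6.25 per TiB rate quoted elsewhere in this thread; the function name is made up, and real invoices may differ in the details.

```python
TIB = 2**40  # bytes in one tebibyte
FREE_TIB_PER_MONTH = 1.0  # free allowance mentioned in this comment
USD_PER_TIB = 6.25  # assumption: on-demand rate quoted elsewhere in this thread


def monthly_on_demand_cost_usd(bytes_scanned_this_month: int) -> float:
    """Approximate monthly on-demand bill after subtracting the free allowance."""
    scanned_tib = bytes_scanned_this_month / TIB
    billable_tib = max(0.0, scanned_tib - FREE_TIB_PER_MONTH)
    return billable_tib * USD_PER_TIB


if __name__ == "__main__":
    print(monthly_on_demand_cost_usd(TIB // 2))  # 0.0, still inside the free allowance
    print(monthly_on_demand_cost_usd(3 * TIB))   # 12.5, two billable TiB
```

So casual exploration of a public dataset can stay free, but the allowance is tiny next to a query that scans hundreds of TiB.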

If you're going to make a throwaway account to criticize a website, you shouldn't use that website name as your username. That makes you look like a troll even if you have legitimate complaints.