by mikeortman 850 days ago
The dataset IS free to download, but running a query against it on Google Cloud is what costs $$$. BigQuery is basically renting servers to scan through the data, and that is what the fee pays for.
2 comments

The complaint says there should be a warning that processing fees can be high. Go to the front page and check out the links. Nothing really about cost. Someone follows that path and 14k gone without a word about it. That's the path that people are sent down from the website. It explicitly talks about using BQ for analysis.

A simple "running queries over the whole dataset can cause significant costs due to the size of the dataset" should be enough. And I think that's a valid and fair point.

The whole part of accusing Google should just be ignored.

The setup instructions mention what you’re asking.

https://github.com/HTTPArchive/httparchive.org/blob/main/doc...

I can't even find "cost" on that page. Only one rather tiny side note that you could get past the free tier quota.

I don't think that's a proper warning on costs.

> The whole part of accusing Google should just be ignored.

I don't know. Google could trivially solve this problem by imposing an opt-out warning on potentially expensive queries.

"It looks like your query might cost $14k. Are you sure?"

But money.

It probably wasn't a single query costing $14k, but more like 1k queries costing $14 each.
Given how small the dataset is, there is no query that justifies a $14k charge.

AWS charges $27/hour for a server with 3TB of memory. Enough to run the queries in memory.

BQ charges you based on the volume of data being scanned. I think this is a situation which involves scanning the whole dataset again and again without fully understanding how it works. I’ve worked with much larger datasets on BQ (petabyte scale) and managed to not spend more than $1000 in an hour. Also, BQ tells you how much data will be processed BEFORE you run the query, which makes it easier to understand the cost implications.
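For anyone who wants that check in code rather than in the console, here is a rough sketch of a dry run with the google-cloud-bigquery Python client. The helper names are made up for illustration, and the 5.0 USD per TiB rate is an assumption, not a quoted price, so plug in whatever the current on-demand rate is.

```python
def make_dry_run_config():
    """Build a job config that asks BigQuery to plan the query without running it."""
    # Imported lazily so the helpers below work without the library installed.
    from google.cloud import bigquery  # requires google-cloud-bigquery
    return bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)


def dry_run_bytes(client, sql, job_config):
    """Return the number of bytes the query would scan, per the dry run."""
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed


def estimated_cost_usd(bytes_scanned, usd_per_tib=5.0):
    """Convert scanned bytes to dollars. usd_per_tib is an ASSUMED rate."""
    return bytes_scanned / 2**40 * usd_per_tib


# Hypothetical usage (needs credentials and a billing project):
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   scanned = dry_run_bytes(client, "SELECT ...", make_dry_run_config())
#   print(f"would scan {scanned} bytes, roughly ${estimated_cost_usd(scanned):.2f}")
```

A dry run is not billed, so it is cheap to run before every exploratory query.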

Again, you could fit the whole dataset in memory in an EC2 instance and do your thing.

It's easy to make an enormous query by joining to other data (or to the same data), or reading a lot of data.

A regex query on response_bodies would churn through 2.5TB of data every time it's run.
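Back-of-the-envelope version of that, as a sketch. The 2.5TB per run and the $14k total come from this thread; the 5.0 USD per TB rate is an assumption for illustration, not a quoted price.

```python
import math


def cost_per_run(tb_scanned, usd_per_tb):
    """Dollars billed for one query that scans tb_scanned terabytes."""
    return tb_scanned * usd_per_tb


def runs_to_reach(total_usd, tb_scanned, usd_per_tb):
    """How many identical full scans it takes to reach total_usd."""
    return math.ceil(total_usd / cost_per_run(tb_scanned, usd_per_tb))


TB_PER_RUN = 2.5   # from the thread: one regex pass over response_bodies
USD_PER_TB = 5.0   # ASSUMED on-demand rate, check current pricing

print(cost_per_run(TB_PER_RUN, USD_PER_TB))           # dollars per run
print(runs_to_reach(14_000, TB_PER_RUN, USD_PER_TB))  # runs to hit $14k
```

Under that assumed rate, each run costs about as much as the "$14 each" guessed upthread, and it takes on the order of a thousand runs to reach $14k.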