Hacker News new | ask | show | jobs
by epanchin 845 days ago
To be brutally honest, it’s badly considered queries like yours that mean these services cannot be free.
3 comments

This comment is hilariously insane.

The idea that Google would give away BigQuery for free if only people would write better SQL queries.

There are plenty of APIs on the internet that are free that query a database for information. If queries are too expensive it's not viable to run for free.
Or, GCP could implement cost/resource/use limits, which would allow them to give away whatever they wanted for free without any concern about people over using it, while also allowing people to avoid shooting their own feet off.
I don’t disagree but how does that work exactly? When you hit the quota the query gets cancelled? That’s definitely already a feature of Redshift Spectrum with WLM. Does BigQuery offer something similar?
My first choice would be something like "this query will cost $13953, which exceeds your default cap of $100; please click the confirm button if you really want to run it". (The dollars could be CPU-minutes or whatever if you want to use resource based limits, which might play nicer with a free tier)

Edit: rereading, I think this is actually for non-interactive scripts, in which case yes it should just cancel the query

Edit 2: https://news.ycombinator.com/item?id=39447499 was kind enough to point out that the resource-based version of this might actually exist, which is nice

https://cloud.google.com/bigquery/docs/best-practices-costs#...

You can set the size limit for individual queries. Plus the custom quotas and everything.

Part of the problem is that the OP wrote a script with a loop. So say you set the limit to 50 GiB per query, but then write a script that runs a 49 GiB query 1000 times...

That type of batch process should be designed much more carefully to consider costs.

> ... the OP wrote a script with a loop.

Are you sure?

The article doesn't say anything about a loop, and the estimated usage by the Google responder makes it seem like the cost is from a single "SELECT *".

According to https://news.ycombinator.com/item?id=39447465:

> I was doing historical evaluation for a few sites, so I was running a query for each month going back to 2016 for each site. I've done this before with no real issues, and if I knew the charges were rapidly exploding I'd have halted the script immediately - but instead it ran for 2 hours and the first notice I got was the CC charge.

So looks like a loop of ((6 * 12) + 2) * #sites iterations with a full table scan every time.

Thanks, that does add further detail after all. :)
I've forgotten more Sql than most people ever learn. Time is also valuable and I make trade-offs. Should I spend hours (eg. $$$) to optimize or run a non-optimized query in the background for a different cost? Well, I didn't think the time/benefit/cost equation favored tuning, if I had known that I'd have spent time on tuning. If you offer something for "free" and then change the cost, and don't have any alerting mechanisms to inefficient queries, it's impossible to evaluate trade offs.
Can you post what a $14,000 SQL query looks like?

If nothing else, it can be an example in my SQL 101 course.

It's rarely interesting logic that makes it expensive. Because the per-query charge is not based on compute cycles but the amount of data scanned. This is sufficient:

`SELECT * FROM super_wide_table_with_lots_of_text WHERE NOT filter_on_partitions_or_clusters`

Select * is dangerous because it's a column store. You really need to look at the schema and select only the things you want. And when exploring the data it's important to use sane limits and pull from a single partition.

Here you go!

SELECT page, url, payload FROM `{table}` WHERE page like '%{site_domain}/%' AND url like '%[EXAMPLE.COM]%'

---

There's no LIMIT on it b/c I actually need all the results.

This would make a great educational blog post
> Time is also valuable and I make trade-offs.

I'd say!