by coreyp_1 866 days ago
Rule #1. BQ is not a standard database. If you use it like one, it will cost a fortune.

Rule #2. BQ is amazing for being able to churn through and analyze massive amounts of data, and can very well be the best option in some use cases.

Rule #3. Letting "just anyone" run queries is as dangerous as casually handing a credit card to your drug-addicted cousin. Just wait until you get the bill!

Rule #4. Partition and cluster your data wisely. You don't have indexes.

Rule #5. Duplicate data. Throw all of the normal forms out the window. Storage is cheap, computation is expensive.

Rule #6. BQ is not meant to be used like MySQL. Its "spin up" time is too slow, but you would be hard-pressed to beat its performance on truly large data sets.

My perspective: One of our customers has a database growing by 17 gigs a day. One of them. There are several on the same scale. Yes, it's necessary. Another instance: One of our customers spent $8k in one month because limits were not properly placed on the account, and we didn't catch it until the bill came. We monitor better now. A different instance: We had a dev trying to optimize a query, and they spent $250 in queries to get the cost down from $50/query to $15/query. Most of the time, though, our queries cost only pennies.
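
To put the optimization story in numbers, here is a small sketch of the break-even arithmetic, using only the figures quoted above. The function name is made up for illustration.

```python
import math


def break_even_runs(tuning_cost: float, cost_before: float, cost_after: float) -> int:
    """Number of query runs needed before the tuning spend is recovered."""
    saving_per_run = cost_before - cost_after
    if saving_per_run <= 0:
        raise ValueError("tuning did not reduce the per-query cost")
    # Round up: a partial run does not recover anything.
    return math.ceil(tuning_cost / saving_per_run)


# $250 spent on tuning, $50/query before, $15/query after.
print(break_even_runs(tuning_cost=250, cost_before=50, cost_after=15))  # 8
```

So the $250 pays for itself on the eighth run, which is a good deal for any query that runs regularly.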

Now that I've written all of this out, I feel like I need to record a video about it. There's not a lot of BQ info aside from the marketing fluff put out by "teh Google".


OP here. 100% agreed on your analysis. Thanks for chiming in. Coming from the Postgres world, this was very counterintuitive for me. I am still not convinced that a database should charge thousands of dollars because of a missing index (cluster). It could either create the index automatically or explicitly warn the user up front that the query may be expensive or slow.
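
For what it's worth, an up-front warning can be approximated on the client side with a dry run. Below is a minimal sketch, assuming the `google-cloud-bigquery` Python client and on-demand (bytes scanned) pricing. The helper names, the per-TiB rate, and the dollar limit are all hypothetical inputs you would supply yourself, not values from this thread.

```python
def on_demand_cost_usd(bytes_processed: int, usd_per_tib: float) -> float:
    """Convert scanned bytes into an on-demand charge at a caller-supplied rate."""
    return bytes_processed / 2**40 * usd_per_tib


def warn_if_expensive(client, sql: str, usd_per_tib: float, limit_usd: float) -> float:
    """Dry-run `sql` and raise if the estimated cost exceeds `limit_usd`."""
    # Imported lazily so the pure helper above works without the library installed.
    from google.cloud import bigquery

    # A dry run validates the query and reports bytes it would scan, without running it.
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    estimate = on_demand_cost_usd(job.total_bytes_processed, usd_per_tib)
    if estimate > limit_usd:
        raise RuntimeError(
            f"query would cost about ${estimate:.2f}, limit is ${limit_usd:.2f}"
        )
    return estimate


# Illustrative rate only: 2 TiB scanned at an assumed $5.00 per TiB.
print(on_demand_cost_usd(2 * 2**40, usd_per_tib=5.0))  # 10.0
```

You would call `warn_if_expensive(bigquery.Client(), sql, usd_per_tib=..., limit_usd=...)` before the real query. As a harder stop, `QueryJobConfig` also accepts `maximum_bytes_billed`, which makes the job fail instead of billing past that many bytes.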

BigQuery seems to suffer from being overly internal Googly.

A bizarre conversation I witnessed between the BigQuery team and my company at the time (a major customer):

Company: "We need to be able to see our consumption and a breakdown of our bill"

BQ team: "Oh, yeah. We can see how that would be useful. We should probably build that..."

Like, this was a GA product without any thought given to self-serve billing visibility.

I realize billing is usually the last thing bolted on, but I'd expect some basics to be in place before the product ships.

We use BQ quite extensively. There are a number of billing tuning options that are not well documented.

1. For some, it will make sense to move to pricing based on CPU time per query instead of billing on TB of data scanned. This can be done through the commitments in the UI.

2. There is an option to have storage billed on logical or physical bytes. If you have data with a lot of duplication (enums, customer IDs, etc.), physical billing can be a much better option. Last I looked, this was only available through a CLI setting on the dataset, and you may need to ask Google to include you in their beta. We lowered our storage bill by 30%.
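
A rough way to sanity-check option 2 before switching: compare what the same dataset would cost under each model. This is only a sketch. The function name, the byte counts, and both per-GiB rates are hypothetical placeholders, so plug in your own dataset sizes and the current prices for your region.

```python
def compare_storage_billing(
    logical_gib: float,
    physical_gib: float,
    logical_usd_per_gib: float,
    physical_usd_per_gib: float,
) -> tuple[str, float, float]:
    """Return the cheaper billing model plus the monthly cost under each."""
    logical_cost = logical_gib * logical_usd_per_gib
    physical_cost = physical_gib * physical_usd_per_gib
    cheaper = "physical" if physical_cost < logical_cost else "logical"
    return cheaper, logical_cost, physical_cost


# Hypothetical dataset: 1000 GiB logical that compresses to 200 GiB on disk,
# with made-up rates where physical bytes cost twice as much per GiB.
cheaper, logical_cost, physical_cost = compare_storage_billing(
    logical_gib=1000,
    physical_gib=200,
    logical_usd_per_gib=0.02,
    physical_usd_per_gib=0.04,
)
print(cheaper)  # physical
```

The takeaway is that physical billing only wins when your compression ratio beats the price ratio between the two models, which is why highly repetitive data benefits the most.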

I try to keep an eye on GCP release notes to find things like the physical vs logical billing.

Use BQ to crunch the larger set into a smaller subset that you need and ram that into PG/MySQL.

Used this to power tracking for an affiliate platform with $30M+ in revenue.

Thanks for these rules. As a budding engineer, I find this very insightful. Looking forward to your video.