by stairlane 868 days ago
We've recently been struggling with BigQuery and various other GCP services (i.e. Cloud Run and Pub/Sub), as using these services can feel like a minefield of gotchas, with their documentation and limits/quotas spread all over the place. It's given us more problems than solutions thus far, albeit that could very well be our fault.

Has anybody else had this experience? Or are we just doing it wrong?

This is not intended to be a rant, just curious.

8 comments

I've generally found something similar: lots of gotchas, but also some very useful products.

The best way I've found to approach it is to treat GCP as something that has to be evaluated at an individual service level. It's great if you're on one of their expected workflows/golden paths, and you can get lucky with a good fit if you aren't, but they seem to have a lot of unspoken assumptions and limits baked in that might or might not align with your use case.

Disclaimer: My use cases are pretty unusual from talking to our account rep, so this might be over-fitting to weird data.

At the beginning of my career, I incurred some hundreds of dollars in losses with BigQuery and Google Cloud Functions. The problem with these services is that they are easy and intuitive enough for a beginner to use, but a nightmare to maintain.
That's nothing. Wait until you incur $150k of Lambda costs in a day!
Did it happen to you? Sorry to hear!
Project ended up saving 10x that per year, so wasn't a huge deal. Quickly rewrote it to run as a traditional server, though.
Invocation loop?
Nah, I was using it for benchmarking downstream services, and the benchmark "worked" in that it overloaded the downstream services and accidentally had the Lambdas waiting too long for responses (we had to wait, to simulate real load and connections).

It was originally estimated at like 10k or something per test, which was approved at the time (had like 3 levels of management all down my neck to get it out, hence using Lambda originally).

We did deliver, just needed one more sprint to rewrite it as a distributed system on servers. ;) Moved to like 20 machines w/ 128 GB of RAM that we could spin up as needed (testing millions of events a second, system in Node.js!)
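For anyone wondering how waiting Lambdas blow up an estimate: compute is billed per GB-second, so cost grows linearly with how long each invocation sits waiting. A rough sketch of the arithmetic follows. The per-GB-second rate and the example numbers are assumptions for illustration, not figures from this thread, and request charges are ignored.

```python
# Assumed price per GB-second of Lambda compute; check current AWS pricing.
ASSUMED_USD_PER_GB_SECOND = 0.0000166667


def lambda_compute_cost(invocations, avg_duration_s, memory_gb,
                        usd_per_gb_second=ASSUMED_USD_PER_GB_SECOND):
    """Estimate compute cost in USD (duration charges only, no request fees)."""
    if min(invocations, avg_duration_s, memory_gb) < 0:
        raise ValueError("inputs must be non-negative")
    return invocations * avg_duration_s * memory_gb * usd_per_gb_second


if __name__ == "__main__":
    # Hypothetical load test: same invocation count, but the downstream
    # service stalls and each function waits far longer than planned.
    planned = lambda_compute_cost(1_000_000, avg_duration_s=0.2, memory_gb=1.0)
    stalled = lambda_compute_cost(1_000_000, avg_duration_s=30.0, memory_gb=1.0)
    print(f"planned ${planned:,.2f} vs stalled ${stalled:,.2f}")
```

The point is only that the estimate is as good as the assumed duration; a cap on function timeout bounds the worst case.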

I've had almost the same experience. First I was super impressed by how easy it is to get data into BigQuery and retrieve it using their IDE.

But really soon I noticed the slow startup … simple queries took too long (1.2 sec vs milliseconds in a traditional database)

Then I learned a lot about BigQuery views. That helped a little.

At some point I simply wanted to export data. New Google tools needed to be learned: Cloud Storage, Dataflow.

After 18 months of using BigQuery on roughly 850 million rows, I switched back to a traditional database.

I'm glad you learned that lesson for less than $1k. I think everyone who's ever worked with large amounts of data in BigQuery has a story like that, and sometimes the number is six or seven digits.
Most of the time, it feels a little bit embarrassing, but the cost is just a very small part of your regular salary and overall operating cost. If your boss hits you with this, they don't have the correct perspective and priorities.

My record is $20k and it raised some eyebrows. But it was not really a mistake, just a sub-optimal backfill.

The data was filling a need for making appropriate business decisions, and compared to all the money lost by business developers making investments on a hunch, this was a very small bump in the road.

I agree that some GCP services are better than others.

I’ve never used Pub/Sub or Cloud Run, but have been quite happy with BigQuery and GKE.

BigQuery has more footguns than GKE in my experience, but that’s perhaps because I have a lot more experience with GKE and know how to avoid those footguns. To me at least it’s understandable enough to say More Nodes is More Money but completely non-straightforward to say that this query I wrote is going to scan the data in a new and expensive way. Am I doing it wrong?
> To me at least it’s [...] non-straightforward to say that this query I wrote is going to scan the data in a new and expensive way. Am I doing it wrong?

When you put a query in the BigQuery console, it'll tell you "This query will process ??? MB when run" at the top right.

So if you code all your queries interactively in production (which is what everyone else is doing anyway) it's not too hard to keep an eye on.
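If you'd rather not eyeball the console, the same "bytes processed" number can be turned into a dollar estimate and a budget check. A minimal sketch, assuming on-demand pricing at a flat per-TiB rate; the rate below is an assumption, so check the current pricing page. The function names are made up for illustration. The Python client can produce the byte count without running the query via `QueryJobConfig(dry_run=True)` and the job's `total_bytes_processed`, which is not called here.

```python
# Assumed on-demand rate in USD per TiB scanned; verify against current pricing.
ASSUMED_USD_PER_TIB = 6.25
TIB = 1024 ** 4


def estimate_query_cost(bytes_processed, usd_per_tib=ASSUMED_USD_PER_TIB):
    """Estimate on-demand cost in USD for a query scanning bytes_processed bytes."""
    if bytes_processed < 0:
        raise ValueError("bytes_processed must be non-negative")
    return bytes_processed / TIB * usd_per_tib


def within_budget(bytes_processed, max_usd, usd_per_tib=ASSUMED_USD_PER_TIB):
    """Return True if the estimated cost does not exceed max_usd."""
    return estimate_query_cost(bytes_processed, usd_per_tib) <= max_usd


if __name__ == "__main__":
    # Hypothetical dry-run result: a query that would scan 3 TiB.
    scanned = 3 * TIB
    print(f"estimated cost: ${estimate_query_cost(scanned):.2f}")
    print("ok to run" if within_budget(scanned, max_usd=10.0) else "over budget")
```

Wiring a check like this into CI or a query wrapper catches the expensive scan before it runs, instead of on the bill.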

Are you using slots (https://cloud.google.com/bigquery/docs/slots)? If you aren't, I'd highly recommend you switch. My guess is that it would make your costs much more predictable (it did for us).

Note that this is not the default! :-)

At my previous workplace we had a mix of bare metal (most services), AWS (one service), Digital Ocean (misc), and GCP (BigQuery), and eventually moved almost entirely onto GCP, retaining just a bit of Digital Ocean stuff.

We found that all of these had significant caveats that required careful planning. We had a few instances of runaway AWS costs due to basically not knowing enough about AWS, and we had to be careful to only use the "good" AWS products. Digital Ocean never had runaway costs, but they did keep turning off production services because our use case was not one they were familiar with (dev machines, off-site backups). Bare metal was a minefield; we found we couldn't reliably run Prometheus because it ate SSDs. As for GCP, it did require understanding the pricing and it was possible to shoot yourself in the foot with things, but no more than anything else.

There are going to be gotchas everywhere. Overall we had a great experience with GCP, to the point that the company has remained on GCP post-acquisition by another company who were mostly on Azure.

We have been using all those GCP products and more without any significant problems.

But I do agree, there are some gotchas. Pub/Sub examples: duplicated messages, shitty DLQ implementation (in my opinion), some developers had improper error handling which led to infinite resends of messages (because they nacked the message on error), etc.

However, I think the scaling and setup make up for all of that. You just need to specify a topic and subscription, and then you don't really have to care about resources or scaling at all, and that is SUPER nice. Also, Pub/Sub is stupidly cheap in comparison to any other similar product, at least that I know of.
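To make the duplicate and nack-loop gotchas concrete, here is an in-memory sketch of a handler that deduplicates by message ID and stops nacking after a fixed number of attempts. Everything in it (`Message`, `make_handler`, the `"ack"`/`"nack"` strings, the in-process `seen_ids` set) is hypothetical and stands in for the real client; in production the dedup store would need to be shared and persistent, and a dead-letter topic with a maximum delivery attempt count can do the parking for you.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Message:
    """Hypothetical stand-in for a delivered message."""
    message_id: str
    data: bytes
    delivery_attempt: int = 1


def make_handler(process: Callable[[bytes], None],
                 seen_ids: Set[str],
                 dead_letters: List[Message],
                 max_attempts: int = 5) -> Callable[[Message], str]:
    """Build a handler that returns "ack" or "nack" for each message."""

    def handle(msg: Message) -> str:
        # At-least-once delivery means duplicates: ack without reprocessing.
        if msg.message_id in seen_ids:
            return "ack"
        try:
            process(msg.data)
        except Exception:
            # Nacking forever causes endless redelivery of a poison message,
            # so park it after max_attempts and ack to stop the loop.
            if msg.delivery_attempt >= max_attempts:
                dead_letters.append(msg)
                return "ack"
            return "nack"
        seen_ids.add(msg.message_id)
        return "ack"

    return handle
```

The design choice is that a failure is retried a bounded number of times and then set aside for inspection, rather than retried until someone notices the bill or the backlog.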

Yes, I've found that you need to scrutinize the documentation, quotas, SKUs and billing statements quite closely, and you need to test everything before you run production at scale. I've seen unexpected billing due to an operation or resource using a different SKU than expected which didn't qualify for an account's discount, for example.
Very much. We ended up writing our own queries to try and figure out where the costs were coming from in BQ. Ultimately we decided to offload as much as possible to a self-managed ClickHouse cluster.
Yes, try to find the pause button for a push Pub/Sub subscription, for example...