Hacker News
Ask HN: Two people laid off due to known Google/Elastic billing bug. Now what?
11 points by william-at-rain 1983 days ago
Facts:

* GCP + Elastic Cloud have an acknowledged (in writing) bug that is charging us 10x the proper amount. It started early on New Years Day and affects clients using Elastic Cloud purchased through Google Cloud's Marketplace.

* We are a bootstrapped startup with ~$1M ARR (launched product in May 2020). Still growing and just recently breakeven.

* Google has tied up 100% of our cash with their auto-bill. Credit cards are full ($32k in charges in a week).

* I laid off two people because I couldn't pay them. Everyone else is due a paycheck at the end of the month.

* Google is still charging us 10x normal rates every day. Elastic is "working to fix the issue."

* We can't get off Elastic Cloud because we can't increase quotas for machines until... we pay our GCP bill.

I always thought the stories on here about Google were 1 in a million... but here we are.

"Support" on both sides is a stone wall. What should I do to save this company and all our jobs?

2 comments

Old crusty greybeard here.

If you're being charged 10x @32k, that's 3k/week 'normal' charges. This makes me think that your infrastructure could be hosted for about $3k/year, redundant, with bare metal and minimal downtime.

And that's with renting bare metal. Not even the cheaper option of owning your own.
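A quick back-of-envelope check of those numbers, as a sketch. It assumes the whole $32k was billed at a flat 10x over a single week, which the post implies but doesn't spell out; `normal_charge` is just an illustrative name.

```python
# Back-of-envelope check of the figures quoted in the post.
# Assumption: the full $32k was billed at a flat 10x multiplier in one week.

def normal_charge(billed: float, multiplier: float) -> float:
    """Return what the bill would have been without the overcharge multiplier."""
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    return billed / multiplier

billed_week = 32_000   # from the post: $32k in charges in a week
multiplier = 10        # from the post: charged 10x the proper amount

normal_week = normal_charge(billed_week, multiplier)
overcharge_week = billed_week - normal_week   # cash tied up by the bug
normal_year = normal_week * 52                # 'normal' spend, annualized

print(f"normal per week:     ${normal_week:,.0f}")
print(f"overcharge per week: ${overcharge_week:,.0f}")
print(f"normal per year:     ${normal_year:,.0f}")
```

Swap in the real invoice figures if the 10x didn't apply to every line item.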

There are lots of paths here, but I doubt you want to leave GCP, even though you're now seeing the perils of:

1) Zero real support from places like AWS/GCP/etc

2) Zero real control, or ability to easily migrate, once you tie-in

That said, as you believe you can do nothing to move/mitigate, I believe this is a logical appeal to GCP to 'fix things', by shaming them with this horrid bug, via this post.

And they should be shamed! GCP, like AWS, is charging an insane premium for such services.

However, there isn't much for us to do, except agree that the entire scenario is absurd. And my post serves that, as I hate to see so many young companies have all their runway eaten up by absurd, 100x to 1000x costs in the cloud, even outside of these bugs.

And bare metal is how you control your own destiny. It's also how you remain un-vendor locked into cloud 'extras'. You can also run your own 'cloud' on bare metal too.

NOTE: my email is in my profile. Suggest (if you want to remain unannounced) you create some sort of throw-away for yours, if you want to enable out-of-band communication on topics like this.

Thanks.

> If you're being charged 10x @32k, that's 3k/week 'normal' charges

Close - about $2,500 a week.

I spent the morning adding SSDs to a bare metal HP DL360p with 32 cores and 192GB RAM on it.

Elastic fixed their bug this morning - so the bleed has stopped (confirmed on GCP dashboard).

We still have $32k tied up, so I bought the hardware out of pocket w/ personal savings. We started ingest to the VMs (new single-server, multi-node ES deployment) as well.

> You can also run your own 'cloud' on bare metal too.

I'm thinking about buying three more servers and throwing them in our rack. Everything will be slower, but we won't have the same version of existential crisis in our future. That adds so much OTHER risk (and expense) as well... Multicloud is looking appealing here (via, ironically, K8s all the things).

Not clear on why everything will be slower. In my experience, when you run your own hardware, you get guaranteed processing power, RAM speed, and I/O responsiveness.

AWS rates, as an example, are so insanely high that you can literally buy a server with 100x the processing power for the same price point as an AWS instance. That is, with the hardware purchase amortized over 2 years or so.

In terms of hosting with bare metal, you can easily procure bare metal, non-managed, quite cheaply on a monthly basis. Here you're often paying 1.5x or 2x the price of raw hardware, yet you have someone managing all of the infrastructure for you.

(Think of this as leasing costs, without having the hardware left over afterwards)

The key is to have redundancy, just as you would in the cloud. However, the more you scale, the less that redundancy costs, as you do not require 2x the hardware, merely spare hardware to swap in during a failure event.

E.g., with 2 DB servers, one must be redundant, so 50% of your capacity is for redundancy. With 10 servers and one spare, only 10% of your capacity is for redundancy, not load.
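That redundancy arithmetic, sketched in Python. It assumes a fixed number of spare servers (one by default) out of the total, which is my reading of the example rather than a stated sparing policy; `redundancy_fraction` is just an illustrative name.

```python
def redundancy_fraction(total_servers: int, spares: int = 1) -> float:
    """Fraction of total capacity held back as spares rather than serving load."""
    if total_servers <= 0:
        raise ValueError("total_servers must be positive")
    if not 0 <= spares <= total_servers:
        raise ValueError("spares must be between 0 and total_servers")
    return spares / total_servers

# The overhead of a single spare shrinks as the fleet grows.
for n in (2, 4, 10, 20):
    print(f"{n:>2} servers, 1 spare: {redundancy_fraction(n):.0%} of capacity is redundancy")
```

If you want to survive two simultaneous failures, pass `spares=2`; the same shrinking-overhead effect holds, it just starts from a higher base.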

And you can easily spread hardware out over multiple datacenters.

Worst case, you can always target said hardware for development, saving on costs for testing/devs, if you ensure that deployment is not specific to one cloud provider. You can easily roll out, for example, KVM and spawn whatever containers you wish inside each VM.

Although, I'll get all greybeard on you here, and suggest the merits of containerization don't overcome the reduced reliability due to additional complexity, the security issues due to unvetted upstream images, and the list goes on.

But that's a whole other conversation, and you likely have few free CPU cycles to think on such things right now.

EDIT:

I'm not sure what industry you're in, but most companies I deal with tend to have very consistent load. That is, day in and day out, load is relatively consistent, with little variance between December 3rd and May 5th.

Or, at least, peak load doesn't suddenly go off the charts.

If you're not consumer facing, but instead corporate facing, you often have a much better eye into load.

For example, you know more load is coming due to monthly subscription numbers.

So two things:

- as you literally pay 1/100th the cost for bare metal, scaling is typically not even a factor here. You save absurd amounts of cash, because you can already scale to 100x the load for the same price.

100x the load is an insane amount of scale.

You're almost certainly going to run into non-hardware issues at that scale. DB provisioning/size. Scale and scope of data storage. Code issues. Internal networking issues.

- in the old days, devs would spend more time optimizing code. For example, I've literally reduced DB load by astonishing numbers (not even joking here): 1000x load reduction due to query rewrites, 100k times faster due to code rewrites.

By the way, I realise I have no optics here into your setup. Just consider this COVID lockdown-free, bored out of my mind advice. :P

[I work at Elastic]

Can you shoot me an email at markw at elastic dot co? I will see what I can assist with.