by aleph- 2916 days ago
So piggybacking on this, I have a similar story to tell. We had a nice young startup, infra entirely built out on Google Cloud. Nicely, resiliently built, good solid stuff. Because of a keyword monitor picked up by their auto-moderation bot, our entire project was shut down immediately, and we weren't able to bring it up for several hours. Thank god we hadn't gone live yet, as we were then told by support that because of the grey area of our tech, they couldn't guarantee this wouldn't keep happening. And in fact they told us straight out that it would and we should move.

So maybe think about which hosting provider to go with. Don't get me wrong, I like their tech, but their moderation does need a more human element; to be frank, all their products do. Simply ceding control to algorithmic judgement just won't work in the short term, if ever at all.

14 comments

I’m starting to favour buying physical rack space again and running everything 2005 style with a lightweight Ansible layer. As long as your workload is predictable, moving to the cloud, with the lock-in, unpredictability, navigation through the maze of billing, weird rules and what-the-fuckism you have to deal with on a daily basis, is merely trading one vendor-specific hell for another. Your knowledge isn’t transferable between cloud vendors either, so I’d rather have a hell I'm totally in control of, where the knowledge has some retention value and will move between vendors with no problems. You can also span vendors, thus avoiding the whole all-eggs-in-one-basket problem.
Hybrid is what you are looking for. Have a rack or two for your core and rent everything else from multiple cloud vendors, integrated with whatever orchestration you are running on your own racks (K8s? DC/OS? Ansible?).
Or just two DCs in active/active.

Still works out cheaper for workloads than AWS does even factoring staff in at this point.

AWS always turns into cost and administrative chaos as well unless it is tightly controlled, which in itself is costly and difficult the moment you have more than one actor. GCP is probably the same, but I have no experience with that. It is much more difficult to end up in that kind of chaos when you have physical constraints.

Two-man startup, perhaps, but I think the transition should go:

VPS (Linode etc.) for MVP, then a colo half rack, then active/active racks across two sites, then scale out however your workload requires.

More importantly, there is a wealth of competent labor in the relatively stable area of maintaining physical servers (both on the hardware and software side). The modern cloud services move fast and break things, leading to a general shortage of resources and competent people. As a business, even if slightly more expensive initially, it makes more sense to start lower and work up to the cloud services as the need presents itself.
You can federate Kubernetes across your own rack and one or more public cloud providers.
You can but that’s another costly layer of complexity and distribution to worry about.

One of the failure modes I see a lot is failing to factor in latency in distributed systems. Mainly because most systems don’t benefit at all from distribution and do benefit from simplification.

The assumption on here is that a product is going to service GitHub or stackoverflow class loads at least, but literally most aren’t. Even high profile sites and web applications I have worked on tend to run on much smaller workloads than people expect. Latency optimisation by flattening distribution and consolidating has higher benefits than adopting fleet management in the mid term of a product.

Kubernetes is one of those things you pick when you need it not before you need it. And then only if you can afford to burn time and money on it with a guaranteed ROI.

Sure. The idea is that you get the benefits of public cloud and cost savings of BYO hardware for extra capacity at lower cost. Of course, you're now absorbing hardware maintenance costs as well. I haven't seen a cost breakdown really making a strong case one way or the other, but my company is doing it anyway.
Have you actually done this, or are you repeating stuff off the website? Because everyone I've talked with about kubernetes federation says it's really not ready for production use.
The approach we have taken is to create independent clusters with a common LoadBalancer.

Basically, the LB decides which kubernetes cluster will serve your request and once you're in a k8s cluster, you stay there.

You don't have the control-plane that the federation provides and a bit of overhead managing clusters independently, but we have automated the majority of the process. On the other hand, debugging is way easier and we don't suffer from weird latencies between clusters (weird because sometimes a request will go to a different cluster without any apparent reason <-- I'm sure there's one, but none that you could see/expect, hence debugging).
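To make the "once you're in a cluster, you stay there" idea concrete, here is a minimal sketch of how a load balancer could pin a client to one cluster. This is not our actual setup: the cluster names, the `pick_cluster` function and the choice of hashing a session ID are all assumptions for illustration.

```python
import hashlib

# Hypothetical cluster names; nothing here comes from a real deployment.
CLUSTERS = ["cluster-a", "cluster-b", "cluster-c"]


def pick_cluster(session_id: str, clusters=CLUSTERS) -> str:
    """Deterministically map a session ID to exactly one cluster.

    The same session ID always hashes to the same index, so every
    request from that session lands in the same cluster.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(clusters)
    return clusters[index]


if __name__ == "__main__":
    for sid in ("alice", "bob", "alice"):
        print(sid, "->", pick_cluster(sid))
```

Note that plain modulo hashing reshuffles most sessions when the cluster list changes; a real load balancer would more likely use cookies or consistent hashing for that reason.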

My people's time is more important than your complex system.

Ha. It's in process. Not ready yet. I'll report back if we fail miserably.
Federation v1 is legacy now. The new architecture is called MultiCluster and designed to work on top of K8S rather than having a leader cluster: https://github.com/kubernetes/community/tree/master/sig-mult...
That's exactly what we are thinking too. We've looked HARD into AWS/GCP/Azure, but for all the reasons you mentioned we don't want to go that route. Owning the entire stack is so much cheaper, both money and time wise.
Have you looked at OCI bare metal shapes? [1] Oracle Cloud provides the server, and you control the stack end to end (including the hypervisor).

If you run into an issue, send me a note and I will get someone to reply to your issue.

1. https://cloud.oracle.com/compute/bare-metal/features

This needs more upvotes
I can tell a similar story with Amazon MWS, where even if we had access to "human support", it felt like talking to some bad ML, not understanding what we were saying. Ultimately that startup was disbanded, never violating any rule they had, but flagged because of a false positive, and we couldn't even prove we didn't violate anything because we hadn't even gone live yet. It felt Kafkaesque, punishing one of a myriad possible intents due to malfunctioning ML, with no recourse.

Maybe support just needed to satisfy their quota of kicked out companies for the month, who knows?

Lol. That's definitely a possibility.
Is it only me, or does it seem that if you are not a "famous" person who has a lot of public visibility and is able to create pressure through a tweet or blog post, you are lost: no number to call, no mail to write? Over the years I saw a lot of similar stories, YouTube or in general "Google accounts" blocked for no clear reason and no way to contact somebody to solve the issue... kinda scary...
> Because of a keyword monitor picked up by their auto-moderation bot

Can you elaborate on that? What do they monitor with the moderation bot?

The point is that anyone could fall into that category when laws change.

Imagine you're running a cosplay community, and all of a sudden all your content is being deleted because the SESTA/FOSTA bill gets passed in the country where your "cloud" happens to reside: https://hardware.slashdot.org/story/18/03/25/0614209/sex-wor...

"because of the grey area of our tech"

"told us straight out that it would and we should move"

Sounds shady. I bet this would make more sense if OP explained what his company actually does.

Exactly -- there is a lot the OP isn't telling us. Maybe Google was right to shut them down.
I'd rather my cloud provider err on the permissive side. Preemptively shutting suspicious things down without an external complaint seems a bit much…
Well, there are all kinds of grey area stuff. One fairly obvious example is various security services, which have a wide variety.

Not everything is outright "likely to get banned" (eg pron things). ;)

Yup, agree. OP was probably doing something against the terms. Care to provide details?
Well, "grey" can mean a lot of things when you are talking about the same company that moderates Youtube.
I know they specifically ban cryptocurrency mining on their free credit / tier. Even called out on their public product pages.

I assumed they could tell that via CPU usage, which they already monitor for quotas.

+1 - I'm curious as well.
This occasionally happens with gsuite users as well. Businesses lose access to all their email and documents.

Good times.

I've gotta wholeheartedly disagree. I've never encountered this on GCE. I run a DevOps consulting company and for standard EC2/machines I much prefer GCP. It's not even close. AWS for the most part has little or no user experience testing on its UIs and developer interfaces. AWS region-specific resources are a nightmare; billing on GCP with sustained use and custom machine types is vastly superior. Disks are much easier to grok: no provisioned IOPS, EBS-optimized, or enhanced networking hoopla.

By chance are you located outside of the United States? These are not downtime issues, but anti-fraud prevention and finance issues.

I've noticed that over the last few years it's become increasingly difficult to do things with US-based services (especially banking) if you are outside of the US. And this goes double if you are a US citizen with no ties to the States other than citizenship. Americans as a general rule have never been terribly adept at anything international: banking, languages, or even basic geography. We have offices in Cambodia and Laos and I have been told by more than one US-based service/company that Laos is not a real country. I suppose they think the .la domain stands for Los Angeles :) We are looking to set up an office in Hong Kong or Singapore and use that to deal with Western countries. But we're a small not-for-profit operation and HK and Singapore are EXPENSIVE.
What gray area are you in?
> the grey area of our tech

Cryptocurrency?

Blockchain for sure
> because of the grey area of our tech

The nature of the tech in question seems important in this story.

I am really curious, what was the business?
Thanks for sharing, I thought maybe it was a one off - it helps to avoid similar issues and luckily there is plenty of cloud competition.
Was it porn or cryptos?
Sounds like "we were doing something sketchy, got caught, but somehow it isn't our fault".
It can happen to you too: you get hacked, hackers run arbitrary code in your account
If your cloud services account was hacked, you'd most likely be thanking Google or Amazon for stopping the services.
Stopped, yes. Deleting the project if the credit card account holder can't be reached to provide photo ID within 3 days might be an over-reaction, though.

I hope there is a possibility to put a backup contact person / credit card so organisations can deal with people going on vacation or being sick or whatever.

IMHO this should be nicely documented as any other technical material you get to learn about the cloud product when you create an account (e.g. important steps to ensure your account remains open even in case of important security breaches, yadda yadda it's possible we'll need a way to prove that you are you yadda yadda, this can happen when yadda yadda, be prepared, do yadda yadda)

I agree that it seems like an over-reaction. But on an account with intense usage, a single credit card on file, no backup, and a fraud warning it does seem very suspicious.

AFAIK, Google Cloud credit card payments are processed through Google Pay, which supports multiple credit cards, debit cards, bank accounts, etc.

Ideally, in this case the company shouldn't be using the CFO's credit card, but should have entered into a payments agreement with Google, receiving POs, invoices and so on, including a credit line.

Never set up a crucial service like you'd set up a consumer service.

Yes, that's a very good description of the best practices that sadly many companies are not really following.

In many situations the "right thing" must be explained, otherwise when people fail to get it they can argue that wasn't really the right thing after all (sure that's ultimately because they just want to deflect the blame from themselves; so don't let them! clearly explain the assumptions under which anti-fraud measures are operating so people cannot claim they didn't know)

> Nicely, resiliently built, good solid stuff.

Erm ... no, evidently not?

Projects in the context of GCP can encompass all the necessary infrastructure to build a highly available service using standard practices. There's no indication anywhere from GCP themselves that a project could be a domain of failure. If asked, I doubt they would consider it as such.

A prudent person might consider a cloud provider to be a domain of failure and choose a multi-cloud option, which would probably be the correct way to address this resiliency issue. However, that's not really an appropriate approach for an early stage startup, where availability is generally not that much of a concern.

In other words: It wasn't resiliently built stuff.

Is an exploding car safe because it is built by an early stage startup?

Just because you decide that implementing resiliency isn't a good business decision for some early stage startup, doesn't magically make the product resilient, it just isn't and that may be OK.

There are many options to choose from for implementing resiliency, it could be having multiple providers concurrently, it could be having a plan for restoring service with a different provider in case one provider fails, it could be by setting up a contract with a sufficiently solvent provider that they pay for your damages if they fail to implement the resiliency that you need, whatever. But if you fail to consider an obvious failure mode of a central component of your system in your planning, then you are obviously not building a resilient system.
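As a rough illustration of the "plan for restoring service with a different provider" option, here is a minimal failover sketch. Everything in it is hypothetical: the provider names, the `choose_provider` function and the health-check callables are placeholders, and a real plan would also have to cover data replication, DNS and billing.

```python
from typing import Callable, Optional, Sequence, Tuple

# A provider is a (name, health_check) pair. The health check is any
# zero-argument callable returning True when the provider is usable.
Provider = Tuple[str, Callable[[], bool]]


def choose_provider(providers: Sequence[Provider]) -> Optional[str]:
    """Return the first healthy provider in priority order, or None.

    A health check that raises is treated the same as an unhealthy
    provider, so one broken check cannot block the failover.
    """
    for name, is_healthy in providers:
        try:
            if is_healthy():
                return name
        except Exception:
            continue
    return None


if __name__ == "__main__":
    # Simulate the primary provider having suspended the project.
    providers = [
        ("primary-cloud", lambda: False),
        ("standby-cloud", lambda: True),
    ]
    print(choose_provider(providers))
```

The point is only that the provider itself appears in the failure model; whether the fallback is a second cloud, a colo rack or a contract clause is a separate decision.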

Edit: One more thing:

> There's no indication anywhere from GCP themselves that a project could be a domain of failure. If asked, I doubt they would consider it as such.

Then you are asking wrong, which still is your failure if you are responsible for designing a resilient system.

If you ask them "Is a complete project expected to fail at once?", of course they will say "no".

That's why you ask them "Will you pay me 10 million bucks if my complete project goes offline with less than one month advance warning?", and you can be sure you will get the response to the problem that you are actually trying to solve.

> A prudent person might consider a cloud provider to be a domain of failure and choose a multi-cloud option, which would probably be the correct way to address this resiliency issue. However, that's not really an appropriate approach for an early stage startup, where availability is generally not that much of a concern.

If you replace "multi-cloud" with "multi-datacenter" (in the pre-cloud days), this premise is fairly unassailable. In those same days, applying it to "multi-ISP", it becomes more arguable.

Today, though, the incremental cost (money and cognitive) of the multi-cloud solution, even for an early startup, doesn't seem like it would be high enough to make the notion downright inappropriate to consider.

I'd even argue that if a cloud provider makes the lock-in so attractive or multi-cloud so difficult, that's a sign not to depend on those exclusive services.

> choose a multi-cloud option

The economics don’t work out if you are trying to do this with just vanilla VMs across AWS, GCP and Azure and managing them yourself. You either do it the old-fashioned way, renting rack space and putting your own kit in, or you make full use of the managed services, at which point - by design - you are locked in.