by wrnr 1453 days ago
How long did it take him to do this setup, a year you say, and that is impressive? I am not trying to be cute here, my question comes from a genuine place of curiosity. I'd love to learn to spin up a system like that, but from the tech/sales talks I see, I am made to believe this can be done in a day. Expectation management is important: if people say ops is just a solved problem, then I expect this to take very little time and to be easy to learn. Maybe I am learning the wrong thing here, and should learn Helm or something more high level.
6 comments

It took a year, but that was somewhat on the side of also building an OpenAPI-based web service and the gRPC-based workers. So, it wasn't just the infrastructure stuff. If I were to estimate how much time went into just the infrastructure and devops tooling, then two months. It's been up and running with less than 15 minutes of downtime over the course of two years.

I do consider this impressive. And, to be clear, I wouldn't say this is because of a "super-developer". In fact, he had no prior k8s experience. But rather that there are thousands upon thousands of infrastructure hours devoted to the helm charts, often maintained by the people who develop the services themselves. It is almost mind-boggling how much you get for almost free. Usually with very good and sensible default configurations.

In my previous workplace, we had a team of 5 good engineers purely devoted to infrastructure, and I honestly believe that all five would have been able to spend their time doing much more valuable things, if k8s had existed.

As for whether or not such devops solutions could be done in a day. Hm. I don't know. These things should be tailored to the problem. If you've done all of this a few times, then maybe you can adjust a bunch of charts that you are already familiar with and do what took a couple of months and impressed me, in a couple of weeks. A lot more than just "helm install. Done" goes into architecting a scalable solution: implementing monitoring, alerting and logging, load testing stuff, etc.

Sounds like a waste of months that could have gone into building product by choosing simpler operational tech.

That seems like a very negative take in my opinion. This 'simpler operational tech' would still need to be able to scale, correct? If you consider a good and easier way of deploying 10-15 services, all of which can scale, and all of it defined in rather neat code, to be anything but "simple operational tech", then I believe you are confusing "solving a complex problem" with "simplifying the requirements of a complex problem", the latter of which has been stripped of many important features. K8S isn't anything magic, but it certainly isn't a bad tool to use. At least not in my experience, though I've heard horror stories.

That does remind me that when that employee started, the existing "simple operational tech" was in fact to SSH into a VM and kill the process, git pull the latest changes, and start the service.

The only way you can solve the actual problem (not a simplified one) would in my opinion either be k8s or terraform of some kind. The latter would mostly define the resources in the cloud provider system, most of which would map to k8s resources anyways. So, I honestly just consider k8s to better solve what terraform was made for.

I'm sure the "simpler operational tech" meets few requirements for short disaster recovery. Unless you have infrastructure as code, I don't think that is possible.

>That seems like a very negative take in my opinion. This 'simpler operational tech' would still need to be able to scale, correct?

Premature optimization is a top problem in startup engineering. You have no idea what your startup will scale to.

If you have 1,000 users today and a 5-year goal of 2,000,000 users, then spending a year building infrastructure that can scale to 100,000,000 is an atrociously terrible idea. A good principal can set up a working git hook, CircleCI integration, etc., capable of automated integration testing and rather close to CI/CD, in about a weekend. Like, you can go from an empty repo to serving a web app as a startup in a matter of days. A whole year is just wasteful insanity for a startup.

The reality for start-ups running on investor money with very specific plans full of OKRs and sales targets is very different: you need to be building product as fast as possible and not giving any fuck about scale. Your business may pivot 5 times before you get to a million users. Your product may be completely different and green-fielded two times before you hit a million users.

I can't imagine any investor being ok with wasting a quarter of a million+ and a year+ on a principal engineer derping around with k8s while the product stagnated and sales had nothing to drive business -- about as useful as burning money in a pit.

You hire that person in the scale-up phase, during like the third greenfield, to take you from the poorly-performing 2,000,000 user 'grew-out-of-it' stack to that 100,000,000+ stack, and at that point, you are probably hiring a talented devops team and they do it MUCH faster than a year.

If you have a website with 1000 users today and the product is going to be re-designed 5 times, it's probably best just to use sqlite and host on a single smallish machine. Not all problems are like this however.

Yeah, to be honest, I run a k8s cluster now for my SaaS. But it's about 4 times more expensive than my previous company, which I ran on a VPS.

And scaling is the same: that VPS I could just scale the same way, by running a resize in my hosting company's panel. (I don't use autoscaling atm.)

Only if I hit about 100x the numbers would I get the advantage of k8s, but even then I could just split customers up across different VPSes.
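For what "split customers up across different VPSes" could look like, here's a minimal sketch, assuming customers are routed by a stable hash of their id. The host names and the function name are made up for illustration; nothing here is from the setup described above.

```python
import hashlib

# Hypothetical host names; in practice these would be your VPS addresses.
VPS_HOSTS = ["vps-1.example.com", "vps-2.example.com", "vps-3.example.com"]


def vps_for_customer(customer_id: str, hosts: list[str] = VPS_HOSTS) -> str:
    """Pick a host from a stable hash of the customer id.

    Python's built-in hash() is randomized per process for strings,
    so a SHA-256 digest is used to keep the mapping stable across restarts.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(hosts)
    return hosts[index]
```

Worth noting that with plain modulo hashing, adding a host moves most customers to a different VPS, so an explicit customer-to-host lookup table is probably the more practical choice once data lives on those machines.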

CI/CD can be done well or badly with both.

And in practice K8s is a lot less stable. Maybe because I'm less experienced with K8s, but also because I think it's more complex.

To be honest, k8s is one of those dev tools that has to reinvent every concept again, so it has its own jargon. And then there are these ever-changing tools on top of it. It reminds me of JS a few years ago.

>This 'simpler operational tech' would still need to be able to scale, correct?

Only if "scaling" is the problem that your startup is solving.

Any startup that knows what its product is and is done with PoCs should be able to deal with the consequences of succeeding, without failing. Scaling is one of those things that should be in place before you need it. In our case, scaling was a main concern.

> In our case, scaling was a main concern.

and ... you might be justified in that concern. However... after having been in the web space for 25+ years, it's surprising to me how many people have this as a primary concern ("we gotta scale!") while simultaneously never coming close to having this concern be justified.

I'm not saying it should be an either/or situation, but... I've lost count of how many "can it scale?" discussions I've had where "is it tested?" and "does it work?" almost never cross anyone's lips. One might say "it's assumed it's tested" or "that's a baseline requirement" but there's rarely verification of the tests, nor any effort put in to maintaining the tests as the system evolves.

EDIT: so... when I hear/read "scaling is a main concern" my spidey-sense tingles a bit. It may not be wrong, but it's often not the right question to be focused on during many of the conversations I have.

Just keep it simple, and if you take off, scale vertically while you work on a scalable solution. Since most businesses fail, premature optimisation just means you're wasting time that could have gone on adding more features or performing more tests.

It's a trap many of us fall into - I've done it myself. But next time I'll chuck money at the problem, using whatever services I can buy to get to market as fast as possible to test the idea. Only when it's proven will I go back and rebuild a better product. I'll either run a monolith or 1-2 services on VPSs, or something like Google Cloud Run or the AWS equivalent.

Scaling something no one wants is pointless.

> good and easier way of deploying 10-15 services

Why are so many micro-services needed? Could the software be deployed in a more concise manner?

Not getting into the whole monolith-vs-services arguments. In both cases, complexity of deployment is part of the cost of each option.

I should perhaps have clarified, but the 10-15 are not self-maintained services. You need nginx for routing and ingress; you set up cert-manager so that the other ingress endpoints are automatically configured to have TLS; you deploy prometheus, which comes with node-exporter and alert-manager; and you deploy grafana.

So far, we're up at 6 services, yet still at almost zero developer overhead cost. Then add the SaaS stack for each environment (api, worker, redis) and you're up at 15.
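As a quick tally of that count, assuming three environments (the number of environments isn't stated; three is simply what makes the total come to 15, and the environment names are placeholders):

```python
# Shared services named in the comment above.
shared = ["nginx", "cert-manager", "prometheus",
          "node-exporter", "alert-manager", "grafana"]

# Per-environment SaaS stack named in the comment above.
per_env = ["api", "worker", "redis"]

# Assumption: three environments; the names are placeholders.
environments = ["dev", "staging", "prod"]

total = len(shared) + len(per_env) * len(environments)
print(total)  # 6 + 3 * 3 = 15
```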

those are basically all things that can be outsourced and not for much money (cloudflare etc)

Sometimes it's faster to implement certain features in another language and deploy it as a microservice instead of fighting your primary language/framework to do it. Deploying microservices in k8s is as easy as writing a single yaml file.

Makes sense, though 15 different languages?

I am not privy to the details of the case, but a rule-of-thumb I heard once is that if it's far enough from your core, a SaaS can be used (obviating the whole question), and if it's part of the core, start by developing it as a separate functionality before moving it to another service.

In a lot of cases it's pattern abuse. I'm dealing with this all the time. People like to split things that can work perfectly as one whole, just for the sake of splitting it.
for example lambda (not microservices, running mini monoliths per lambda function)

yes by simple I mean covering high availability requirements, continuous deployment, good DORA measures - not simple as in half-baked non-functional operations (such as manually sshing to a server to deploy)

Ah, I see. Well, lambdas are also a nice tool to have, but they certainly do not fit all applications (same as with k8s). I'd also point out that lambdas replace only a rather small subset of the capabilities of k8s, and of the type of systems you can put together. You would end up needing to set up the rest either through a terrible AWS UI or terraform. Neither of which I find to simplify things all that much, but perhaps this is a matter of taste.

In our case, the workers were both quite heavy in size (around 1 GB), and heavy in number crunching. For this reason alone (and there are plenty more), lambdas would be a poor fit. If you start hacking them to keep them alive because of long cold starts, you would lose me at the simple part.

>If you start hacking them to keep them alive because of long cold starts,

this is a few years out of date of platform capability, just fyi

How would you possibly know one way or the other?

the heck?

Having very recently done this (almost, another dev had half time on it) solo, it's not _too_ terrible if you go with a hosted offering. Took about a month/month and a half to really get set up and has been running without much of a blip for about 5 months now. Didn't include things like dynamic/elastic scaling, but did include CD, persistent volumes, and a whole slew of terraform to get the rest of AWS set up (VPCs, RDS, etc). I'd say that it was fairly easy because I tinkered with things in my spare time, so I had a good base to work off of when reading docs and setting things up, so YMMV. My super hot take: if you go hosted and you ignore a ton of the marketing speak on OSS geared towards k8s, you'll probably be a-ok. K8s IME is as complex as you make it. If you layer things in gradually but are very conservative with what you pull in, it'll be fairly straightforward.

My other hot take is to not use helm but rather something like jsonnet or even cue to generate your yaml. My preference is jsonnet because you can very easily make a nice OO interface for the yaml schemas with it. Helm's approach to templating makes for a bit of a mess to try and read, and the values.yml files _really_ leak the details.
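To give a rough idea of the generate-your-manifests approach, here's a minimal sketch written in Python rather than jsonnet or cue, so it is not the setup described above. A small class wraps the apps/v1 Deployment schema and emits JSON, which kubectl accepts the same way as YAML. The class name, image names and defaults are all made up for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class Deployment:
    """Hypothetical wrapper around the apps/v1 Deployment schema."""
    name: str
    image: str
    replicas: int = 1
    port: int = 8080

    def manifest(self) -> dict:
        # The same labels are reused for the selector and the pod template,
        # which is the kind of detail a generator keeps consistent for you.
        labels = {"app": self.name}
        return {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "metadata": {"name": self.name, "labels": labels},
            "spec": {
                "replicas": self.replicas,
                "selector": {"matchLabels": labels},
                "template": {
                    "metadata": {"labels": labels},
                    "spec": {
                        "containers": [{
                            "name": self.name,
                            "image": self.image,
                            "ports": [{"containerPort": self.port}],
                        }]
                    },
                },
            },
        }


# Illustrative values only; the image names are placeholders.
api = Deployment(name="api", image="example.com/api:1.0", replicas=3)
worker = Deployment(name="worker", image="example.com/worker:1.0")

if __name__ == "__main__":
    for deployment in (api, worker):
        print(json.dumps(deployment.manifest(), indent=2))
```

The point is the same as with jsonnet: callers only see `name`, `image`, `replicas` and `port`, and the nesting of the schema stays in one place instead of leaking into every values file.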

With 1 YoE I did most of that in about 3 months. Had a deadline of 6 months to get something functional to demonstrate the proposed new direction of the company, and I did just that. If I were to do it today I could probably rush it to a week, but that would mean no progress on the backend development that I was doing in parallel. A day is probably doable with more on-rails/batteries-included approaches.

Not because I'm amazing, but there's a frankly ridiculous amount of information out there, and good chunks of it are high quality too. I think I started the job early January, and by April I had CI/CD, K8s for backend/frontend/DBs, Nginx (server and k8s cluster), auto-renewing certs, Sentry monitoring, Slack alerts for ops issues, K8s node rollback on failures, etc.

The best way to learn is to do. Cliché, but that's what it really comes down to. There are a fair few new concepts to grasp, and you have probably picked some of these up almost by osmosis. It sounds more overwhelming than it is, truly.

The problem is never spinning things up, it's in maintenance and ops. K8s brings tons of complexity. I wouldn't use it without thinking very carefully for anything other than a very complex startup while you're finding product-market fit.

You can get a majority of those things "running" in a few days. If you don't want it to fall over every other day, then you need to have a ton of ancillaries which will take at least several months to set up, not to mention taking care of securing it.

Use a managed k8s cluster (EKS, AKS or GKE). Creating a production-ready k8s cluster on VMs or bare metal can be time-consuming. Yes, you can do Lambda, serverless, etc., but k8s gives you the same thing and is generally cheaper.

It's actually pretty easy to do these days, even on bare metal servers. My go-to setup for a small bare metal k8s cluster:

- initial node setup: networking configuration (private and public network), sshd setup (disallow password login), setting up docker, prepping an NFS share accessible on every node via the private network

- install RKE and deploy the cluster, deploy nginx ingress controller

- (optional) install rancher to get the rest of the goodies (grafana, istio, etc). These eat a lot of resources though, so I usually don't do this for small clusters

Done in a single afternoon.