by c0nrad 1950 days ago
I've been solo running/building a startup (csper.io) for over a year, and it hit profitability a few months ago.

It's easier said than done, but if you can prevent issues in the first place, things will be much more enjoyable.

Some things that worked well for me:

  * GKE on GCP is pretty smooth. When there's a spike in traffic everything autoscales up, so I don't have to do anything. Nice observability, things just work. Just make sure to set container cpu/mem limits.
  * Along the same lines, I use MongoDB Atlas, which autoscales both up and down very well. That saves money and makes my infra more resilient
  * GCP has a lot of monitoring/alerting/dashboards that I take advantage of. Health checks around the world, easy integration of logs/metrics. I find structured logging (JSON) makes setting up alerts pretty easy
  * Good consolidated logging, so that when there is an issue you know exactly what went wrong
  * GCP also supports application tracing, which can make timing issues easy to debug (for example, if you are missing an index on some db), although it takes a bit of work to set up
  * Automatic deployments (thanks to k8s). There's no checklist for doing a deploy; I just run a single make command. I can't screw that up
  * A staging environment that matches production. I've crashed staging plenty of times, and it's worth every penny. It also makes life much less stressful
  * Lots of tests. The tests matter less while I'm writing the code and more months later, when I make changes and want to know I didn't mess something else up. I find a good test suite can really help you sleep at night, especially if it covers the critical paths
  * An easy way for users to contact you if there is an issue. No one is perfect, but users usually forgive a problem if you respond quickly.
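
To make the structured logging point concrete, here is a minimal sketch of one-JSON-object-per-line logging in Python. It is not taken from csper.io; the `JsonFormatter` and `make_logger` names and the `extra={"fields": ...}` convention are made up for illustration. The `severity` and `message` keys follow the field names Cloud Logging treats specially when it ingests JSON from container stdout, which is what lets alerts filter on structured fields instead of regexes.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        entry = {
            "severity": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        # Hypothetical convention: callers pass extra structured fields
        # via extra={"fields": {...}}; they are merged into the entry.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def make_logger(name, stream=None):
    """Build a logger that writes JSON lines to `stream` (stdout by default)."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Usage would look something like `make_logger("api").error("db timeout", extra={"fields": {"latency_ms": 5200}})`, after which an alert can match on `severity` or `latency_ms` directly.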

Also, "stay-cations" are pretty nice. I try to do one a quarter. I'm still at home if something does break, but I don't do any work for the week. Just load up a new video game and relax for a week. I call it my "monitoring" week.

Hope that helps!

1 comments

Can you expand on the "Health checks around the world" ?
https://cloud.google.com/monitoring/uptime-checks

If I remember correctly, you can specify a bunch of regions for the health checks to originate from. It was super simple to set up (point and click), and it's nice that it's decoupled from the rest of my infrastructure. When there's a failure I get a notification.
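
An uptime check needs something to hit. As a rough sketch (an assumption, not the poster's actual setup), a service can expose a trivial HTTP endpoint that returns 200 when it is healthy. The `/healthz` path, the `HealthHandler` and `make_server` names, and port 8080 are all placeholders chosen for illustration; only the Python standard library is used.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answer GET /healthz with 200 "ok"; everything else gets 404."""

    def do_GET(self):
        if self.path == "/healthz":
            status, body = 200, b"ok"
        else:
            status, body = 404, b"not found"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence the default per-request stderr logging.
        pass


def make_server(port=8080):
    """Create (but do not start) the server; port 0 picks a free port."""
    return HTTPServer(("0.0.0.0", port), HealthHandler)
```

Calling `make_server().serve_forever()` starts it. In a real service the handler would presumably also check its dependencies (db connection and so on) before answering 200, so that a failing check actually means something is wrong.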