by gwbas1c 1649 days ago
I'm not sure which question you're asking:

Is this about processes as a small team grows into a larger company? At a certain point your day-to-day software engineers will lose access to production systems. You will need to take "ops" away from the software engineers and hand it to a dedicated "ops" team. Some places call the "ops" team "devops" for political reasons, especially if some founder has a chip on their shoulder about not having dedicated "ops." (Software engineers not touching production systems is an industry-normal security practice.)

Is this about how to do the database migration safely? That really depends on your stack, business type, and scalability needs. Assuming you aren't running a hyper-scale product, and you're on a normal database, the easiest thing to do is to have "planned downtime" at the time when your load is lowest. Your business logic layer should return some kind of 5xx error during this period, and your clients / UI should be smart enough to retry after 1-2 minutes. If it's a minor update (there's plenty of good advice in this thread), the downtime should only be 1-2 minutes tops. (The only reason to plan and notify your customers is in case someone is trying to do something business critical and complains.) One thing you can do is "let it be known" that you have 5 minutes of planned downtime Monday night and Thursday night, and that's your window for anything that isn't an emergency.
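To make the "return a 5xx, let the client retry" idea concrete, here's a rough sketch in Python. Everything in it is my own assumption, not something from a particular stack: the MaintenanceGate class, the call_with_retry helper, the choice of 503 with a Retry-After header, and the 90 second wait are all just illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# (status code, headers, body) -- a stand-in for a real HTTP response.
Response = Tuple[int, Dict[str, str], str]


@dataclass
class MaintenanceGate:
    """Hypothetical switch the business logic layer checks on every request."""
    in_maintenance: bool = False
    retry_after_seconds: int = 90  # assumed; inside the 1-2 minute range

    def handle(self, real_handler: Callable[[], Response]) -> Response:
        if self.in_maintenance:
            # 503 Service Unavailable plus a hint for when to come back.
            headers = {"Retry-After": str(self.retry_after_seconds)}
            return 503, headers, "planned downtime"
        return real_handler()


def call_with_retry(
    send: Callable[[], Response],
    sleep: Callable[[float], None],
    max_attempts: int = 5,
    default_wait: int = 90,
) -> Response:
    """Client side: retry on any 5xx, waiting as long as the server asks.

    Assumes Retry-After is given in whole seconds (it can also be an
    HTTP date, which this sketch does not handle).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        response = send()
        status, headers, _body = response
        if status < 500 or attempt == max_attempts:
            return response
        sleep(int(headers.get("Retry-After", default_wait)))
    raise AssertionError("unreachable")
```

In a real client you'd pass time.sleep as the sleep argument and show the user some kind of "back in a minute" message while waiting. Taking sleep as a parameter just keeps the retry logic easy to test without actually waiting.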

Is this about the frequency of updates? This is a quality problem, and the only thing you can do is improve your testing and release process to catch these bugs sooner. This is the "growing up" that all small tech companies go through. As you grow, make sure to bring in some mid-to-late-career people who've been through these changes. In short, you will need to introduce processes that catch bugs sooner, like automated testing and code coverage. You may find that your "test engineers" write nearly as much test code as your software engineers put into the shipping product.

1 comments

It's not just hyper-scale products that preclude having nice planned maintenance windows. Serving a handful of countries spread out across the globe is enough to throw the concept out the window. I had an application that started out US-only, and was built assuming that 11pm-7am Eastern was available to do big data imports, run reports, etc. But once a couple of other countries got added, there was no longer any time of the day or night where that assumption held true. A lot of processes had to be re-architected because of that.
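A quick back-of-the-envelope way to see how fast the window disappears, sketched in Python. The UTC offsets below are made up for illustration (the actual countries aren't named above), daylight saving time is ignored, and each region is assumed to have the same 11pm-7am local quiet period.

```python
def quiet_hours_utc(utc_offset_hours, start_local=23, end_local=7):
    """UTC hours (0-23) that fall inside one region's local quiet window.

    The window runs from start_local up to, but not including, end_local.
    """
    hours = set()
    h = start_local
    while h != end_local:
        hours.add((h - utc_offset_hours) % 24)  # local hour -> UTC hour
        h = (h + 1) % 24
    return hours


def shared_quiet_hours(utc_offsets):
    """UTC hours that are quiet in every region at once."""
    common = set(range(24))
    for offset in utc_offsets:
        common &= quiet_hours_utc(offset)
    return sorted(common)


# Hypothetical regions: -5 (US Eastern, standard time), +1, +8.
print(shared_quiet_hours([-5]))         # eight hours to work with
print(shared_quiet_hours([-5, 1]))      # only a sliver left
print(shared_quiet_hours([-5, 1, 8]))   # nothing left
```

With those assumed offsets, one region gives eight shared hours, two regions leave two, and three leave none, which is roughly the situation described above.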