Hacker News
by hexadec0079 3658 days ago
Wait, this seems to ignore the fact that with good change controls and sound code, products do not just fail at 3am. If they call everyone at 3am without a failure the student could have prevented, that does not teach anything other than how to answer a phone. Instead, teach them to properly engineer and document their solution such that they aren't called at 3am.

This seems like a waste of a good night's sleep to me.

3 comments

> Wait, this seems to ignore the fact that with good change controls and sound code, products do not just fail at 3am.

Because network outages never happen, disks never fail or fill up, memory is never an issue, programs always deal with only the data they were expected to, products never do more traffic than expected, and all infrastructure software ships completely bug-free.

If you aren't occasionally up at 3 AM fixing unexpected outages, then either you haven't deployed a project that requires uptime or you're paying someone else to do it for you.

> Because network outages never happen, disks never fail or fill up, memory is never an issue, programs always deal with only the data they were expected to, products never do more traffic than expected, and all infrastructure software ships completely bug-free.

Several of these are exactly the kind of thing where the whole idea of DevOps just falls apart. If a disk fails, what use is a dev versus a good ops person?

The issue is not with problems, but with the arbitrary nature of this as a learning tool. Getting a page to fix an issue that you foresaw and prevented is dumb. Teach them to make resilient systems rather than to get out of bed.

I have never gotten up at 3am to fix an outage, because outages do not impact systems that widely. If it is a network problem, let the network team do their job. All of your other problems are solved with multiple HA/load-balanced servers, monitoring, and proper testing.

I am the one behind this part of the curriculum; I've been an SRE/DevOps engineer for the past 5 years. The project that the article refers to is about uptime: better uptime means a better grade.

We partnered with companies such as PagerDuty and Wavefront so that the students have tools to help them keep their uptime, and we guide them to put everything in place so that their websites/servers never go down. We never call them at 3 AM; in fact, we never call them at all. However, we do have challenges where we simulate hardware failures, traffic spikes...

Just to make things clear: the goal is not to wake students up at 3 AM but to get them ready for production, and this involves being on call and possibly getting paged at 3 AM.

My hope would be that they somehow engineered a failure on all the systems that they would have to debug at 3am.
So the students engineered themselves to fail or the instructors gave them a terrible system that is bound to fail? Shit going sideways is a fact of life, but you can at least let people have the chance to prevent the problems before they exist.

While there is always someone else's mess you have to support, that does not seem like a great teaching moment to me.