by mbiondi 3416 days ago
The only reason they are still in existence is due to a chance backup they took for a tangential reason. From the sounds of it, their solution is held together with bubble gum, some tape and lots of hand waving. Being in 160 different locations probably doesn't help much either.
3 comments

I'm pretty sure most solutions on the internet consist of bubble gum, some tape and lots of handwaving. Gitlab's screwup is hardly unique, even if it was very public.

It's difficult and expensive to build and maintain a solid system, and even if you want to, time and financial pressures often just don't let you. On top of that, it's hard to even communicate the need for solid engineering, since that need usually only becomes apparent once the problems start occurring.

Unless I'm misreading things, the reason they only lost 6 hours of data instead of 24 hours was a chance backup, but there was never an existential crisis here.

Downtime happens to pretty much every service out there. In this case the company was incredibly forthright and so we can make fun of their stupid mistakes, but really most mistakes are stupid when you look at them -- when you make thousands of decisions a day, some of them will seem silly in hindsight. We just never learn about most of them.

I'm not defending them -- but that is the norm.

I once had a customer in the 90s take a 30-hour outage that cost them nearly $6M in fines because some asshole put a budget freeze on anything related to cleaning, including tape drive cleaner carts. The dopey ops guy kept using one tape on multiple drives, making them do nothing.

I could personally rattle off a dozen stories like this from late-stage startups, Fortune 10 companies and .gov.

The only reason many businesses are alive is luck and reliable SAN.