Hacker News
by ivanhoe 1052 days ago
It's been a pretty well-known fact for years that Tarsnap is basically a one-man show, and yet Colin has managed to provide fantastic service so far. Sometimes having the people who built the service also manage it is actually a big plus, compared to other services where you first have to fight through outsourced & underpaid support that's limited to template answers, only to finally get some "engineer" who got that job 2 months ago and is more clueless about their system than I am...
3 comments

And to be frank, I've seen plenty of mission-critical services at $bigco which may have had a team of engineers working on them, but the core functionality was maintained, understood, and supported by effectively one senior engineer. If anything went wrong, the supporting junior staff might have been able to fix reasonably simple stuff, but there was essentially one person who understood the system deeply enough to handle problems of any real significance.
Absolutely.

Early in my career, I became the second person able to support and operate a system that was public facing and responsible for billions of dollars of activity that mattered to many individuals and stakeholders. The entire team retired over a period of six months, after giving the folks in charge a year or more of notice. After about 12 weeks, I was the sole guy, training 4-5 new people.

We’re all probably using a service like this. As demonstrated by Twitter, well-engineered systems can persist, even without proper care and feeding, until they don’t.

I hate to bring this up, but what about the bus factor? If Colin is physically unable to continue maintaining the service and something like this happens again, how will anyone be able to get their data out? It's not really a concern about the service Tarsnap provides today
There's an old Sys Admin saying (perhaps from Allan Jude of ScaleEngine) that goes something like "if your data doesn't exist in at least three places, it doesn't actually exist at all..."

That is to say, if Tarsnap is the only place you're keeping sensitive/important data, then you're "not doing it right" as a backup. Things happen... your hard drive can die suddenly and a data center can burst into flames, all on the same day.

I feel like ovh will never stop hearing about this. This has been, frankly, a traumatic event for many sysadmins I believe, and one that was shared by many from the same source, which is quite different from the standard variation of "that time when I erased the production database" (looking at you gitlab, but also at myself!). I mean, at this point it's somewhere between a legend and a cautionary tale and I don't know what else to call it. A bad Wednesday probably.
> I feel like ovh will never stop hearing about this.

To be fair, they deserve it a bit as they went up in flames twice.

Indeed, after the first fire, the geniuses over there collected all the UPSes and batteries they could find from the DC and stored them all in a pile in a closed container... where they predictably bulged, failed, sparked and eventually triggered another fire after a couple of days.

Why the scare quotes? I would expect any well-experienced power user to know a complicated system better than a fresh engineer two months into working on it, with no previous experience on the system. Especially if the power user is an engineer themself.