Hacker News
by forapurpose 2997 days ago
Yes, I saw the same thing. Even when I've had people reviewing logs, I make the date prominent and train them to check it first. BVS's data is far more critical.

(I learned from a burnt hand: We engaged a non-technical but reliable user to check the daily backup log for errors and report any to us. One day we needed the backup and discovered it hadn't run at all for many weeks. Ouch. I asked the user; she said she indeed checked the logs daily and they were fine. She was right: She was seeing the log from the last backup, unchanged every day. My fault entirely, not hers: We should have anticipated the date problem, and we should have utilized someone technically literate enough to understand what they were reading - in this case, someone who would recognize an 'obvious' problem such as the numbers of files and bytes not changing. And we should have tested our backups more often, but that old lesson almost isn't worth mentioning.)
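
A minimal sketch of the two checks described above, so they don't depend on a person noticing: is the log recent, and did the file and byte counts change since the previous run? Everything here is an assumption for illustration. The function name, the parameters, and the 26-hour threshold are hypothetical, and how you get the log's timestamp and counts (file modification time, a parsed summary line, and so on) depends on your backup tool.

```python
from datetime import datetime, timedelta


def check_backup_log(log_time, now, files, nbytes, prev_files, prev_nbytes,
                     max_age=timedelta(hours=26)):
    """Return a list of red flags for a daily backup log; empty means none found.

    All names and the 26-hour default are illustrative assumptions,
    not taken from any particular backup tool.
    """
    problems = []

    # Check the date first: a log older than one daily cycle (plus slack)
    # means we are probably looking at an old run.
    if now - log_time > max_age:
        problems.append(f"log is stale: last written {log_time:%Y-%m-%d %H:%M}")

    # Identical counts two runs in a row are suspicious on an active system.
    # This is a warning, not proof: a quiet system can legitimately match.
    if (files, nbytes) == (prev_files, prev_nbytes):
        problems.append("file and byte counts are identical to the previous run")

    return problems
```

Feeding it the current time, the log's timestamp, and the counts from the last two reports gives a list to send along with the daily report. It is no substitute for actually testing a restore.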


Not that it will comfort you much, but you are hardly alone in this. Bad backups, logging on the same servers where access takes place, and single points of failure in personnel are some of the most frequently occurring things I come across in my 'day job'.
Been there, done that.

I recently helped a company, big enough that everyone here would recognize the name, fix a lot of items around monitoring and logging after finding that they were running an important production system essentially flying blind. Yeah, those fatal errors in the logs just might be important...

Thanks, and I know it well; that was an early lesson. Backups in particular are an amazing cesspool of problems for something so conceptually simple.
That's the thing that always bugs me: the vast majority of the items that I end up with on the todo list after a review would cost $0 or very little to get right.

Super frustrating. And you can't even rely on things staying fixed either; you have to review periodically or it will be back to square one within the year.

> the vast majority of the items that I end up with on the todo list after a review would cost $0 or very little to get right

Agreed, and I drive people crazy with my focus on those things. Thorough design and implementation (including testing) up front cost far less than correcting problems later, and they don't add the enormous cost of downtime and other failures.

But ... I've found that human beings, even serious professionals, have a capacity limit for details, and it's not very high; and if it's for an over-the-horizon risk, attention is very limited. That is my biggest constraint: editing down the details, organizing them, automating them, and making trade-offs to reduce them to a point where others don't throw up their hands. Also, it's hard to get the budget for that up-front investment in what looks to others like obsessiveness (it's not; it's carefully considered ROI).

So when you show up for your review (I don't know exactly what you do, but I have an impression), 1,000 details might have been addressed but 50 overlooked. Or 1,050 details might have been implemented but there was no capacity for the next 100: resources ran out, something else came up, etc.

So I can see it both ways.

Good stuff, thank you. I can see there might be some way to get a process in place to avoid these relapses.