Hacker News
by andrenth 2729 days ago
What's your strategy for handling possible filesystem failure/corruption scenarios without a team that understands the underlying technology?
1 comments

That's a great point. I do have a team that understands the underlying technologies and has successfully troubleshot several production problems with Rook/Ceph, including a recent case of file system corruption. My original post was only trying to say that our engineering team does not maintain deep operational knowledge of the best way to configure, manage, monitor, scale, etc. (that is, operate) Ceph in production. We rely on the Rook operator for this.

Troubleshooting acute outages caused by hardware or software failures requires a different skill than properly configuring the system to scale and to minimize the chances of corruption or outages. Rook solves the latter, but we do understand the architecture and what Rook (and Ceph) are doing. We've just removed the expert-level, craftsman, specialty knowledge required to operate Ceph because we decided, after a thorough evaluation, that the software in this case is the most capable solution.

I find this unusual because the knowledge required to troubleshoot a complex piece of software is usually much more complex than that required to set it up in the first place. In other words, how can you troubleshoot it if you don’t know how it’s built?

It’s a bit like debugging software you didn’t write.