Hacker News
by rleigh 2003 days ago
> Why does btrfs have those issues compared to other filesystems?

Why? There are several reasons, but if you go right back to the beginning, there's a single reason which caused all the other problems: they started coding before they had finished the design.

All of the other problems are fallout from that. Changing the design and the implementation to fix bugs after the initial implementation was done. Introducing more bugs in the process. And leaving unresolved design flaws after freezing the on-disc format.

When you look at ZFS as a comparison, the design was done and validated before they started implementing it. Not surprisingly, it worked as designed once the implementation was done. It really goes without saying that up-front design work is necessary for engineering complex systems.

This isn't even unique to Btrfs, but filesystems are one thing you can't hack around with without coming to grief; you have to get it right first time when their sole purpose is to store and retrieve data reliably. Many open source projects are riddled with problems because their developers were more interested in bashing out code than stopping and thinking beforehand. The same goes for a lot of closed source projects, for that matter.

In the case of Btrfs, which was aiming from the start to be a "better ZFS", they didn't even take the time to fully understand some of the design choices and compromises made in ZFS, because they ended up making choices which had terrible implications. Examples: using B-trees rather than Merkle hashes; this is at the root of many of its performance problems. Not having immutable snapshots; again has performance implications as well as safety implications, and is rooted in not having pool transaction numbers and deadlists. Not separating datasets/subvols from the directory hierarchy; presents logistical and administration challenges, while ZFS datasets can freely inherit metadata from parents and the mount locations are a separate property. ZFS isn't perfect of course, there are improvements and new features that could be made, but what is there is well designed, well thought out, and is a joy to work with.


Can you tell me how such an evaluation of a design is done? Is some kind of formal verification or analysis normal, or is it rather experimentation to figure out its properties?

Thank you for your input!

I wasn't involved so can't personally provide details of how this was done at Sun. Most of my knowledge comes from listening to talks and reading books on ZFS.

For work I'm involved in relating to safety-critical systems, we use the V-model for concepts, requirements, design and implementation, with extensive validation and verification activities at each level. Tools are used to manage all of the requirements, design details and implementation details and link them all together in a manner which aims to require self-consistency at all levels. When done correctly, this means that the person writing the code does not need to be particularly creative at this stage: the structure is completely detailed by the formal design. It does require significant up-front effort to carefully consider and nail down the design to this level of detail, but it avoids the need to continually revise and adapt an incomplete or bad design in a never-ending implementation phase.
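To give a rough idea of what "linking them all together" can mean in practice, here is a minimal sketch in Python. It is not modelled on any real requirements tool: the `Item` record, the level names and the identifiers such as `REQ-1` are all hypothetical, chosen only for illustration. It checks two simple consistency rules: every trace link must point at an item that exists, and every requirement must be reachable from at least one test.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One traceable artefact (hypothetical model, not a real tool's schema)."""
    ident: str
    level: str  # "requirement", "design", "code" or "test"
    traces_to: list[str] = field(default_factory=list)  # ids of parent items


def ancestors(item_id, by_id):
    """Collect every id reachable by following traces_to links upward."""
    seen = set()
    stack = [item_id]
    while stack:
        current = stack.pop()
        for parent in by_id[current].traces_to:
            if parent in by_id and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def find_gaps(items):
    """Return (dangling links, requirements that no test traces back to)."""
    by_id = {item.ident: item for item in items}
    dangling = sorted(
        (item.ident, parent)
        for item in items
        for parent in item.traces_to
        if parent not in by_id
    )
    covered = set()
    for item in items:
        if item.level == "test":
            covered |= ancestors(item.ident, by_id)
    untested = sorted(
        item.ident
        for item in items
        if item.level == "requirement" and item.ident not in covered
    )
    return dangling, untested


items = [
    Item("REQ-1", "requirement"),
    Item("REQ-2", "requirement"),
    Item("DES-1", "design", ["REQ-1"]),
    Item("TEST-1", "test", ["DES-1"]),
    Item("DES-2", "design", ["REQ-9"]),  # points at a requirement that does not exist
]

dangling, untested = find_gaps(items)
print(dangling)  # [('DES-2', 'REQ-9')]
print(untested)  # ['REQ-2']
```

Real tooling tracks far more (versions, review status, rationale), but the principle is the same: gaps in the chain from requirement to test are found mechanically rather than by memory.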

This approach is definitely not for everyone, and there are many things one can criticise about it. But if you are willing to bear the financial and time costs of doing that detailed design work up front, the cost of implementation will be much lower and the product quality will be much greater. There is a lot to be said for not madly mashing keys and churning out code without thinking about the big picture, and Btrfs is a case study in what not to do.

The V-model is interesting. I'm a student and kinda new to the different development models.

How do you decide whether such meticulous design is necessary or not? In hindsight Btrfs may have benefited, but how do you decide when to do it and when not to in the future?

I would also be interested to know what tools are used for this. The ones I looked at seemed quite dated.. :-)

Thank you for answering! This is very interesting to learn about

This is just my own personal take on things; I'd definitely recommend reading up on the differences between Waterfall, Agile and the V-model (and Spiral model). Note that you'll see it said that the V-model is based upon Waterfall, which is somewhat true, but it's not necessarily incompatible with Agile. You can combine the two and go all the way down and back up the "V" in sprints or "product increments", but you do need the resources to do all the revalidation and reverification at all levels each time, and this can be costly (this is effectively what the Spiral model is).

In terms of deciding if meticulous up-front design is necessary (again my own take), it depends upon the consequences of failure in the requirements, specifications, design and/or implementation. A random webapp doesn't really have much in the way of consequences other than a bit of annoyance and inconvenience. A safety-critical system can physically harm one or multiple people. Examples: car braking systems, insulin pumps, medical diagnostics, medical instruments, elevator safety controls, avionics etc. It also depends upon how feasible it is to upgrade in the field. A webapp can be updated and reloaded trivially. An embedded application in a hardware device is not trivial to upgrade, especially when it's safety-critical and has to be revalidated for the specific hardware revision.

For filesystems the safety aspect will relate to maintaining the integrity of the data you have entrusted to its care. Computer software and operating systems can have all sorts of silly bugs, but filesystem data integrity is one place where safety is sacrosanct. We set a high bar in our expectation for filesystems, not unreasonably, and after suffering from multiple data loss incidents with Btrfs, it's clear their work did not meet our expectations. We're not even going into the performance problems here, just the data integrity aspects.

I can't say anything about the tools I use in my company. There are specialist proprietary tools available to help with some of the requirements and specifications management. I will say this: the tools themselves aren't really that important, they are just aids for convenience. The regulatory bodies don't care what tools you use. The important part is the process, of having detailed review at every level before proceeding to the next, and the same again when it comes to validation and verification activities.

Often open source projects limit themselves to some level of unit testing and integration testing, which is fine. But the coverage and quality of that testing may leave some room for improvement. It's clear that Btrfs didn't really test the failure and recovery codepaths properly during its development. Where was the individual unit testing and integration test case coverage for each failure scenario? Where the V-model goes above and beyond this is in the testing of the basic requirements and high-level concepts themselves. You've got to check that the fundamental premises the software implementation is based upon are sound and consistent.
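As a toy illustration of what testing each failure scenario can look like, here is a small Python sketch. It has nothing to do with Btrfs's actual code or test suite: `FakeDisk`, `save`, `recover` and the file names are all hypothetical, and the "disk" is just an in-memory dictionary. The update uses the common write-a-temporary-copy-then-rename pattern, and the test simulates a crash before each operation in turn, checking that recovery always yields either the complete old value or the complete new one. It assumes each individual operation is atomic, so torn writes are not modelled.

```python
class Crash(Exception):
    """Simulated power loss."""


class FakeDisk:
    """In-memory stand-in for a disk that fails once N operations have run."""

    def __init__(self, files=None, fail_after=None):
        self.files = dict(files or {})
        self.fail_after = fail_after
        self.ops = 0

    def _tick(self):
        # Raise before performing the operation once the budget is used up.
        if self.fail_after is not None and self.ops >= self.fail_after:
            raise Crash()
        self.ops += 1

    def write(self, name, data):
        self._tick()
        self.files[name] = data

    def rename(self, src, dst):
        self._tick()
        self.files[dst] = self.files.pop(src)


def save(disk, data):
    """Update 'current' by writing a temporary copy, then renaming over it."""
    disk.write("current.tmp", data)
    disk.rename("current.tmp", "current")


def recover(disk):
    """Discard any half-finished update and return the surviving value."""
    disk.files.pop("current.tmp", None)
    return disk.files.get("current")


def check_every_crash_point(old, new):
    """Crash before each operation in turn and record what recovery returns."""
    results = []
    for fail_after in range(3):  # 0 and 1 are crash points; 2 means no crash
        disk = FakeDisk({"current": old}, fail_after=fail_after)
        try:
            save(disk, new)
        except Crash:
            pass
        results.append(recover(disk))
    return results


results = check_every_crash_point("old", "new")
print(results)  # ['old', 'old', 'new']
assert all(value in ("old", "new") for value in results)
```

The point is the loop over crash points: every step at which the operation can be interrupted gets its own test case, instead of only exercising the path where everything succeeds.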