|
To understand why this is, you have to appreciate the goals behind Write Anywhere File Layout (WAFL) file systems. [1] One of the goals of such systems is that the copy of the file system on disk is always consistent: turn the power off at any point and you can come right back up with a valid file system. This is accomplished by only ever writing to blocks taken from the free block list. You construct updated inodes, from the changed file all the way up to the root inode, out of new blocks, and then to "step" forward you write a new root block. This is really neat, and it means that when you've done that step you still have the old inodes and data blocks around. They just aren't linked, but you can link them to another "holder" inode attached to the name ".snapshot" and it will show you the file system as it was just before the change. Write the old root block back into the real root block and "poof!" you have reverted the file system to the previous snapshot.
OK, so that is pretty sweet and really awesome in a lot of ways, but it has a couple of problems. The first, as noted, is that it pretty much guarantees fragmentation, as it's always reaching for free blocks and they can be anywhere. On NetApp boxes of old that wasn't much of a big deal because everything was done "per RAID stripe": you were fragmented, but you were also reading and writing full stripes, so you had the bandwidth you needed and the fragmentation was absorbed by the efficiency of full-stripe reads/writes.
The second issue arises when you start getting close to full: managing the free block list gets harder and harder. You are constantly under low-block pressure, so you are constantly trying to reclaim old blocks (from unused or expired snapshots), and that leads to a big drop in performance. The math is that you can't change more of the data between snapshot steps than the amount of space you have free.
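To make the "step" concrete, here is a toy sketch in Python. It is not how WAFL is actually implemented. It assumes a two-level tree (one root block that maps file names to data blocks), and every name in it (`ToyCowFS`, `write_file`, `snapshot`, `revert`, `OutOfSpace`) is made up for illustration. The only point is to show that new data always goes to free blocks, that the root pointer moves last, and that a snapshot pins old blocks so they cannot be reclaimed.

```python
class OutOfSpace(Exception):
    """Raised when the free block list is empty."""


class ToyCowFS:
    """Toy copy-on-write store: a live block is never overwritten in place."""

    def __init__(self, total_blocks):
        self.total = total_blocks
        self.free = list(range(total_blocks))  # the free block list
        self.blocks = {}                        # block number -> contents
        self.snapshots = {}                     # snapshot name -> root block number
        self.root = self._alloc({})             # root block maps file name -> data block

    def _alloc(self, contents):
        # All writes go to a block taken from the free list.
        if not self.free:
            raise OutOfSpace("no free blocks left")
        n = self.free.pop(0)
        self.blocks[n] = contents
        return n

    def _reclaim(self):
        # A block is live if the current root or any snapshot root reaches it.
        live = set()
        for root in [self.root, *self.snapshots.values()]:
            live.add(root)
            live.update(self.blocks[root].values())
        for n in list(self.blocks):
            if n not in live:
                del self.blocks[n]
        self.free = [n for n in range(self.total) if n not in live]

    def write_file(self, name, data):
        try:
            data_block = self._alloc(data)        # new data block
            table = dict(self.blocks[self.root])  # copy of the current root
            table[name] = data_block
            new_root = self._alloc(table)         # new root block
        except OutOfSpace:
            self._reclaim()  # drop half-built blocks; the live root is untouched
            raise
        self.root = new_root  # the "step": only now does the change become visible
        self._reclaim()

    def snapshot(self, name):
        # Keeping a reference to the old root pins everything it points to.
        self.snapshots[name] = self.root

    def revert(self, name):
        # Equivalent of writing the old root block back as the real root.
        self.root = self.snapshots[name]
        self._reclaim()

    def read(self, name, snapshot=None):
        root = self.root if snapshot is None else self.snapshots[snapshot]
        return self.blocks[self.blocks[root][name]]
```

In this sketch a write that runs out of free blocks fails before the root pointer moves, so the visible file system stays consistent. It also shows the free-space math: with a snapshot held, every changed file costs new blocks until the snapshot is dropped.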
That is why NetApp filers would get cranky when used in build environments, where automated builds would delete scads of intermediate files only to rebuild them and then relink them. That is a big percentage change in the overall storage. On the positive side, storage is pretty darn cheap these days, so swapping in 3TB drives instead of 2TB drives means you could use all the storage you "planned" to use and keep the drives at 66% occupancy. It's hard on users, though, who will yell at you, "It says it has 10TB of storage and is only using 6TB, but you won't expand my quota?" At such times it would be useful for the tools to lie, but that doesn't happen.
[1] Disclosure 1: I worked for 5 years at NetApp with systems that worked this way. Disclosure 2: an intern with NetApp (we'll call him Matt) was very impressed with this and went on to work at Sun for Jeff, and similar solutions appeared in ZFS. |
Goal yes, implementation no. WAFL does in fact have consistency problems, and filers do ship with a consistency checker called "wack". If you ever need this tool, you'll probably have better luck throwing the filer in the trash and restoring from backups than waiting a month for it to complete.