This. We had the same failure mode, on a cluster with dozens of nodes and thousands of VMs.

http://thr3ads.net/gluster-users/2011/06/480298-Enomaly-user...

This bug (and others) would happen when moving files and/or folders over the top of existing ones, i.e.:

  mkdir -p foo/bar/baz && \
  mkdir -p foo/bar/tmp && \
  touch foo/bar/baz/file foo/bar/tmp/file && \
  mv foo/bar/tmp/file foo/bar/baz/ && \
  rm -rf foo/bar/tmp

This was a method to replace an existing lockfile; Mercurial uses some similar code. The expected behavior is that foo/bar/tmp/file replaces foo/bar/baz/file. Instead, there was a race in Gluster where it got confused about which version of 'file' was correct, and it would end up with a split brain between the two nodes. A node failure made this worse, but wasn't always required, and heavy load seemed to make the failure more likely. We couldn't reproduce the bug when moving files within the same folder; it was moves between subfolders of the same Gluster filesystem that seemed to cause the issue.

The frustrating part was how Gluster pointed fingers at the file operation as being incompatible, despite advertising a POSIX-compatible filesystem. Apparently the bug is fixed, but we moved off Gluster, never to return.
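
For anyone who wants to sanity-check the expected behavior on their own filesystem, here is a rough sketch of the same sequence with distinguishable file contents. The scratch directory from mktemp and the "old"/"new" markers are my additions for illustration; to test a particular mount you would cd into a directory on it instead. A single run like this only shows what the replace should look like and is unlikely to trigger the race by itself, since that seemed to need load.

```shell
#!/bin/sh
set -eu

# Scratch directory (assumption for this sketch); swap in a path on the
# filesystem you actually want to check.
workdir=$(mktemp -d)
cd "$workdir"

mkdir -p foo/bar/baz foo/bar/tmp
echo old > foo/bar/baz/file   # the existing lockfile
echo new > foo/bar/tmp/file   # the replacement, written in a sibling folder

# Within one filesystem, mv is a rename(2), which replaces the target atomically.
mv foo/bar/tmp/file foo/bar/baz/
rm -rf foo/bar/tmp

# After the move, the lockfile should hold the replacement's contents.
result=$(cat foo/bar/baz/file)
echo "$result"
```

On a filesystem that behaves the way POSIX describes, this prints "new" and foo/bar/tmp is gone.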

Also, filesystems are hard, I get it, so no hard feelings :)