by curtis3389 574 days ago
Regardless of everything else, most people should not be using bcachefs yet. Kent has even stated that unless you're okay not being able to access your data for chunks of time while bugs are being fixed, you shouldn't be using it. The conventional wisdom would be to wait 10 years after a new filesystem is introduced for it to stabilize before switching, so we're looking at summer next year at the earliest.

Apart from that, there are (or were, last I tried it six months ago) some performance bugs in the code.

Nothing that completely breaks it, but I found at the time that the high variance in read latency on Samsung 970 series NVMe drives causes the filesystem to also dispatch reads of cached data to the HDDs, even when that data is fully cached.

Which predictably increases latency a lot.

Really I should take another stab at fixing that, but the whole driver is C, and I’m not good at writing flawless C. Combine that with the problem actually being hard…

(“Always read from SSD” isn’t a valid solution here.)

"Always read from SSD" seems like what you'd want, no?

I have something on the back burner to start benchmarking devices at format time, which would let us start making better decisions in these situations.

Sorry to say, I have some old SSDs that are only 3-4 times faster than the HDDs. Especially when there are a lot of HDDs in the pool, ignoring them could be leaving a lot of performance on the floor.

Though it would be an improvement over what I saw last time I tried, sure.

Oh, that is tricky. If you want to play around with the algorithm that picks which device to read from, it's in fs/bcachefs/extents.c:

  static inline bool ptr_better(struct bch_fs *c,
                              const struct extent_ptr_decoded p1,
                              const struct extent_ptr_decoded p2)
  {
        if (likely(!p1.idx && !p2.idx)) {
                u64 l1 = dev_latency(c, p1.ptr.dev);
                u64 l2 = dev_latency(c, p2.ptr.dev);

                /* Pick at random, biased in favor of the faster device: */

                return bch2_rand_range(l1 + l2) > l1;
        }

        if (bch2_force_reconstruct_read)
                return p1.idx > p2.idx;

        return p1.idx < p2.idx;
  }
Perhaps just squaring the device latencies would balance things out more the way we want.
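For a feel of what squaring would do to the odds, here is a rough userspace sketch. It is not bcachefs code: `p_pick_dev1` is a hypothetical helper, it only models the two-device case, and it assumes `bch2_rand_range(n)` is roughly uniform over [0, n), so that device 1 wins with odds of about l2 : l1.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not part of bcachefs: chance that device 1 wins a
 * two-way pick when each device is weighted by the other's latency.
 * Latencies are doubles here to keep the arithmetic simple; squaring u64
 * latencies in the kernel could overflow and would likely need scaling.
 */
static double p_pick_dev1(double l1, double l2, bool squared)
{
    assert(l1 > 0.0 && l2 > 0.0);

    double w1 = squared ? l1 * l1 : l1;
    double w2 = squared ? l2 * l2 : l2;

    /* Same shape as rand(l1 + l2) > l1: device 1 wins with odds w2 : w1. */
    return w2 / (w1 + w2);
}
```

With an SSD that is 4x faster than the HDD (l1 = 1, l2 = 4), the linear weighting sends about 80% of reads to the SSD and the squared weighting about 94%. At 3x it is 75% versus 90%. So squaring leans harder on the fast device while still giving the slow ones some work.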
I remember this code!

If we're talking about my desktop, its current configuration is 3x 2TB NVMe (configured as ZFS cache) plus 2x 12TB HDDs (mirrored). I've set sync=disabled, with transaction groups committing every 10 minutes (this is fine for my use case), so the HDDs spend most of their time spun down.

I only actually have 4TB of data on the system. It keeps growing, but the working set is probably much less than that.

Which means it's 100% cached. A single read sent to the HDDs would have a latency of multiple seconds; absolutely catastrophic for a desktop workload. In this case _always_ using the cache _is_ the right answer, but I've been trying to think of an algorithm that would be able to do so without hardcoding it.
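As a back-of-the-envelope check on how often that would happen, take the two-device case of ptr_better above and assume bch2_rand_range(n) is roughly uniform over [0, n). The latencies are made up for illustration, not measured: 100 µs for the NVMe cache and 5 s for a spun-down HDD, and this further assumes the tracked device latency actually reflects the spin-up time.

```latex
\begin{align*}
P_{\text{linear}}(\text{HDD})
  &\approx \frac{l_{\text{ssd}}}{l_{\text{ssd}} + l_{\text{hdd}}}
   = \frac{10^{-4}\,\text{s}}{10^{-4}\,\text{s} + 5\,\text{s}}
   \approx 2 \times 10^{-5} \\
P_{\text{squared}}(\text{HDD})
  &\approx \frac{l_{\text{ssd}}^{2}}{l_{\text{ssd}}^{2} + l_{\text{hdd}}^{2}}
   = \frac{10^{-8}\,\text{s}^2}{10^{-8}\,\text{s}^2 + 25\,\text{s}^2}
   \approx 4 \times 10^{-10}
\end{align*}
```

On those assumptions, roughly one read in 50,000 would still stall for seconds under the current weighting. Squaring the latencies shrinks that to about one in 2.5 billion, but it never reaches zero.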

Is there not some sort of standardized, stringent filesystem test yet? Like Jepsen is for databases? If passed, one can be sure it is reasonably free from bugs? Guess not.
The thing is that filesystems are inherently stateful, so the same test might trigger different edge cases depending on the state of the fs.
Databases are for saving state, no?