by Filligree 574 days ago
Apart from that, there are (or were, last I tried it six months ago) some performance bugs in the code.

Nothing that completely breaks it, but I found at the time that the high variance in read latency on Samsung 970 series NVMe drives causes the filesystem to also dispatch reads of cached data to the HDDs, even when the data is fully cached.

Which predictably increases latency a lot.

Really I should make another stab at fixing that, but the whole driver is C, and I’m not good at writing flawless C. Combine that with the problem actually being hard…

(“Always read from SSD” isn’t a valid solution here.)


"Always read from SSD" seems like what you'd want, no?

I have something on the back burner to start benchmarking devices at format time; that would let us make better decisions in these situations.

Sorry to say, I have some old SSDs that are only 3-4 times faster than the HDDs. Especially when there’s a lot of HDDs in the pool, ignoring them could be leaving a lot of performance on the floor.

Though it would be an improvement over what I saw last time I tried, sure.

Oh, that is tricky. If you want to play around with the algorithm that picks which device to read from, it's in fs/bcachefs/extents.c

  static inline bool ptr_better(struct bch_fs *c,
                              const struct extent_ptr_decoded p1,
                              const struct extent_ptr_decoded p2)
  {
        if (likely(!p1.idx && !p2.idx)) {
                u64 l1 = dev_latency(c, p1.ptr.dev);
                u64 l2 = dev_latency(c, p2.ptr.dev);

                /* Pick at random, biased in favor of the faster device: */

                return bch2_rand_range(l1 + l2) > l1;
        }

        if (bch2_force_reconstruct_read)
                return p1.idx > p2.idx;

        return p1.idx < p2.idx;
  }

Perhaps just squaring the device latencies would balance things out more the way we want.

I remember this code!

If we're talking about my desktop, its current configuration is 3x 2TB NVMe (configured as zfs cache) plus 2x 12TB HDDs (mirrored). I've set sync=disabled, with transaction groups committing every 10 minutes — this is fine for my use case — so the HDDs spend most of their time spun down.

I only actually have 4TB of data on the system. It keeps growing, but the working set is probably much less than that.

Which means, it's 100% cached. A single read sent to the HDDs would have a latency of multiple seconds; absolutely catastrophic for a desktop workload. In this case _always_ using the cache _is_ the right answer, but I've been trying to think of an algorithm that would be able to do so without hardcoding it.
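
To put rough numbers on the squaring idea from upthread, here is a small sketch of how often the slower device would be picked under each weighting. It assumes only two replicas, that `bch2_rand_range()` returns a uniform value below its argument, and it ignores integer rounding; the helper names (`slow_share_linear`, `slow_share_squared`, `print_shares`) are made up for illustration and are not part of bcachefs.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Fraction of reads sent to the slower of two devices when the pick is
 * weighted by latency the way the quoted ptr_better() does it:
 * P(pick slow) = fast / (fast + slow).
 */
static double slow_share_linear(uint64_t fast_ns, uint64_t slow_ns)
{
        double f = (double)fast_ns, s = (double)slow_ns;

        return f / (f + s);
}

/* Same rule with both latencies squared, as suggested in the thread. */
static double slow_share_squared(uint64_t fast_ns, uint64_t slow_ns)
{
        double f = (double)fast_ns, s = (double)slow_ns;

        return (f * f) / (f * f + s * s);
}

/* Print both shares for one pair of latencies (inputs are illustrative). */
static void print_shares(const char *label, uint64_t fast_ns, uint64_t slow_ns)
{
        printf("%-24s linear %.2e  squared %.2e\n", label,
               slow_share_linear(fast_ns, slow_ns),
               slow_share_squared(fast_ns, slow_ns));
}
```

For an SSD that is 3x faster than the HDD, `slow_share_linear(1, 3)` is 0.25 and `slow_share_squared(1, 3)` is 0.10, so the HDDs still carry a real share of reads. With an assumed 100 µs NVMe read against an assumed 5 s spin-up on the HDD, the linear rule sends roughly 1 read in 50,000 to the HDD and the squared rule roughly 4 in 10 billion. Squaring shrinks the share a lot but never takes it to zero, and all of this depends on the latency estimate actually reflecting the spin-up cost.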