by zamadatix
675 days ago
> 128MB of L3 cache
Sure, if you use an X3D chip with the largest amount of L3 cache accessible to a single core of any option currently available, you can dedicate all 128 MB of it to the write buffer for your disk instead of letting that work be offloaded. Valid option, just as cool. I have a non-X3D 7950X, so I'm jealous though ;). You've also got the case of needing to pull a read up from the disk for modifications to sectors not cached by the system, so the CPU can perform the parity calc over the whole sector and issue the appropriate writes. Particularly bad for non-sequential IO writes.
> if it's 90% reduction in system DRAM utilization by RAID
Yes, this - not the other. It's achieved by not writing things back to RAM again before they hit the flash pool.
Ah, sorry, lscpu shows: L3: 64 MiB (2 instances)
I originally thought that meant 64MB x 2, but it means 64MB total (32MB x 2). Still, 64MB is about 500 times larger than a 128KB stripe, I/O normally happens on a wide variety of cores, and cache space should only be required for stripes that are in flight. Servers (normally with 5x or more cores than my 12-core desktop, and way more memory bandwidth: 24 channels instead of my 2) will have much more cache and much more bandwidth.
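The ratio is easy to sanity check. A tiny sketch, assuming MB and KB are read as binary units (MiB and KiB), which is how lscpu reports the cache:

```python
# Assumption: binary units, matching lscpu's "64 MiB" output.
L3_BYTES = 64 * 1024 * 1024      # total L3 reported by lscpu
STRIPE_BYTES = 128 * 1024        # stripe size used in this discussion

stripes_in_l3 = L3_BYTES // STRIPE_BYTES
print(stripes_in_l3)  # 512
```

So "500 times" is a round-down of 512 stripes fitting in L3 at once.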
> Yes, this - not the other. It's achieved by not writing things back to RAM again before they hit (comparatively slow to RAM) flash pool.
Why should the stripes be written to RAM? The write should enter kernel space (write is a system call), then the software RAID driver does the calculation, and then the write goes to the device's memory space. The PCIe-connected NVMe controller is not cache coherent and can't safely read main memory, which might be cached.
I took a closer look at the original post, and they seem to be considering the tiny write, which requires a read/modify/write. That operation is pretty inefficient, and Linux tries to avoid it with caching, but it certainly is needed sometimes. I've not seen any analysis of what fraction of I/O to production RAID systems is R/M/W instead of a normal read or write.
Even in the R/M/W case, a stripe is read by the software RAID driver, the write is masked onto the stripe, and a new checksum is calculated. Then the stripe is sent back to the I/O space of each involved NVMe controller. So a 4KB write (a common minimum size) requires reading 128-256KB, doing the checksum, and writing it back to the devices.
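To make the R/M/W steps above concrete, here is a minimal sketch in Python. It is not how any real driver is written: it assumes RAID-5 style XOR parity as the "checksum", a 128KB stripe made of four 32KB chunks, and a full-stripe write-back as described above. The names `xor_parity` and `read_modify_write` are made up for illustration.

```python
from functools import reduce

CHUNK = 32 * 1024        # assumed chunk size per data device
DATA_DEVICES = 4         # assumed: 4 data chunks -> 128 KiB stripe


def xor_parity(chunks):
    """Byte-wise XOR of equal-length chunks (RAID-5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def read_modify_write(chunks, offset, new_data):
    """Overlay new_data at a byte offset in the stripe and recompute parity.

    Returns (new_chunks, parity, bytes_read, bytes_written).
    """
    stripe = bytearray(b"".join(chunks))             # read the whole stripe
    if offset < 0 or offset + len(new_data) > len(stripe):
        raise ValueError("write does not fit inside one stripe")
    stripe[offset:offset + len(new_data)] = new_data  # mask the write in
    new_chunks = [bytes(stripe[i:i + CHUNK])
                  for i in range(0, len(stripe), CHUNK)]
    parity = xor_parity(new_chunks)                   # new "checksum"
    bytes_read = len(stripe)
    bytes_written = len(stripe) + len(parity)         # data chunks + parity
    return new_chunks, parity, bytes_read, bytes_written


# A 4 KiB write into a stripe filled with dummy data.
chunks = [bytes([i]) * CHUNK for i in range(1, DATA_DEVICES + 1)]
new_chunks, parity, read_b, written_b = read_modify_write(
    chunks, 8192, b"\xff" * 4096)
print(read_b // 1024, written_b // 1024)  # 128 160
```

Under these assumptions the 4KB write turns into 128KB read and 160KB written. A real implementation can often get away with touching only the modified chunk and the parity chunk, so treat this as the worst case of the scheme described above.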
It does tip the scales more towards hardware RAID, but that's always been true for hardware RAID, which very often ends up slower than software RAID for the previously discussed reasons.