by toast0 1833 days ago
This article is talking about large-scale, multi-node distributed systems: hundreds of rack-sized systems. In those systems, you often don't need explicit disk redundancy, because you have data redundancy across nodes with independent disks.

This is a good insight, but you need to be sure the disks are independent.
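To put rough numbers on why independence matters, here is a minimal sketch. The failure probabilities and the `loss_probability` helper are made up for illustration and are not from the article; the model simply assumes one common-cause event (same firmware bug, same power feed, same bad batch) that takes out every copy at once.

```python
def loss_probability(p_disk: float, replicas: int, p_shared: float = 0.0) -> float:
    """Probability that every replica is lost in the same period.

    p_disk   : chance a single disk fails on its own (assumed equal for all disks)
    replicas : number of copies, each on a different disk
    p_shared : chance of a common-cause event that destroys all copies together
               (hypothetical parameter; 0.0 means the disks are truly independent)
    """
    # With truly independent disks, all copies fail together with probability p^n.
    independent = p_disk ** replicas
    # Either the shared event happens, or it does not and every disk fails by itself.
    return p_shared + (1.0 - p_shared) * independent


if __name__ == "__main__":
    # Illustrative values only: 2% per-disk failure, 3 replicas.
    print(loss_probability(0.02, 3))               # independent disks
    print(loss_probability(0.02, 3, p_shared=0.001))  # small common-cause risk
```

Under these assumed numbers, even a small shared-failure probability dominates the result, which is the whole point of checking that the disks really are independent.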


Well, most often HBAs and RAID controllers are another thing that increases latency and makes maintenance costs go up quite a bit (more stuff to update), and they're also another part that can break.

That's why it's not recommended when running Ceph.

I'm pretty sure discrete HBAs / Hardware RAID Controllers have effectively gone the way of the dodo. Software RAID (or ZFS) is the common, faster, cheaper, more reliable way of doing things.
Don’t lump HBAs and RAID controllers together. The former is just PCIe to SATA or SCSI or whatever (otherwise it is not just an HBA, but indeed a RAID controller). Such a thing is still useful, and perhaps necessary, for software RAID if there are insufficient ports on the motherboard.
Hardware RAID doesn't seem to be going away quickly. Since they're almost all made by the same company, and they can usually be flashed to be dumb HBAs, it's not too bad. But it was pretty painful when using managed hosting: the menu options with lots of disks all came with RAID controllers that are a pain to set up, and I'm not going to reflash their hardware. (I did end up doing some SSD firmware updates myself, because firmware bugs were causing issues and their firmware upgrade scripts weren't working well and were tremendously slow.)
ZFS needs HBAs. Those get your disks connected but otherwise get out of the way of ZFS.

But yes, hardware RAID controllers and ZFS don't go together.

Hardware caching RAID controllers do have the advantage that if power is lost, the cache can still be written out without the CPU or software having to do it. This lets you safely run without a write-through cache and still get fast fsync. This was a common spec for provisioned bare-metal MySQL servers I'd worked with.
The entire comment thread of this article is on-prem, low-scale admins and high-scale cloud admins talking past each other.

You can build in redundancy at the component level, at the physical computer level, at the rack level, at the datacenter level, at the region level. Having all of them is almost certainly redundant and unnecessary at best.

Sometimes. Other times they may make things worse by lying to the filesystem (and thereby also the application) about writes being completed, which may confound higher-level consistency models.
It does seem to me that it's much easier to reason about the overall system's resiliency when the capacitor-protected caches are in the drives themselves (standard for server SSDs) and nothing between that and the OS lies about data consistency. And for solid state storage, you probably don't need those extra layers of caching to get good performance.
Since my experience was from a number of years back, I tried searching for more recent reports: "mysql ssd fsync performance". The top recent one I found was for DigitalOcean[0] in 2020. It says "average of about 20ms which matches your 50/sec" and mentions battery-backed controllers, which weren't even in my search terms.

[0] https://www.digitalocean.com/community/questions/poor-fsync-...
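The "20ms matches 50/sec" arithmetic is just 1 / latency for a workload that does one fsync per commit, with nothing batched. A rough sketch for checking this on your own storage follows; the function names and the 512-byte payload are my own choices, not anything from the linked thread, and a real database batches commits, so treat the result as a ceiling estimate only.

```python
import os
import tempfile
import time


def measure_fsync(path_dir=None, writes=50, payload=b"x" * 512):
    """Return the mean seconds per write+fsync on a temp file in path_dir.

    path_dir=None uses the default temp directory; point it at the volume
    you actually care about, since that is what gets measured.
    """
    fd, name = tempfile.mkstemp(dir=path_dir)
    try:
        start = time.perf_counter()
        for _ in range(writes):
            os.write(fd, payload)
            os.fsync(fd)  # ask the OS to push the data to stable storage
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(name)
    return elapsed / writes


def commits_per_second(mean_latency_s):
    """Upper bound on serial commits if each one waits for a single fsync."""
    return 1.0 / mean_latency_s


if __name__ == "__main__":
    latency = measure_fsync()
    print(f"mean fsync latency: {latency * 1000:.3f} ms")
    print(f"serial commits/sec: {commits_per_second(latency):.0f}")
```

If a controller or drive acknowledges the flush from a volatile cache, this number will look great while telling you nothing about durability, which is the concern raised upthread.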

I would be worried about my data being held hostage by a black-box proprietary RAID controller from a hostile manufacturer (unless you're paying them millions to design and build you a custom product, at which point you may have access to internal specs and a contact within their engineering team to help you).

I'd rather have ZFS or something equivalent in software. Software can be inspected and is (hopefully) battle-tested for years by many different companies with different workloads and requirements. In the worst case, because it's software, you can freeze the situation in time by taking byte-level snapshots of the underlying drives, plus a copy of the software, for later examination or reverse-engineering. You can't do that with a hardware black box, where you're bound to the physical hardware and often have a single shot at a recovery attempt (as it may change the state of the black box).
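By "byte-level snapshot" I mean something like the sketch below: copy the raw device to an image file and record a checksum so you can later prove the image hasn't changed. The `snapshot` function and its parameters are illustrative; on a real system you'd need root, the device should not be mounted read-write, and a tool like dd or ddrescue is the usual choice (especially for failing disks, since this sketch does nothing about read errors).

```python
import hashlib


def snapshot(source_path, image_path, chunk_size=1024 * 1024):
    """Copy source_path to image_path byte for byte; return the SHA-256 hex digest.

    source_path can be a regular file or (with enough privileges) a block
    device node. The source is opened read-only, so it is never modified.
    """
    digest = hashlib.sha256()
    with open(source_path, "rb") as src, open(image_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # end of file or device
                break
            dst.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()
```

All further recovery experiments then run against copies of the image, so a failed attempt costs nothing.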

Have you heard of the SSD failures about a decade ago where the SSD controller's firmware had a bug that bricked the drive past a certain lifetime? The data is technically still there, and would be recoverable if you could bypass the controller or fix its firmware, but unless you had a very good relationship with the manufacturer of the SSD to gain access to the internal tools and/or source code to allow you to tinker with the controller you were SOL.

It was RAID-1, so there's no data manipulation going on: just a simple mirror copy with double the read bandwidth.