| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by caraboga 4294 days ago

Having inherited an openstack install with the swift implementation implemented on earlier generation of pods, these are the issues I and my cohorts ran into:

1) The object replication between the pods will be hard on the array, as the object replication is simply an rsync wrapped inside nested for loops. If you have a bunch of small files with a lot of tenants, it'll hurt.

2) While swift allows you to simply unmount a bad disk, change around the ring files and let replication do its thing, there are actual issues. First of all, out of band smart monitoring of bad sectors actually cases the disk to pre-empt some sata commands and do the smart checks first. On a heavily loaded cluster, under a smart check for a bad block count could kill the drive, and take out the sata controller along with it. We've downed storage pods before by that way. The only way we got around it was to take a pod offline once a week, run all our smart checks, then put it back.

3) To replace a drive, you have to open the machine up and power it off. As any old operator will tell you, drives that have been running for awhile do not like to be turned off. If you are to power off a machine to replace a bad drive, do realize that you might actually break more drives just from the power cycle.

4) Once you change a drive out and use the ring replication of files to rebuild your storage pods, your entire storage cluster will take a non-trivial hit.

5) Last but not least, it is almost of tantamount importance to move the proxy, account and container services away from the hardware that also host the object servers. It's probably good to note that the account and container meta data is stored in sqlite files, with fsync writes. If you add and remove a bunch of files to multiple accounts, the container service is going to get hit first, then the object service. Further more, every single transaction to every meta data/data service, including replication, is federated through the proxy servers. If you look at a swift cluster, the proxy services take up a large chunk of the processing space.

Source: Was Openstack admin for a research group in Middle Tennessee. Ran an openstack cluster with 5 gen1.5 pods for the entire swift service, then moved account/container/proxy to three 12 core 2630s with 64 gigs of ram. Cluster was for a DARPA vehicular design project, first part fielded 3000 clients, second part fielded about 1/10 of that, but with more files and bigger files (these were cad and test results respectively).

2 comments

rdkls 4294 days ago

Thanks caraboga. I was contemplating this setup a year or so ago, and great to learn your insights from actual implementation. It looked to me like Supermicro JBOD enclosures would be superior (mainly hot swappability) if a bit more expensive, would you agree?

link

caraboga 4293 days ago

My predecessor went with Backblaze clones for the drive and mb enclosure. You'll run into issues with the backblaze way as they have two power supplies, but one is for the motherboard and the boot drives, and the other power supply is for the actual drives. Further more, as this was a backblaze pod clone, there's no ipmi on the pod motherboards. It makes certain things a bit more annoying than they have to be.

If I were to do it again, I would stay away from using enclosures that were inspired by the BackBlaze models. Supermicro enclosures are fine.

link

rincebrain 4293 days ago

I'd be curious to know what drives and controllers you had that pre-empted requests due to SMART queries...

link

caraboga 4293 days ago

I don't recall the model, but they were 3 terabyte Seagates that were bought 3 years ago. I think my predecessor employed the same controller cards as version 1 of the pods. You could tell that the smart queries pre-empted normal io as the smart query would be executed, and disk activity for certain swift object servers would just stall. The object-servers would not return requests for at least some of the drives until the smart query was finished.

Curiously enough, I've also ran smartctl against Samsung 840Pros using Highpoint RocketRaid controller cards during the same project. Sometimes this crashes the controller card.

(This was for a gluster cluster, when gluster had broken quorum support. A copy of a piece of data, when out of sync with the other copies in a replica set, would be left unchecked. This was produced after one smartctl command took out the controller card, and then the machine along with it.)

link