by grifferz 2108 days ago
Hi, article author here. I just tried that out on an Ubuntu 18.04 machine but it didn't work:

  $ sudo mdadm --fail /dev/md0 /dev/sdb1 --remove /dev/sdb1 --re-add /dev/sdb1 --update=no-bbl
  mdadm: set /dev/sdb1 faulty in /dev/md0
  mdadm: hot removed /dev/sdb1 from /dev/md0
  mdadm: --re-add for /dev/sdb1 to /dev/md0 is not possible
  $ sudo mdadm --add /dev/md0 /dev/sdb1 --update=no-bbl
  mdadm: --update in Manage mode only allowed with --re-add.
  $ sudo mdadm --add /dev/md0 /dev/sdb1
  $ sudo mdadm --examine-badblocks /dev/sdb1
  Bad-blocks list is empty in /dev/sdb1
Any ideas why? md0 is a simple RAID-1 metadata version 1.2 array.

Hmm, maybe there's a race condition that sometimes breaks --re-add when it runs immediately after --remove? It might be worth splitting it into two commands, i.e.:

  mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
  mdadm /dev/md0 --re-add /dev/sdb1 --update=no-bbl
I have a simple RAID-1 with metadata version 1.2, too.

My testing was done on Debian unstable (Linux v5.8, mdadm v4.1).

I added

https://raw.githubusercontent.com/prgmrcom/ansible-role-mdad...

for you.

Apologies, I haven't gotten to your PRs yet, but there is now a ticket in our internal development queue to review and merge them.

Interesting. What's the sfdisk for? Is that why my attempts to use --re-add aren't working?

I also tried "mdadm --zero-superblock /dev/sdb1" to make mdadm forget that it was ever an array member, but that didn't get me any further.

I use Debian so this won't help me directly, but once I work out why I can't re-add it will be possible to use something similar in a postinst hook to rebuild all the arrays.

The sfdisk is because as part of our kickstart file we only create the first partition. After removing the disk from the raid we repartition it.

I don't see why this couldn't fix up a RAID device generated by the Debian installer. The device name could be parameterized if it's not always the same.

We run this script before there's any real data on the device so loss of redundancy for a brief moment is not a huge deal - md0 is a very small device so it doesn't take long to resync.
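
As a rough illustration of the parameterization idea, here is a minimal sketch. It is not the linked script: the function name print_bbl_removal_cmds is made up for this example, and it only prints the two-step command sequence quoted earlier in the thread for a given array and member, so the output can be reviewed before anything is run.

```shell
# Hypothetical dry-run helper (not part of mdadm or the linked script):
# print, but do not execute, the fail/remove/re-add sequence for one
# array and one member device.
print_bbl_removal_cmds() {
    array=${1:-}    # e.g. /dev/md0
    member=${2:-}   # e.g. /dev/sdb1
    if [ -z "$array" ] || [ -z "$member" ]; then
        echo "usage: print_bbl_removal_cmds ARRAY MEMBER" >&2
        return 1
    fi
    # Step 1: mark the member faulty and remove it from the array.
    printf 'mdadm %s --fail %s --remove %s\n' "$array" "$member" "$member"
    # Step 2: re-add it, asking mdadm to drop the bad block list.
    printf 'mdadm %s --re-add %s --update=no-bbl\n' "$array" "$member"
}

# Example: print_bbl_removal_cmds /dev/md0 /dev/sdb1
```

Printing instead of executing is a deliberate choice here, since the sequence briefly removes redundancy from the array.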

But why do you repartition it?

Doesn't putting the exact same partition table on a device that already has a partition table result in no actual changes?

We repartition it because we're adding partitions.

We have one kickstart file that we use regardless of medium type. For SSDs, we overprovision and leave unused space at the end. Some brands of SSDs were failing before we did that. We don't need to overprovision hard drives.

You could remove the repartitioning and it would do the right thing for your use case.

Ah okay. It seems that my re-adds were failing on arrays that don't have a bitmap. I was testing it on small arrays that don't get a bitmap by default.
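
If the missing bitmap is indeed the cause, a script could check for one before attempting the re-add. The sketch below assumes the kernel exposes the bitmap location at /sys/block/<md>/md/bitmap/location and reports "none" when there is no bitmap; the function name md_has_bitmap and its optional second argument (an alternate sysfs root, useful only for testing) are made up for this example.

```shell
# Hypothetical helper: succeed if the named md array has a write-intent
# bitmap, based on the sysfs attribute described above.
#   $1  md device name without /dev/, e.g. md0
#   $2  sysfs root (assumption for testability; defaults to /sys)
md_has_bitmap() {
    md_name=${1:-}
    sysfs_root=${2:-/sys}
    location_file="$sysfs_root/block/$md_name/md/bitmap/location"
    # Return 2 if the attribute cannot be read (no such array, etc.).
    [ -r "$location_file" ] || return 2
    location=$(cat "$location_file")
    # Any value other than "none" is treated as "a bitmap is present".
    [ "$location" != "none" ]
}

# Usage sketch (not run here):
#   if md_has_bitmap md0; then
#       mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
#       mdadm /dev/md0 --re-add /dev/sdb1 --update=no-bbl
#   else
#       echo "md0 has no bitmap; --re-add may be refused" >&2
#   fi
```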