Typically, replacing drives in Nutanix is easy. If a drive fails, you replace it… Prism sucks it in and it’s ready to go in minutes. If you’re upgrading to a brand new drive, you remove the old drive… wait half a day while the data migrates off, and put the new drive in.
The challenge begins when you start re-purposing drives from another cluster, or when you’re running 5.9.2.3 (like I was) and run into annoying little bugs with tasks that should be easy. Don’t worry, I raised the bugs with support. This article may be largely irrelevant later this year.
For the purposes of this article, I’m assuming you run AHV on everything, like I do. I’m also assuming that you’re still going to open a ticket with support, because you should let them know about problems. How else are they supposed to address them?! I open a LOT of tickets and I don’t feel bad about it. Nutanix support continues to be the best support I’ve ever received for any technology product.
Re-using SATA drives
In my scenario, I was upgrading our primary cluster’s SATA tier and using the drives we pulled out as hand-me-down upgrades for our ROBO clusters. This process isn’t terrible with SATA, but there is an extra step. After inserting the new drive, you’ll notice that it isn’t automatically added back in. You must click “Repartition and Add.”
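If you want to sanity-check that the node actually sees the new drive before you click the button, a quick look from that node’s CVM will do it. list_disks was available on my AOS version; treat it (and lsscsi as a fallback) as an assumption for yours:
list_disks
lsscsi
The new drive’s serial should show up in the expected slot; it just won’t be mounted or in the cluster configuration until you do the “Repartition and Add.”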
Captivating. Let’s move on.
Re-using SSD drives
In a hybrid system, the CVMs run off the SSD tier. Single SSD systems require a bit more work than dual SSD systems. The SSDs in a dual SSD system are mirrored, so you can remove one at a time and it’s not a big deal. The CVM stays up. It might reboot on its own; it’s okay, it’s just part of the process. I haven’t had a re-use case come up yet for the dual SSD systems, so I’m not sure how this process would be impacted.
For single SSD systems, you have to shut down the CVM before removing the drive. Once the drive is replaced, you initiate a “Repair Drive.” This is where things get fun.
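Quick aside: the clean way to take the CVM down is from the CVM itself with cvm_shutdown, which stops services gracefully before powering off (check data resiliency in Prism first). The flag below is what the standard procedure uses, but confirm it against the docs for your AOS version:
cvm_shutdown -P now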
If the drive replacement fails at 12%, which it did for me on only one cluster, for every single disk, the drive probably needs to be reseated (even if you thought you did a good job). I would load up the CVM console and watch the progress… it would fail to image because it couldn’t find the drive. Reseat the drive with the console open, and it should display some messages to confirm it sees the drive. Shut down the CVM and retry the repair.
Once you get past the 12% hurdle, the next hurdle is 52%. If your disk replacement fails at 52%, you’re probably in the same boat I was… you changed all your passwords like a good IT person should. There is apparently a bug in password handling for repair tasks that assumes default passwords are being used. Silly, but the solution is easy. Log in to a working CVM and issue the following command:
boot_disk_replace -i CVM_IP_NEEDING_REPAIR --hypervisor_password='HYPERVISOR_ROOT_PASSWORD'
This command assumes that the CVM being replaced is powered on and running a freshly imaged default configuration. It will apply the configuration and reboot the CVM.
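Once it comes back up, it’s worth confirming that services are at least trying to start before moving on. These are standard checks, run from the repaired CVM and from any other CVM respectively:
genesis status
cluster status
Don’t panic if the repaired node isn’t reporting healthy yet; the next part explains why.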
Now, since we’re talking about disk re-use, what you’ll notice is that the CVM services will start, but nodetool will report that the CVM is in a forwarding state. This is because sda4 still has a bunch of junk from the cluster you moved the disk from. A quick way to verify that this is the case is to log in to the freshly repaired CVM and run the following commands:
cat ~/data/logs/hades.out | grep 'Cleaning'
cat ~/data/logs/hades.out | grep 'not partitioned'
What you’ll find is something similar to the following:
INFO disk_manager.py:5004 Disk BTHC528102VG800NGN is not present in zeus configuration, but data is present. Cleaning of disk is required
INFO disk_manager.py:4980 Disk BTHC528102VG800NGN is not present in zeus configuration and is not partitioned correctly. Repartition and mount required
This was enough verification for me, but you may want more. Tail hades.out and you’ll find the whole story.
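Nothing fancy needed; just tail the same log from the CVM:
tail -f ~/data/logs/hades.out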
INFO disk_manager.py:4820 Preparing hades proto with updated disk state
INFO disk_manager.py:4840 Only allow SED drives: False
INFO disk_manager.py:4946 Disk BTHC528102VG800NGN is not present in zeus configuration
INFO disk_manager.py:4988 Disk /dev/sda with serial BTHC528102VG800NGN is not stargate usable, unmounting
INFO disk_manager.py:922 Waiting for disk mount lock for unmount disk /dev/sda
INFO disk_manager.py:935 Unmounting partitions on disk /dev/sda
INFO disk_manager.py:947 Unmounting partition /dev/sda4 on path /home/nutanix/data/stargate-storage/disks/BTHC528102VG800NGN
INFO disk.py:464 Unmounting partition /dev/sda4
INFO disk_manager.py:4297 Led fault request for disks ['/dev/sda']
2019-03-19 23:01:36 INFO disk_manager.py:4563 Running LED command: '/home/nutanix/cluster/lib/lsi-sas/sas3ircu 0 locate 1:0 ON'
INFO disk_manager.py:5004 Disk BTHC528102VG800NGN is not present in zeus configuration, but data is present. Cleaning of disk is required
INFO disk_manager.py:4946 Disk BTHC528102VG800NGN is not present in zeus configuration
INFO disk_manager.py:4980 Disk BTHC528102VG800NGN is not present in zeus configuration and is not partitioned correctly. Repartition and mount required
To solve this issue, you’ll need to run a series of commands from the affected CVM. Don’t worry, since the CVM is in forwarding state, these commands will have no impact on the cluster.
genesis stop all
sudo /home/nutanix/cluster/bin/hades stop
sudo /home/nutanix/cluster/bin/clean_disks -p /dev/sda4
sudo /home/nutanix/cluster/bin/hades start
genesis start
This will wipe the data partition and allow the drive to be used by the CVM. The services will slowly restart. Monitor progress using nodetool. All CVMs should report ‘Normal’ status.
nodetool -h 0 ring
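If you’d rather not keep re-running that by hand, wrapping it in watch works fine; watch was present on my CVMs, but consider its availability an assumption for your build:
watch -n 30 "nodetool -h 0 ring"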
Missing Repair Drive Option
There were a few instances where replacing the drive did not give us a “Repair Disk” option. I assume this is a Prism UI bug. The drive was seated and present, but the option did not appear in the Hardware > Diagram view, even after clearing the browser cache multiple times and logging in and out a half dozen times.
Here is the CLI command to manually start a drive repair operation. Ensure that you have followed the appropriate steps: the affected CVM is shut down, and the drive has been swapped and seated properly. If so, issue the following command from any working CVM:
single_ssd_repair -s CVM_IP_NEEDING_REPAIR
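The repair kicked off this way still shows up as a task, so you can follow it in Prism. For a CLI view instead, progress_monitor_cli is the usual tool; the flag below is from memory, so verify it against your AOS version:
progress_monitor_cli --fetchall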
Hopefully this saves you a call or two to support. I think I opened about 5 or 6 tickets through this process. I came across every scenario imaginable. I’m heavily involved in both the TAB and NTC channels, so I’ll be working to get these ‘bugs’ addressed in future releases so they’re less of a pain. This was still a way better experience than I’m used to with other platforms. Nutanix <3
But WAIT, there’s more!
If you replace your entire SSD tier and find yourself unable to log in to Prism Element or your CVM because the password isn’t working anymore… chill. Everything got set back to factory defaults. Support is working on filing the bug report, but this happened to me on EVERY single SSD system I upgraded the SSD tier on. After the last SSD was replaced, both the CVM password and the Prism Element password were reset to defaults. Easy fix. Follow the password change procedure for the CVM; I was able to reset it back to what it was before (it apparently forgot the password history too). As far as Prism Element goes, log in with the default credentials and it’ll prompt you to change your password. Set it back to what it was, and you’re done.
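For reference, the two resets boil down to a couple of commands. The passwd step is just the normal Linux password change for the nutanix user, run on the affected CVM(s); the ncli syntax for resetting the Prism admin account is from memory, so treat it as an assumption and fall back to the documented procedure (or the UI prompt) if it doesn’t match your version:
sudo passwd nutanix
ncli user reset-password user-name=admin password='WHAT_IT_WAS_BEFORE'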