Recover a VCF vSAN Cluster from a power outage

I had a power outage, and my VCF environment went down. When I tried to power it back on, the vSAN cluster was gone and no VMs were available. Only 2 of the 4 ESXi hosts in the cluster were configured correctly with the vSAN datastore.

That meant no vCenter, SDDC Manager, or NSX was available to bring VCF back up.

Checking the vSAN cluster, I found issues with the network and with cluster partitions.

Run the command esxcli vsan health cluster list -w on an ESXi host to check the issues in the vSAN cluster.

The output shows issues in the network and partition health checks.

Next, I checked the vSAN cluster membership on the ESXi hosts that had the issue (01 and 02) with the command esxcli vsan cluster get.

As suspected, they do not belong to the main vSAN cluster. Each one sits in its own isolated cluster and is the master of that cluster, which means these hosts are not communicating with the other ESXi hosts.
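To make this check quicker to repeat on each host, the two interesting fields can be pulled out of the command output. This is a minimal sketch, assuming the output of `esxcli vsan cluster get` contains the `Local Node State` and `Sub-Cluster Member Count` lines (labels may differ between ESXi versions). The function name `vsan_membership` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: summarize vSAN membership from the output of
# `esxcli vsan cluster get`, read on stdin (so saved output works too).
# Assumes the "Local Node State" and "Sub-Cluster Member Count" labels.
vsan_membership() {
    awk -F': *' '
        /Local Node State:/         { state = $2 }
        /Sub-Cluster Member Count:/ { count = $2 }
        END { printf "state=%s members=%s\n", state, count }
    '
}

# Usage on an ESXi host:
#   esxcli vsan cluster get | vsan_membership
```

In a 4-host cluster, an isolated host would report a member count of 1, while a healthy host would report 4.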

ESXi hosts 03 and 04 are OK and belong to the right vSAN cluster. I also double-checked the unicast configuration on all ESXi hosts with the command esxcli vsan cluster unicastagent list to see if there were any issues.

There were no issues with unicast: every ESXi host points to all the other ESXi hosts (a host never lists itself), so the problem had to be in the network.
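A quick way to confirm that a host lists the expected number of peers is to count the rows in the unicast agent table. This is a rough sketch, assuming each peer row of `esxcli vsan cluster unicastagent list` starts with a node UUID. The function name `count_unicast_peers` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count the peer rows in the output of
# `esxcli vsan cluster unicastagent list`, read on stdin.
# Assumes every peer row begins with a node UUID (hex digits and dashes),
# so the header and the dashed separator line are not counted.
count_unicast_peers() {
    grep -c -E '^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-'
}

# Usage on an ESXi host:
#   esxcli vsan cluster unicastagent list | count_unicast_peers
```

With 4 hosts in the cluster, each host should report 3 peers.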

Back on the non-working ESXi hosts, I ran the command vmkping -I vmk2 192.168.10.138 (repeat it for the vSAN IP of every ESXi host in the cluster), and ESXi 01 and 02 could not ping the rest of the ESXi hosts.

Since this is a nested environment, I removed the vSAN network from all broken ESXi hosts and added it again. I reran vmkping, and connectivity was back.

It is strange, but I have seen this before, not only in nested environments but also in physical ones. There, we need to unplug the cable (or disable the switch port) and plug it back in (or re-enable the port) before the ESXi host port somehow has network connectivity again.
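Pinging each peer by hand gets tedious, so the test can be wrapped in a small loop. This is a minimal sketch, assuming vmk2 is the vSAN VMkernel interface as in my lab. The function name `check_vsan_peers` is hypothetical, and you need to supply the vSAN IP addresses of your own hosts.

```shell
#!/bin/sh
# Hypothetical helper: ping every vSAN peer through a VMkernel interface.
# $1 = VMkernel interface (for example vmk2)
# remaining arguments = vSAN IP addresses of the other ESXi hosts
check_vsan_peers() {
    iface=$1
    shift
    failed=0
    for ip in "$@"; do
        # -I selects the outgoing VMkernel interface, -c limits the ping count
        if vmkping -I "$iface" -c 3 "$ip" >/dev/null 2>&1; then
            echo "OK   $ip"
        else
            echo "FAIL $ip"
            failed=1
        fi
    done
    return "$failed"
}

# Usage on an ESXi host (add the vSAN IPs of the remaining hosts):
#   check_vsan_peers vmk2 192.168.10.138
```

The function returns a non-zero status if any peer is unreachable, so it can also be used in a script.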

The next step is to remove the broken ESXi hosts from their isolated vSAN clusters with the command esxcli vsan cluster leave, and then add them back with the command esxcli vsan cluster restore.

The ESXi host is now in the proper vSAN cluster together with the other two (03 and 04). After following the same process on all affected ESXi hosts and rebooting, the vSAN cluster was back and VCF was working again.
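The leave and restore steps can be sketched as one small function to run on each broken host. It stops at the first failing command and finishes by printing the cluster state, so you can confirm the member count. The function name `rejoin_vsan_cluster` is hypothetical, and I am assuming the restore command reapplies the vSAN cluster configuration saved on the host.

```shell
#!/bin/sh
# Hypothetical helper: take the host out of its isolated cluster and
# bring it back into the cluster from its saved configuration (assumption),
# then print the resulting membership for verification.
rejoin_vsan_cluster() {
    esxcli vsan cluster leave   || return 1
    esxcli vsan cluster restore || return 1
    esxcli vsan cluster get
}

# Usage on each broken ESXi host, only after the vSAN network is reachable:
#   rejoin_vsan_cluster
```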

The VMs and the vSAN datastore were back. After running the vSAN health check, many objects in vSAN were out of sync. But as long as a complete copy of every object of each VM is still available on the hosts, there is no data loss.

In this case, we run the resync option, which repairs the absent and out-of-sync objects. Depending on the data size, this can take several hours.

After some hours, double-check if your objects are all green.
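If vCenter is not handy, the object state can also be checked from the ESXi shell. This is a rough sketch, assuming your ESXi version provides `esxcli vsan debug object health summary get` and that its output is a two-column table of health status and object count. The function name `count_unhealthy_objects` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: add up the objects in every state other than
# "healthy" from the output of
# `esxcli vsan debug object health summary get`, read on stdin.
# Assumes each data row ends with a numeric object count.
count_unhealthy_objects() {
    awk '
        $NF ~ /^[0-9]+$/ && $1 != "healthy" { total += $NF }
        END { print total + 0 }
    '
}

# Usage on an ESXi host:
#   esxcli vsan debug object health summary get | count_unhealthy_objects
```

A result of 0 should correspond to all objects being green in the health check.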

Once all objects were green, I could finally start working on my VCF environment and implement HCX to test some migrations.

About the Author: Luciano Patrao

I have over 20 years of experience in the IT industry. I have been working with Virtualization for more than 15 years (mainly VMware). I recently obtained certifications, including VCP DCV 2022, VCAP DCV Design 2023, and VCP Cloud 2023. Additionally, I have VCP6.5-DCV, VMware vSAN Specialist, vExpert vSAN, vExpert NSX, and vExpert Cloud Provider for the last two years, vExpert for the last 7 years, and an old MCP. My specialties are Virtualization, Storage, and Virtual Backup. I am a Solutions Architect in the areas of VMware, Cloud, and Backup/Storage. I am employed as a Senior Consultant by ITQ, a VMware partner. I am also a blogger and owner of the blog ProVirtualzone.com.