In this post, How to fix vSAN Cluster hosts out of sync, I will explain how to fix this type of problem when it is triggered by network issues.
This week we had some issues in one of our vSAN vCenters: warnings about the network, and vSAN nodes that could not communicate with each other.
The vSAN network used DHCP provided by Juniper switches, and there was an issue with the time/date on the switches. The network team fixed it, but changing the switches' date/time caused all DHCP leases to expire. That changed the vSAN network IPs and triggered a lot of vSAN warnings.
Note: VMware recommends that vSAN VMkernel ports only use static IPs or DHCP with reservations. In my implementations, I always use static IPs for any VMkernel port. I never use DHCP, whether it is supported or not. This case is one reason why I don't like DHCP for this type of important VMkernel port.
In this case, since a new profile was applied to the switches to fix some of the errors, it may have wiped the DHCP reservations table. I was only helping to fix the issues; I was not responsible for the implementation.
The first task done by the vSAN team was to add the new vSAN IPs (after changing the vSAN VMkernel ports to static) to the unicast table on each vSAN node. But there were still some warnings, and vSAN was not working.
The new vSAN IPs were added to the unicast table using the command: esxcli vsan cluster unicastagent add -t node -u node-UUID -U true -a node-IP -p 12321
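Since the whole problem here came from UUIDs and IPs that did not belong together, it can help to sanity-check each pair before adding it. The following is only a minimal sketch: is_uuid, is_ipv4 and unicast_add_cmd are hypothetical helper names, not part of esxcli, and I am assuming node UUIDs use the usual 8-4-4-4-12 hex layout. The function only prints the command, it does not run it.

```shell
# Hypothetical helpers (not part of esxcli): validate the inputs and
# print a "unicastagent add" command for one peer node.

is_uuid() {
    # Assumption: node UUIDs use the 8-4-4-4-12 hex layout.
    printf '%s\n' "$1" | grep -Eq '^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$'
}

is_ipv4() {
    # Shape check only (four dot-separated numbers); octet ranges are not verified.
    printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

unicast_add_cmd() {
    uuid=$1
    ip=$2
    is_uuid "$uuid" || { echo "bad UUID: $uuid" >&2; return 1; }
    is_ipv4 "$ip"   || { echo "bad IP: $ip" >&2; return 1; }
    # -t node: entry type, -u: peer UUID, -U true: peer supports unicast,
    # -a: peer vSAN IP, -p: port used in this article (12321)
    printf 'esxcli vsan cluster unicastagent add -t node -u %s -U true -a %s -p 12321\n' "$uuid" "$ip"
}

# Example (placeholder values):
#   unicast_add_cmd 11111111-2222-3333-4444-555555555555 192.0.2.11
```

Printing the command instead of executing it leaves room to review it before pasting it into the ESXi shell.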
We also need to set IgnoreClusterMemberListUpdates to 0 on all vSAN nodes so that vCenter can push the changes. This is done by running the command: esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates
But in vSAN, we still get the warning:
When I took over and started troubleshooting, I found several issues.
First, I checked the health of the vSAN Cluster network by running the following command: esxcli vsan health cluster get -t "ESXi vSAN Health service installation"
Normally, this command returns the vSAN IPs of all nodes and the status of each node (all should be green).
Instead of the IPs of all the vSAN nodes, I only got the IP of the local vSAN node.
Next, I checked the vSAN Cluster network connectivity by running the following command: esxcli vsan health cluster get -t "Hosts with connectivity issues"
Note: Since this is a sensitive area and production system, for security reasons, I will blur some of the information.
This is what I get:
None of the IPs in the above results were the current ones; they were the old IPs from before the switch issue.
So the vSAN nodes definitely had no connectivity over the vSAN network.
The next step is to list the unicast table and check whether all the entries are present and correct. This is done by running the following command: esxcli vsan cluster unicastagent list
Check the vSAN node UUID: esxcli vsan cluster get | grep "Local Node UUID:" and double-check the vSAN node IPs: esxcli network ip interface ipv4 get
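When you have to repeat these two checks on every node, small filters make it easier to collect one UUID and IP per host. This is a rough sketch under assumptions: local_node_uuid and vmk_ipv4 are hypothetical helper names, and I am assuming the UUID is the last field of the "Local Node UUID:" line and that the interface name and IPv4 address are the first two columns of the ipv4 listing.

```shell
# Hypothetical filters that read command output from stdin.

# Print the UUID from the "Local Node UUID:" line.
local_node_uuid() {
    awk '/Local Node UUID:/ { print $NF; exit }'
}

# Print "vmkN IP" for every VMkernel interface row
# (assumes name and IPv4 address are the first two columns).
vmk_ipv4() {
    awk '$1 ~ /^vmk[0-9]+$/ { print $1, $2 }'
}

# On a vSAN node (not executed here):
#   esxcli vsan cluster get | local_node_uuid
#   esxcli network ip interface ipv4 get | vmk_ipv4
```

You still need to know which vmk interface carries vSAN traffic on your hosts; the filter lists all of them.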
As we can see above, there are a lot of entries, and many of them are mismatched. Even the IP of this vSAN node is assigned to a different UUID.
Checking a second vSAN node gave similar results: the UUIDs and IPs definitely don't match.
Looking at these unicast tables, this is the root cause of the issues: the unicastagent entries are wrong on all the vSAN nodes.
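Spotting mismatches by eye gets tedious with many nodes. As a sketch only, the comparison can be done with two small text files that you prepare by hand: one with the correct "UUID IP" pair per node, and one with the pairs currently in a node's unicast table. Both files and the find_mismatches name are assumptions for illustration; this does not parse the esxcli output directly.

```shell
# Hypothetical helper: compare the expected "UUID IP" pairs (file 1)
# against the pairs found in a unicast table (file 2).
# Note: the expected file must not be empty for the NR == FNR idiom to work.
find_mismatches() {
    awk '
        NR == FNR      { want[$1] = $2; next }                  # load expected pairs
        !($1 in want)  { print "UNKNOWN_UUID", $1, $2; next }   # UUID not in the cluster list
        want[$1] != $2 { print "WRONG_IP", $1, $2, "expected", want[$1] }
    ' "$1" "$2"
}

# Example:
#   find_mismatches expected.txt actual.txt
```

No output means every entry in the second file matches the expected pair for its UUID.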
To fix this issue, we first need to clean the unicast tables on all vSAN nodes.
How to do this? There are four steps.
- Remove all the entries with the following command: esxcli vsan cluster unicastagent remove -a node-IP
- Check the proper UUID for the node: esxcli vsan cluster get | grep "Local Node UUID:"
- Check the node's vSAN IP: esxcli network ip interface ipv4 get
- Add the vSAN IP and UUID of all the other vSAN nodes in the Cluster (do not add the IP+UUID of the node you are working on): esxcli vsan cluster unicastagent add -t node -u node-UUID -U true -a node-IP -p 12321
Do these four steps in all vSAN nodes.
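The four steps can be sketched as a small command generator, so the same inventory is reused on every node. This is an illustration under assumptions, not a tested tool: rebuild_plan is a hypothetical name, the inventory file (one "UUID IP" pair per node) is something you build yourself from steps two and three, and the stale IPs are the ones you read from the unicastagent list output. It only prints commands.

```shell
# Hypothetical helper: print the cleanup commands for one vSAN node.
# Usage: rebuild_plan LOCAL_UUID INVENTORY_FILE [STALE_IP ...]
rebuild_plan() {
    local_uuid=$1
    inventory=$2
    shift 2

    # Step 1: remove every stale entry, one IP at a time.
    for ip in "$@"; do
        printf 'esxcli vsan cluster unicastagent remove -a %s\n' "$ip"
    done

    # Step 4: add every node from the inventory except the local one.
    awk -v me="$local_uuid" '$1 != me {
        printf "esxcli vsan cluster unicastagent add -t node -u %s -U true -a %s -p 12321\n", $1, $2
    }' "$inventory"
}

# Example (placeholder values):
#   rebuild_plan 11111111-2222-3333-4444-555555555555 inventory.txt 10.0.0.5 10.0.0.6
```

Skipping the local UUID in the add step mirrors the rule above about not adding the node you are working on to its own table.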
After all vSAN nodes are clean and have the proper unicastagent settings (UUID and IP), double-check the vSAN health again.
esxcli vsan health cluster get -t "ESXi vSAN Health service installation" and esxcli vsan health cluster get -t "Hosts with connectivity issues"
As we can see in the next image, all IPs for all vSAN nodes are now listed, and there are no host connectivity issues.
Now go back to vCenter and run the Skyline Health check for the vSAN Cluster. After I reran the health check, I still got the warning.
This is because I forgot to change IgnoreClusterMemberListUpdates on all the vSAN nodes. So I ran the command: esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates on all vSAN nodes and rebooted all of them (not mandatory).
After all vSAN nodes were running again, I reran the option and the Skyline Health check, and all checks were green.
To make sure all was OK and I had no issues creating VMs in the vSAN, I ran the Proactive Tests, and both passed.
The final step is to change IgnoreClusterMemberListUpdates back to 1. Run the command: esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates
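Because this option was forgotten once already in this story, it is worth confirming the value on each node after changing it. A minimal sketch, assuming that esxcfg-advcfg -g prints a line of the form "Value of IgnoreClusterMemberListUpdates is N"; advcfg_value and expect_advcfg are hypothetical helper names.

```shell
# Hypothetical filters that read "esxcfg-advcfg -g" output from stdin.

# Print the value at the end of the "Value of ... is N" line
# (assumed output format).
advcfg_value() {
    awk '/^Value of / { print $NF; exit }'
}

# Compare the current value with the wanted one.
expect_advcfg() {
    want=$1
    got=$(advcfg_value)
    if [ "$got" = "$want" ]; then
        echo "OK: value is $got"
    else
        echo "MISMATCH: got '$got', want '$want'"
        return 1
    fi
}

# On a vSAN node (not executed here):
#   esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates | expect_advcfg 1
```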
After that last step, vSAN was ready, and all issues with the unicastagent were fixed. I hope this blog post, How to fix vSAN Cluster hosts out of sync, was useful and taught you how to fix similar issues with the vSAN network.
Share this article if you think it is worth sharing. If you have any questions or comments, comment here, or contact me on Twitter.