How to fix vSAN Cluster hosts out of sync


In this post, I will explain how to fix vSAN cluster hosts that are out of sync, a problem that in this case was triggered by network issues.

This week we had some issues in one of our vSAN vCenters. There were warnings about the network, and the vSAN nodes could not communicate with each other.

The vSAN network uses DHCP provided by Juniper switches, and there was an issue with the time/date on the switches. The network team fixed it, but changing the switches’ date/time made all DHCP leases expire. As a result, the vSAN network IPs changed, which triggered a lot of vSAN warnings.

-----
Update 30/11/2020

Note:
VMware recommends that vSAN VMkernel ports use only static IPs or DHCP with reservations. In my implementations, I always use static IPs for every VMkernel port and never use DHCP, whether it is supported or not. This case is one example of why I don’t like DHCP for this type of important VMkernel port.
In this case, a new profile was applied to the switches to fix some of the errors, and that may have wiped the DHCP reservations table. I was only helping to fix the issue; I was not responsible for the implementation.
-----

The first thing the vSAN team did was change the vSAN VMkernel ports to static IPs and add the new vSAN IPs to the unicast table on each vSAN node. But there were still some warnings, and vSAN was not working.

The new vSAN IPs were added to the unicast table using the command: esxcli vsan cluster unicastagent add -t node -u node-UUID -U true -a node-IP -p 12321

We also need to set IgnoreClusterMemberListUpdates to 0 on all vSAN nodes so that vCenter can push the changes. This is done by running the command: esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates

But vSAN still showed the warning “vCenter state is authoritative”, even after Update ESXi Configuration was run. We had a “Hosts out of sync” issue.


When I took over the issue and started troubleshooting, I found several problems with vSAN.

First, I checked the health of the vSAN cluster network by running the following command: esxcli vsan health cluster get -t "ESXi vSAN Health service installation"

Normally this command returns the vSAN IPs of all nodes and the status of each node (all should be green).

Instead of the IPs of all vSAN nodes, I only got the current IP of the local vSAN node.

Next, I checked the vSAN cluster network connectivity by running the following command: esxcli vsan health cluster get -t "Hosts with connectivity issues"

Note: Since this is a sensitive area and a production system, I will blur some of the information for security reasons.

The output listed hosts with connectivity issues, but none of the IPs in the results were the current ones. They were all the old IPs from before the switch issue.

So there was definitely no connectivity between the vSAN nodes over the vSAN network.

The next step is to check the unicast table for problems and confirm that it has all the entries. This is done by running the following command: esxcli vsan cluster unicastagent list

Check the vSAN node UUID with esxcli vsan cluster get | grep "Local Node UUID:" and double-check the vSAN node IPs with esxcli network ip interface ipv4 get

The unicast table had a lot of entries, and many of them were mismatched. Even the IP of this vSAN node was assigned to a different UUID.
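Comparing UUIDs and IPs by eye gets tedious with many entries. Here is a minimal sketch of how the comparison could be scripted. The EXPECTED inventory and the function name check_unicast_table are hypothetical, the UUIDs and IPs are placeholders, and the sketch assumes that the first column of the list output is the node UUID and the fourth is the IP address, so check that against your own output first.

```shell
#!/bin/sh
# Hypothetical inventory of the correct pairs as "UUID=IP", space separated.
# Placeholder values; replace them with the real UUIDs and static IPs.
EXPECTED="aaaaaaaa-0000-0000-0000-000000000001=192.168.50.11 aaaaaaaa-0000-0000-0000-000000000002=192.168.50.12 aaaaaaaa-0000-0000-0000-000000000003=192.168.50.13"

# Reads "esxcli vsan cluster unicastagent list" output on stdin and prints
# OK or MISMATCH for every entry, followed by its UUID and IP.
# Assumption: column 1 is the node UUID and column 4 is the IP address.
check_unicast_table() {
    awk -v expected="$EXPECTED" '
        BEGIN {
            n = split(expected, pairs, " ")
            for (i = 1; i <= n; i++) {
                split(pairs[i], kv, "=")
                want[kv[1]] = kv[2]
            }
        }
        # Only data rows start with a UUID, so the header and the dashed
        # separator line are skipped.
        $1 ~ /^[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+$/ {
            status = (($1 in want) && want[$1] == $4) ? "OK" : "MISMATCH"
            print status, $1, $4
        }
    '
}

# On an ESXi host:
#   esxcli vsan cluster unicastagent list | check_unicast_table
```

Any line that starts with MISMATCH is an entry whose UUID is unknown or whose IP is not the one in the inventory.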

Checking a second vSAN node gave similar results: the UUIDs and IPs definitely did not match.

These unicast tables are the root cause of the issues: the unicastagent entries are wrong on all the vSAN nodes.

To fix this, we first need to clean the unicast tables on all vSAN nodes.

How to do this? It takes four steps.

  1. Remove every entry, one command per listed IP: esxcli vsan cluster unicastagent remove -a node-IP
  2. Check the proper UUID of the node: esxcli vsan cluster get | grep "Local Node UUID:"
  3. Check the vSAN IP of the node: esxcli network ip interface ipv4 get
  4. Add the vSAN IP and UUID of every other vSAN node in the cluster (do not add the IP+UUID of the node you are working on): esxcli vsan cluster unicastagent add -t node -u node-UUID -U true -a node-IP -p 12321

Repeat these four steps on all vSAN nodes.
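With more than a few nodes, typing the remove and add commands by hand is error-prone. The sketch below only prints the commands for one node so they can be reviewed before being pasted into the ESXi shell. The NODES inventory, the function names, and all UUIDs and IPs are hypothetical placeholders, and the remove helper assumes the IP address is the fourth column of the list output.

```shell
#!/bin/sh
# Hypothetical inventory: "UUID=IP" for every node in the cluster.
# Placeholder values; fill in the results of steps 2 and 3 from each node.
NODES="aaaaaaaa-0000-0000-0000-000000000001=192.168.50.11 aaaaaaaa-0000-0000-0000-000000000002=192.168.50.12 aaaaaaaa-0000-0000-0000-000000000003=192.168.50.13"

# Step 1: reads "esxcli vsan cluster unicastagent list" output on stdin and
# prints one remove command per entry.
# Assumption: column 4 of each data row is the IP address.
print_unicast_remove_cmds() {
    awk '$1 ~ /^[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+$/ {
        print "esxcli vsan cluster unicastagent remove -a " $4
    }'
}

# Step 4: prints one add command for every node in NODES except the local
# node, whose UUID is passed as the first argument.
print_unicast_add_cmds() {
    local_uuid=$1
    for pair in $NODES; do
        uuid=${pair%%=*}
        ip=${pair#*=}
        [ "$uuid" = "$local_uuid" ] && continue
        echo "esxcli vsan cluster unicastagent add -t node -u $uuid -U true -a $ip -p 12321"
    done
}

# On an ESXi host (the UUID is this node's "Local Node UUID"):
#   esxcli vsan cluster unicastagent list | print_unicast_remove_cmds
#   print_unicast_add_cmds aaaaaaaa-0000-0000-0000-000000000002
```

Printing instead of executing is a deliberate choice here: on a production cluster I would rather read every command before it touches the unicast table.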


After all vSAN nodes are cleaned and have the proper unicastagent settings (UUID and IP), double-check the vSAN health again:

esxcli vsan health cluster get -t "ESXi vSAN Health service installation" and esxcli vsan health cluster get -t "Hosts with connectivity issues"
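Since both checks are rerun on every node, they can be wrapped in one small function. This is only a sketch: the function name run_vsan_health_checks and the ESXCLI override variable are my own additions for illustration, not part of esxcli. Note that the test names need plain straight quotes in the shell.

```shell
#!/bin/sh
# Runs the two health checks used in this post, one after the other.
# ESXCLI can be overridden (for example with "echo") to do a dry run
# outside an ESXi host; that override is an addition for this sketch.
run_vsan_health_checks() {
    esxcli_bin=${ESXCLI:-esxcli}
    for test_name in "ESXi vSAN Health service installation" \
                     "Hosts with connectivity issues"; do
        echo "== $test_name =="
        "$esxcli_bin" vsan health cluster get -t "$test_name"
    done
}

# On an ESXi host:
#   run_vsan_health_checks
```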

This time the IPs of all vSAN nodes were listed, and there were no host connectivity issues.

Now go back to vCenter and run the Skyline Health check for the vSAN cluster. After I reran the health check, I still got the “vCenter state is authoritative” warning.

This is because I forgot to change IgnoreClusterMemberListUpdates on all the vSAN nodes. So I ran the command esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates on all vSAN nodes and rebooted them all (the reboot is not mandatory).

After all vSAN nodes were up and running, I reran Update ESXi Configuration and the Skyline Health check, and all checks were green.

To make sure everything was OK and I had no issues creating VMs on vSAN, I ran the Proactive Tests, and both passed.

The final step is to change IgnoreClusterMemberListUpdates back to 1. Run the command: esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates
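The setting has to be changed on every vSAN node, and as I learned above, it is easy to forget one. A minimal sketch that prints the command for each host, assuming SSH is enabled on the hosts; the HOSTS list and the function name are hypothetical placeholders.

```shell
#!/bin/sh
# Hypothetical list of ESXi host names; replace with the real vSAN nodes.
HOSTS="esxi01 esxi02 esxi03"

# Prints the command that sets IgnoreClusterMemberListUpdates to the given
# value (0 or 1) on every host in HOSTS. Nothing is executed.
print_member_list_cmds() {
    value=$1
    case $value in
        0|1) ;;
        *) echo "value must be 0 or 1" >&2; return 1 ;;
    esac
    for host in $HOSTS; do
        echo "ssh root@$host esxcfg-advcfg -s $value /VSAN/IgnoreClusterMemberListUpdates"
    done
}

# Example: print the commands for the final step.
#   print_member_list_cmds 1
```

To read the current value on a host before or after the change, esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates can be used.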

After that last step, vSAN was ready, and all issues with the unicastagent were fixed. I hope this post was useful and taught you how to fix similar issues with the vSAN network.


©2020 ProVirtualzone. All Rights Reserved
By | 2020-11-30T23:12:34+01:00 November 28th, 2020|VMware, vSAN, vSphere|4 Comments

About the Author:

I have over 20 years of experience in the IT industry and have worked with virtualization for more than 10 years (mainly VMware). I am an MCP, VCP6.5-DCV, VMware vSAN Specialist, Veeam Vanguard 2018/2019, vExpert vSAN 2018/2019, and a vExpert for the last 4 years. My specialties are virtualization, storage, and virtual backups. I work for Elits, a Swedish consulting company, and am assigned to a Swedish multinational networking and telecommunications company as a Tech Lead, acting as a Senior ICT Infrastructure Engineer. I am a blogger and the owner of the blog ProVirtualzone.com

4 Comments

  1. Jameson 30/11/2020 at 17:50

    Great article! However, your environment went down because of your own mistake in configuration. vSAN can use DHCP sure, but that’s not best practice. What prevents you from using static IPs like best practice recommends? Either way nice article but your problem is likely to re-occur the next time the DHCP leases renew…

    • Luciano Patrao 30/11/2020 at 23:01

      Hi Jameson,

      Thank you for your message.

      Yes, vSAN can use DHCP as long as there are reservations (but I don’t use even those; I only use static IPs for VMkernel ports). But with the switch issue, the reservations may have been wiped and all the information lost. But again, I am not responsible for the implementation or the system; I was only asked to help fix the issue.

      Thank You

      Luciano Patrao

  2. Brian 30/11/2020 at 18:11

    Hi Luciano – according to the VMWare Best Practices documentation https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.vsan-planning.doc/GUID-031F9637-EE29-4684-8644-7A93B9FD8D7B.html under the note: “The following configuration is not supported: vCenter Server deployed on a vSAN 6.6 cluster that is using IP addresses from DHCP without reservations. You can use DHCP with reservations, because the assigned IP addresses are bound to the MAC addresses of VMkernel ports.”

    You didn’t mention in your article if you had reservations on your DHCP or not, but I suspect that is the case. Personally I would never deploy a vSAN environment using DHCP on the kernel ports for vSAN as this introduces a potentially unnecessary problem such as what you experienced. Had you used static addresses you would not have encountered this issue.

    • Luciano Patrao 30/11/2020 at 22:58

      Hi Brian,

      Thanks for your message.

      Yes, you are right. I don’t think I have ever used DHCP on VMkernel ports (any of them) in my implementations.

      I am not responsible for this environment; I was just asked to check the issue and fix the problem.
      Honestly, I don’t know if there were reservations. There may have been, but reinstalling the switch profile may have wiped them. After the issue, all vSAN VMkernel ports were switched to static (as they should be).

      I will add some extra information to the article about when DHCP is supported and when it is not.

      Thanks again for your message.

      Luciano Patrao
