In this article, How to fix wrong vSAN Cluster partitions, I will explain how to check for and fix these issues in your vSAN Cluster.
In my vSAN environment, I was getting some warnings about vSAN cluster partitions. When I checked, I noticed that my vSAN Cluster had some partitions.
As we can see in the above image, we have two different partitions. To understand why, we need to check the vSAN Cluster membership on each ESXi host.
Connect to each ESXi host through SSH and run “esxcli vsan cluster get” to check its vSAN Cluster membership.
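The lines that matter for spotting a partition are the node state, Sub-Cluster UUID, Membership UUID, and Member Count. As a small convenience, a filter like the following can trim the output down to just those fields (this `vsan_summary` helper is my own sketch, not part of esxcli):

```shell
# Filter "esxcli vsan cluster get" output down to the partition-relevant fields.
# vsan_summary is a hypothetical helper; pipe the esxcli output into it:
#   esxcli vsan cluster get | vsan_summary
vsan_summary() {
  grep -E 'Local Node (UUID|State)|Sub-Cluster (Master UUID|UUID|Membership UUID|Member Count)'
}
```

Running it on each host makes it much faster to compare four nodes side by side.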
vSAN ESXi Node 04
[root@vSAN-04:~] esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2018-03-16T03:29:28Z
   Local Node UUID: 5915a96b-7b7b-5c1b-04e4-0050569646fa
   Local Node Type: NORMAL
   Local Node State: MASTER
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 5915a96b-7b7b-5c1b-04e4-0050569646fa
   Sub-Cluster Backup UUID: 5a89de16-eb91-32eb-2af1-00505696b4f0
   Sub-Cluster UUID: 52b57974-6769-70cc-346a-b99c5762a232
   Sub-Cluster Membership Entry Revision: 1
   Sub-Cluster Member Count: 2
   Sub-Cluster Member UUIDs: 5915a96b-7b7b-5c1b-04e4-0050569646fa, 5a89de16-eb91-32eb-2af1-00505696b4f0
   Sub-Cluster Membership UUID: 3039ab5a-2040-b3d8-d4b3-005056968e4b
   Unicast Mode Enabled: true
   Maintenance Mode State: OFF
   Config Generation: 692b71b8-9acc-41cd-bdea-bad68c0b8baf 18 2018-03-16T03:26:34.8852
vSAN ESXi Node 02
[root@vSAN-02:~] esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2018-03-16T03:47:25Z
   Local Node UUID: 5a89de16-eb91-32eb-2af1-00505696b4f0
   Local Node Type: NORMAL
   Local Node State: BACKUP
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 5915a96b-7b7b-5c1b-04e4-0050569646fa
   Sub-Cluster Backup UUID: 5a89de16-eb91-32eb-2af1-00505696b4f0
   Sub-Cluster UUID: 52b57974-6769-70cc-346a-b99c5762a232
   Sub-Cluster Membership Entry Revision: 1
   Sub-Cluster Member Count: 2
   Sub-Cluster Member UUIDs: 5915a96b-7b7b-5c1b-04e4-0050569646fa, 5a89de16-eb91-32eb-2af1-00505696b4f0
   Sub-Cluster Membership UUID: c23dab5a-e0be-04e2-ce76-005056968e4b
   Unicast Mode Enabled: true
   Maintenance Mode State: OFF
   Config Generation: 692b71b8-9acc-41cd-bdea-bad68c0b8baf 18 2018-03-16T03:46:05.983
vSAN ESXi Node 03
[root@vSAN-03:~] esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2018-03-16T03:47:27Z
   Local Node UUID: 5915a96b-7b7b-5c1b-04e4-0050569646fa
   Local Node Type: NORMAL
   Local Node State: MASTER
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 5915a96b-7b7b-5c1b-04e4-0050569646fa
   Sub-Cluster Backup UUID: 5a89c7a3-0bca-7bb5-bc26-00505696e28b
   Sub-Cluster UUID: 52b57974-6769-70cc-346a-b99c5762a232
   Sub-Cluster Membership Entry Revision: 1
   Sub-Cluster Member Count: 2
   Sub-Cluster Member UUIDs: 5915a96b-7b7b-5c1b-04e4-0050569646fa, 5a89c7a3-0bca-7bb5-bc26-00505696e28b
   Sub-Cluster Membership UUID: c13dab5a-fff4-d93e-0f0e-0050569646fa
   Unicast Mode Enabled: true
   Maintenance Mode State: ON
   Config Generation: 692b71b8-9acc-41cd-bdea-bad68c0b8baf 18 2018-03-16T03:46:05.145
vSAN ESXi Node 01
[root@vSAN-01:~] esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2018-03-16T03:47:18Z
   Local Node UUID: 5a89c7a3-0bca-7bb5-bc26-00505696e28b
   Local Node Type: NORMAL
   Local Node State: BACKUP
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 5915a96b-7b7b-5c1b-04e4-0050569646fa
   Sub-Cluster Backup UUID: 5a89c7a3-0bca-7bb5-bc26-00505696e28b
   Sub-Cluster UUID: 52b57974-6769-70cc-346a-b99c5762a232
   Sub-Cluster Membership Entry Revision: 1
   Sub-Cluster Member Count: 2
   Sub-Cluster Member UUIDs: 5915a96b-7b7b-5c1b-04e4-0050569646fa, 5a89c7a3-0bca-7bb5-bc26-00505696e28b
   Sub-Cluster Membership UUID: c13dab5a-fff4-d93e-0f0e-0050569646fa
   Unicast Mode Enabled: true
   Maintenance Mode State: ON
   Config Generation: 692b71b8-9acc-41cd-bdea-bad68c0b8baf 18 2018-03-16T03:46:05.160
As we can see in the above information, I somehow have two vSAN Cluster partitions, each with two ESXi hosts: nodes 01 and 03 in one, and nodes 02 and 04 in the other, each partition with its own Master. All four hosts report the same Sub-Cluster UUID (52b57974-6769-70cc-346a-b99c5762a232), but they disagree on membership: nodes 01 and 03 share the Sub-Cluster Membership UUID c13dab5a-fff4-d93e-0f0e-0050569646fa, while node 02 reports c23dab5a-e0be-04e2-ce76-005056968e4b. In a healthy cluster, all nodes share the same Membership UUID.
Each vSAN Cluster has one Master and one Backup node. In a four-node vSAN Cluster, I should have 1 Master, 1 Backup, and 2 Agents, all sharing the same vSAN Cluster UUID and Sub-Cluster Master UUID.
This can happen if you remove an ESXi host from the Cluster and add it again, or if a host loses its connection and you need to reconnect it. Since this is a vSAN playground, I am pretty sure one of my tests caused this.
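With more hosts, comparing outputs by eye gets tedious. One quick sketch, if you save each host's `esxcli vsan cluster get` output to a file, is to count the distinct Membership UUIDs (the file names and sample lines below are illustrative, not from a live host):

```shell
# Each nodeNN.txt holds one host's "esxcli vsan cluster get" output.
# Only the relevant line is reproduced here; the file names are illustrative.
printf '   Sub-Cluster Membership UUID: c13dab5a-fff4-d93e-0f0e-0050569646fa\n' > node01.txt
printf '   Sub-Cluster Membership UUID: c23dab5a-e0be-04e2-ce76-005056968e4b\n' > node02.txt

# Count the distinct Membership UUIDs across all hosts.
# A healthy cluster shows 1; anything more means the cluster is partitioned.
grep -h 'Sub-Cluster Membership UUID' node*.txt | sort -u | wc -l
```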
How to fix it
To fix the wrong vSAN Cluster partitions, we need to remove the ESXi hosts from the bogus partition and bring all nodes back into a single vSAN Cluster.
First, I removed ESXi node 02 from its partition so that it could join the proper vSAN Cluster.
Use the following command to remove an ESXi node from a vSAN Cluster:
[root@vSAN-02:~] esxcli vsan cluster leave
Immediately after I removed this ESXi node from its partition, it showed up as a member of the other vSAN Cluster. Something I thought was strange.
[root@vSAN-02:~] esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2018-03-16T03:54:18Z
   Local Node UUID: 5a89de16-eb91-32eb-2af1-00505696b4f0
   Local Node Type: NORMAL
   Local Node State: AGENT
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 5915a96b-7b7b-5c1b-04e4-0050569646fa
   Sub-Cluster Backup UUID: 5a89c7a3-0bca-7bb5-bc26-00505696e28b
   Sub-Cluster UUID: 52b57974-6769-70cc-346a-b99c5762a232
   Sub-Cluster Membership Entry Revision: 2
   Sub-Cluster Member Count: 3
   Sub-Cluster Member UUIDs: 5915a96b-7b7b-5c1b-04e4-0050569646fa, 5a89c7a3-0bca-7bb5-bc26-00505696e28b, 5a89de16-eb91-32eb-2af1-00505696b4f0
   Sub-Cluster Membership UUID: c13dab5a-fff4-d93e-0f0e-0050569646fa
   Unicast Mode Enabled: true
   Maintenance Mode State: OFF
   Config Generation: 692b71b8-9acc-41cd-bdea-bad68c0b8baf 18 2018-03-16T03:53:05.918
Both Node 01 and 03 now show 3 members, with Node 02 included in the vSAN Cluster partition with the Membership UUID c13dab5a-fff4-d93e-0f0e-0050569646fa.
Now I need to do the same for ESXi Node 04 and display the vSAN Cluster information for the node.
[root@vSAN-04:~] esxcli vsan cluster leave
[root@vSAN-04:~] esxcli vsan cluster get
vSAN Clustering is not enabled on this host
[root@vSAN-04:~]
But this ESXi node was not automatically added to the other vSAN Cluster. That is the expected behavior: once removed from a vSAN Cluster, a host should be a standalone ESXi node with vSAN disabled.
Normally, we would then run the join cluster command to add it to a vSAN Cluster.
First, I ran the get command on one of the healthy nodes to obtain the proper Cluster UUID, then ran the join cluster command with that UUID.
[root@vSAN-03:~] esxcli vsan cluster get | grep "Sub-Cluster UUID"
   Sub-Cluster UUID: 52b57974-6769-70cc-346a-b99c5762a232

[root@vSAN-04:~] esxcli vsan cluster join -u 52b57974-6769-70cc-346a-b99c5762a232
But, surprisingly, ESXi Node 04 did not join the vSAN Cluster as an Agent, as it was supposed to; instead, it created a new vSAN Cluster again with itself as Master.
I removed the ESXi node from that vSAN Cluster, removed the host from the vCenter vSAN Cluster, and tried again and again, always with the same result. So I needed to recheck all the information to find where the problem was.
That is when I noticed that the ESXi Node 04 system UUID was the same as that of ESXi Node 03. This was the cause of the strange vSAN Cluster behavior. Since ESXi Nodes 01, 02, and 03 were already fixed and running in one vSAN Cluster, any changes needed to be made on ESXi Node 04.
Run the following command to check the System UUID on all ESXi Nodes and confirm that ESXi Node 04 and ESXi Node 03 have the same UUID.
[root@vSAN-03:~] esxcli system uuid get
5915a96b-7b7b-5c1b-04e4-0050569646fa
[root@vSAN-03:~]

[root@vSAN-04:~] esxcli system uuid get
5915a96b-7b7b-5c1b-04e4-0050569646fa
[root@vSAN-04:~]
Note: Since this is a Nested vSAN with Nested ESXi, I probably forgot to recreate the System UUID when I deployed the Template or restored one of the ESXi hosts (discussed in my previous article). Regardless of the root cause, we need to fix it.
How to fix the wrong System UUID?
There is a procedure to fix this, as I have shown in a previous article How to deploy Nested vSAN.
[root@vSAN-04:~] esxcli system settings advanced set -o /Net/FollowHardwareMac -i 1
[root@vSAN-04:~] sed -i 's#/system/uuid.*##' /etc/vmware/esx.conf
[root@vSAN-04:~] /sbin/auto-backup.sh
Afterward, reboot the ESXi host so that it generates a new System UUID.
After the reboot, check the System UUID to confirm that it changed.
[root@vSAN-04:~] esxcli system uuid get
5aab4e37-d8d4-65d0-3931-005056968e4b
[root@vSAN-04:~]
Now I can try to rejoin the host to vSAN. First, I will check the vSAN Cluster UUID once more.
[root@vSAN-03:~] esxcli vsan cluster get | grep "Sub-Cluster UUID"
   Sub-Cluster UUID: 52b57974-6769-70cc-346a-b99c5762a232
[root@vSAN-03:~]

[root@vSAN-04:~] esxcli vsan cluster join -u 52b57974-6769-70cc-346a-b99c5762a232
Displaying the vSAN Cluster information for the previously faulty ESXi Node, I can now see that it belongs to the right vSAN Cluster, and I have a 4-node vSAN Cluster.
Note: Moving the ESXi host back to vCenter vSAN Cluster will also rejoin the vSAN Cluster.
[root@vSAN-04:~] esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2018-03-16T05:16:10Z
   Local Node UUID: 5aab4e37-d8d4-65d0-3931-005056968e4b
   Local Node Type: NORMAL
   Local Node State: AGENT
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 5915a96b-7b7b-5c1b-04e4-0050569646fa
   Sub-Cluster Backup UUID: 5a89c7a3-0bca-7bb5-bc26-00505696e28b
   Sub-Cluster UUID: 52b57974-6769-70cc-346a-b99c5762a232
   Sub-Cluster Membership Entry Revision: 3
   Sub-Cluster Member Count: 4
   Sub-Cluster Member UUIDs: 5915a96b-7b7b-5c1b-04e4-0050569646fa, 5a89c7a3-0bca-7bb5-bc26-00505696e28b, 5a89de16-eb91-32eb-2af1-00505696b4f0, 5aab4e37-d8d4-65d0-3931-005056968e4b
   Sub-Cluster Membership UUID: c13dab5a-fff4-d93e-0f0e-0050569646fa
   Unicast Mode Enabled: true
   Maintenance Mode State: ON
   Config Generation: 692b71b8-9acc-41cd-bdea-bad68c0b8baf 24 2018-03-16T05:15:56.939
Now we have a working vSAN Cluster again, and all VMs are available (while the vSAN partition issue existed, the VMs were in an inaccessible state).
Note: Afterwards, I got some warnings regarding vSAN Disk Balance. That is normal, since we had issues in the partitions, and now we need to run “Proactive Rebalance Disks.” I will come back to this and other health procedures for fixing vSAN issues.
I hope this article provides some help on how to troubleshoot and fix some of these vSAN issues.
Share this article if you think it is worth sharing. If you have any questions or comments, comment here, or contact me on Twitter.
©2018 ProVirtualzone. All Rights Reserved