A couple of weeks ago, we hit a bizarre issue with ESXi partitions in a vSphere 7 Cluster. In this Virtual Flash File System partitions issue blog post, I will try to explain what happened and the root cause.
In this 12-host ESXi Cluster, 9 hosts had the issue; only 3 of the ESXi hosts didn't have the problem. All ESXi hosts were running vSphere 7.0 Update 2b, build 18538813.
What was the initial problem? First, it was not possible to stage any updates or make any changes to the ESXi settings. HA was also throwing a lot of errors and had to be disabled.
Since this happened after the SD/USB card issue and the i40enu driver issue, I started to suspect another similar problem. But this Cluster uses local SSDs, not SD cards, so the SD card theory was discarded; it had to be something else.
Time to start the troubleshooting.
Checking the i40enu driver:
[root@ESXi-Host:~] esxcli software vib list | grep i40enu
Failed to query file system stats: Errors:
Error getting data for filesystem on '/vmfs/volumes/60bcebdb-d2eecd96-4584-a0369f86bc2c': Cannot open volume: /vmfs/volumes/60bcebdb-d2eecd96-4584-a0369f86bc2c, skipping.
cause = Errors:
Error getting data for filesystem on '/vmfs/volumes/60bcebdb-d2eecd96-4584-a0369f86bc2c': Cannot open volume: /vmfs/volumes/60bcebdb-d2eecd96-4584-a0369f86bc2c, skipping.
Please refer to the log file for more details.
Checking the ESXi partitions:
When I tried to check the space of the partitions (sometimes one gets full and causes similar problems), I was surprised to get this:
[root@ESXi-Host:~] df -h
VmFileSystem: Slow refresh failed: Cannot open volume: /vmfs/volumes/60bccabb-28ca69e6-d220-a0369f86bc34
Error when running esxcli, return status was: 1
Errors:
Error getting data for filesystem on '/vmfs/volumes/60bccabb-28ca69e6-d220-a0369f86bc34': Cannot open volume: /vmfs/volumes/60bccabb-28ca69e6-d220-a0369f86bc34, skipping
No partitions were shown. I checked the rest of the ESXi hosts as well; all 9 affected hosts showed the same. No partitions.
I tried to find out why. All devices were OK, and datastores were working; no issues there. After many hours of troubleshooting without finding a root cause, I decided to open a VMware support ticket.
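For reference, these are the kind of standard checks I mean; a minimal sketch (the datastore name here is just an example, and the filesystem listing may fail with the same "Cannot open volume" error while the issue is present):

# List all storage devices and confirm they are attached and healthy
esxcli storage core device list

# List the mounted filesystems/volumes
esxcli storage filesystem list

# Query the attributes of a specific datastore (hypothetical name)
vmkfstools -P /vmfs/volumes/Datastore01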
That one was also a challenge. After the usual initial discussions, the ticket had to be escalated to a senior support engineer, and after that to engineering (which means an unknown bug), since it was not a common/known issue.
Initially, they suspected something about the host UUID changing after a reboot (I don't know why they thought that).
“As is the case here, a NIC failure could result in the MAC picked to generate host UUID during boot to be different. This would result in a mismatch and failure to mount VMFS-L volume.”
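If you want to rule this theory out on your own hosts, you can compare the host UUID across reboots; a minimal check, assuming the UUID is still persisted in /etc/vmware/esx.conf as in earlier releases:

# Show the current host UUID (partly derived from a NIC MAC address)
esxcli system uuid get

# Compare with the UUID persisted in the host configuration
grep "/system/uuid" /etc/vmware/esx.conf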
But that was not the case and not the root cause.
So this ticket went back and forth for about 6-7 weeks, and finally, they came back with the root cause and the issue.
The issue was caused by the Virtual Flash File System (VFFS) partition. Somehow, the VFFS was corrupted, and that affected all partitions on the ESXi host. I don't know if all partitions depend on the VFFS when it is enabled, but this was the root cause.
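If you want to see which device backs the VFFS volume on a host, something like this should work (standard esxcli commands; on an affected host the filesystem listing may fail with the same error shown above):

# List the SSD devices backing the virtual flash resource
esxcli storage vflash device list

# On a healthy host, the VFFS volume shows up in the filesystem list
esxcli storage filesystem list | grep -i vffs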
I was even able to reproduce this error by manually deleting the VFFS partition with partedUtil delete "vmfs/devices/disks/naa.6000c29ebbf0754950827dcbd7051475" 1.
Then, when I tried to list the ESXi partitions, I got the same type of error:
[root@ESXi67-vSAN70-02:~] partedUtil delete "vmfs/devices/disks/naa.6000c29ebbf0754950827dcbd7051475" 1
[root@ESXi67-vSAN70-02:~] df -h
VmFileSystem: Slow refresh failed: Cannot open volume: /vmfs/volumes/61fbf63a-1aaf36d2-eb77-005056962d75
Error when running esxcli, return status was: 1
Errors:
Error getting data for filesystem on '/vmfs/volumes/61fbf63a-1aaf36d2-eb77-005056962d75': Cannot open volume: /vmfs/volumes/61fbf63a-1aaf36d2-eb77-005056962d75, skipping.
[root@ESXi67-vSAN70-02:~]
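If you want to try this in a lab, dump the partition table first so you know exactly what you are about to delete (partedUtil getptbl is a standard command; the device below is the one from my lab, yours will differ):

# Show the partition table of the device before touching it
partedUtil getptbl "/vmfs/devices/disks/naa.6000c29ebbf0754950827dcbd7051475"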
This means that somehow the VFFS partition gets corrupted (and, in this case, becomes inaccessible), and then we lose visibility of the rest of the ESXi partitions. Honestly, I still don't understand why; it needs more investigation to get more answers.
So to fix the issue, we need to reformat the partition (or remove and recreate the Virtual Flash resource).
How to do this?
We can remove the Virtual Flash Resource and recreate it, which will fix the issue.
Go to the ESXi host, Configure, and under Virtual Flash, select Virtual Flash Resource Management.
After removing the Virtual Flash, I saw the partitions again.
But, as we can see above, there is no longer a VFFS partition listed.
On some ESXi hosts, I could not remove the Virtual Flash and got an error instead.
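The error usually complains that the resource is still in use. Before erasing anything, it is worth checking whether any VM still has Flash Read Cache configured on the resource; a minimal check, assuming the vflash namespace is available on your build (host swap cache is configured in the same Virtual Flash section of the UI):

# List per-VM virtual flash read cache configurations
esxcli storage vflash cache list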
For these cases, I needed to erase the partition.
Again in the ESXi host's Configure tab, in the Storage area, go to Storage Devices and find the device used by the Virtual Flash. Be very careful: check the Identifier/Name ("naa.xxxxxxxx") and the disk size against what is shown in Virtual Flash Resource Management, to make sure you are erasing the correct partition.
After the above procedure, the VFFS partition is deleted.
So now that we have deleted the VFFS partition and fixed the issue, let us create the Virtual Flash resource again.
As shown above, go to the ESXi host, Configure, and under Virtual Flash, select Virtual Flash Resource Management, click Add Capacity, and then select the disks you want to use for the Virtual Flash Resource.
After that, your Virtual Flash resource and your VFFS partition are created.
Now we can see all partitions, and the issue is finally fixed.
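To double-check from the shell, the same commands that failed before should now return clean output (a quick sketch; the output will vary per host):

# Partitions and volumes should be listed again
df -h

# The recreated VFFS volume should appear in the filesystem list
esxcli storage filesystem list | grep -i vffs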
I hope this Virtual Flash File System partitions issue blog post will help you if you encounter a similar error. I also hope that VMware fixes this bug in a future release.
Share this article if you think it is worth sharing. If you have any questions or comments, comment here, or contact me on Twitter.
How did you remove the virtual flash resources? I keep getting an error that they’re in use. It says to make sure the host isn’t using it for swap cache or VM read cache, which I confirmed, but I still get the error.
Did you try to erase the partitions where the swap cache is located? Do that carefully, and always double-check that you are erasing the right partition.