In this blog post, vSphere 7 Update 2/3 HA issue using the i40enu driver, I will talk about a problem with HA that can (and will) happen when you update your vSphere 7 to Update 2c or Update 3.
When you apply these updates to your vSphere 7 environment and HA is enabled, HA starts going crazy and failing. It enables, disables, and every one to two minutes tries to reconfigure the HA settings.
Examples:
Then HA stops with
After a couple of minutes, everything starts again with HA trying to enable and reconfigure.
Looking at the HA log (fdm.log), I get this:
cat /var/log/fdm.log | grep error
warning fdm[4436626] [Originator@6876 sub=Default] Encountered errors while reading log parameters:
error fdm[4436652] [Originator@6876 sub=Cluster opID=SWI-427c3c55] Couldn't find datastore /vmfs/volumes/60df365f-c07a0091-185d-00110a69322c so file not removed
error fdm[4436652] [Originator@6876 sub=Cluster opID=SWI-427c3c55] Couldn't find datastore /vmfs/volumes/5fc36e04-6773f674-5fee-8cdcd4b27d58 so file not removed
error fdm[4436652] [Originator@6876 sub=Cluster opID=SWI-427c3c55] Couldn't find datastore /vmfs/volumes/5fc36e04-6773f674-5fee-8cdcd4b27d58 so file not removed
error fdm[4436652] [Originator@6876 sub=Cluster opID=SWI-427c3c55] Couldn't find datastore /vmfs/volumes/5fc36e04-6773f674-5fee-8cdcd4b27d58 so file not removed
error fdm[4436652] [Originator@6876 sub=Cluster opID=SWI-427c3c55] Couldn't find datastore /vmfs/volumes/5fc36df2-76a2edc0-e52b-8cdcd4b27d58 so file not removed
error fdm[4436652] [Originator@6876 sub=Cluster opID=SWI-427c3c55] Couldn't find datastore /vmfs/volumes/5fc36e04-6773f674-5fee-8cdcd4b27d58 so file not removed
error fdm[4436652] [Originator@6876 sub=Cluster opID=SWI-427c3c55] Couldn't find datastore /vmfs/volumes/5fc36df2-76a2edc0-e52b-8cdcd4b27d58 so file not removed
error fdm[4436639] [Originator@6876 sub=Cluster opID=SWI-46cdbe2] Couldn't find datastore /vmfs/volumes/5fc36e04-6773f674-5fee-8cdcd4b27d58 so file not removed
warning fdm[4436753] [Originator@6876 sub=Cluster opID=kv993wk1-25923-auto-k04-h5:70002114-f3-01-63] No vmknic found for localhost
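If you want to watch this happening live while HA keeps trying to reconfigure, you can also follow the log directly on the host. A quick check using the standard ESXi shell tools (the filter pattern is just a suggestion):

tail -f /var/log/fdm.log | grep -E "error|warning"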
Besides this issue, patching from or to 7.0 Update 2c/2d (or any subsequent Update 2 patch) fails. It is also not possible to apply vSphere 7 Update 3; staging or remediation will not work.
Sometimes we get: “host returned esxupdate code -1”
The driver i40en triggers this problem. The name of this driver was changed in vSphere 7 Update 2 to i40enu, and for some reason, vSphere 7 Update 3 cannot change it back to i40en (by replacing the driver). So when we patch using an image, ESXi ends up with both drivers, because i40en does not replace i40enu during the update, when only one (the proper one) should be running.
So to fix this, we need to remove the previous one (i40enu, the one renamed in Update 2), reboot, and then apply vSphere 7 Update 3, which will replace it with the proper i40en driver, leaving only one driver installed on the ESXi host.
VMware doesn’t have a fix for this issue yet; there is only a workaround, which is to remove i40enu.
How to fix this?
- First, disable HA so that the ESXi hosts stop running the tasks shown above.
- Second, put the ESXi host in maintenance mode before you start the following tasks (see the command sketch right after this list).
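If you prefer the ESXi command line over the vSphere Client for the maintenance mode step, here is a minimal sketch. It assumes the running VMs have already been vMotioned away or powered off, because esxcli will not evacuate them for you:

esxcli system maintenanceMode set --enable true
esxcli system maintenanceMode get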
First, we need to check which driver we have and check your network interfaces to make sure you are using the ones that rely on this driver (Intel(R) X710/XL710/XXV710/X722 adapters).
esxcli software vib list | grep i40en
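To confirm which physical NICs are actually bound to this driver, you can also list the adapters and check the driver column; it should show i40en or i40enu for the affected Intel X710/XL710/XXV710/X722 ports. In the second command, vmnic0 is only an example uplink name, so replace it with one of yours:

esxcli network nic list
esxcli network nic get -n vmnic0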
Next, we will check both drivers in detail.
esxcli software component get | grep Intel-i40en -A15
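If the component view is easier to read on your host, you can also list the installed components and filter for the Intel driver. This is just an alternative way to confirm that both the i40en and i40enu packages are present:

esxcli software component list | grep -i i40en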
After confirming that both drivers are present, remove the recent one (i40enu) and leave the old one (i40en); vSphere 7 Update 3 will then change the driver name again, this time with only one driver installed.
esxcli software vib remove -n i40enu
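Before rebooting, it is worth re-running the earlier check to confirm that i40enu is gone and only the old i40en VIB remains:

esxcli software vib list | grep i40en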
Next, reboot the ESXi host. After the reboot, you can apply vSphere 7 Update 3.
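If you are working over SSH, the reboot itself can also be triggered from the ESXi shell. The host must already be in maintenance mode, and the reason string is free text:

esxcli system shutdown reboot --reason "Removed i40enu before applying vSphere 7 Update 3"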
Post upgrade to 7.0 Update 3, do not apply 7.0U2x based baselines or the “critical host patches” baseline until a fix for this issue is made available in future patches. Otherwise, you may hit a similar issue again.
When all ESXi hosts are fixed and vSphere 7 Update 3 is applied, you can enable HA again, and all ESXi hosts should have their HA agents installed and enabled correctly.
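To double-check that each host is actually running the expected Update 3 build before (or after) turning HA back on, a basic version check is enough:

vmware -vl
esxcli system version get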
Important note: This is not a fix; it is a workaround. VMware will release a fix or an update that resolves the issue when using Update 2 or Update 3.
You can find more information about this problem in VMware KB 85982.
Update 08/11/2021:
It seems there are quite a few known issues (too many) with this vSphere 7 Update 3. Since the release of vSphere 7.0 Update 3, several issues have been reported by VMware.
You can check all the known issues and their fixes/workarounds in VMware KB 86281.
Final Notes:
If you are running into upgrade issues due to the intel-nvme-vmd/iavmd drivers, please refer to VMware KB 85701 for the relevant resolution/workaround.
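If you are not sure whether your hosts even carry these NVMe/VMD packages, a quick check before upgrading can save a surprise (package names can vary slightly between vendor images):

esxcli software vib list | grep -iE "nvme|iavmd"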
It seems that since vSphere 7 was launched, it has still not been stable. With every vSphere 7 update we have some issues; some are minor (that is normal), but some are very serious and cause downtime in production.
All the above issues have a workaround and will only be fixed in the upcoming vSphere 7 Update 3a. So, for now, apply the workaround and wait for the final fix.
I have another support ticket open for a different issue in one of my clusters. That is something I rarely do, but with vSphere 7 I have opened more support tickets than with 6.5 and 6.7 combined.
Share this article if you think it is worth sharing. If you have any questions or comments, comment here, or contact me on Twitter.
©2021 ProVirtualzone. All Rights Reserved
Thanks for the post. There are serious bugs in 7.0U3 (https://kb.vmware.com/s/article/86287). KB 86100 can crash the whole cluster. Hopefully it is solved in 7.0U3a. Be careful with it.
Unfortunately, yes, this vSphere 7 and its updates have a lot of bugs.