Last week we upgraded some of our HPE blades (HPE BL460c) from ESXi 6.0 to ESXi 6.5. We used VMware Update Manager (VUM) to upgrade these ESXi hosts with the HPE custom image VMware-ESXi-6.5.0-Update1-5969303-HPE-650.U1.10.1.0.14-Jul2017.
The upgrade ran without any issues. After the upgrade we ran a scan in VUM for any updates/patches released after this ESXi release, and then the problems started. We had an HPE driver bundle (version hpe-driver-bundle-650.9.6.5) and a critical VMware rollup, ESXi650-Update1 (both released on 27/07/2017), to apply. After applying those updates, we got this PSOD.
Looking at the PSOD details, we see “sfcb-intelci IP addr 0x43911b41c000” and, on another line, “DataAccess5Kbytes@(ixgben)”. From those details I suspected that a network adapter driver was the root cause of this PSOD, not only because of the PSOD details, but also because the HPE driver bundle we were applying updates the ixgbe driver (for our Intel 82599 10 Gigabit Dual Port adapter on the blade backplane).
So I rolled back the ESXi installation (check HERE how to) and applied the updates again, this time without the HPE driver bundle, only the VMware Update 1 rollup.
But after applying the VMware Update 1 rollup I got the same issue again. So the root cause of this PSOD was not the HPE driver bundle, but the VMware update itself.
This time I needed to check the core dump file (check HERE how to extract it) and troubleshoot the driver behind this PSOD. There I could identify that the problem was the ixgbe driver, but could not understand why (these core dump files are always difficult to read).
Googling to see if anyone was having the same issue, I found some similar PSODs with the ESXi 6.5 Update 1 image, but none from applying only the 6.5 Update 1 rollup.
I discovered that these HPE servers have the same issue:
- HPE DL360p Gen8
- HPE DL380 Gen8
- HPE DL380 Gen9
- HPE DL380p Gen9
- HPE BL460c Gen8
- HPE BL460c Gen9
So the next step was to roll back ESXi again and open a VMware support ticket with the core dump file.
VMware's official statement is:
“The root cause is the Intel ixgbe driver, which is contained in the critical or non-critical standard baselines of the update manager. As soon as the ESXi with these Baselines AND the HPE image is upgraded to 6.5 Update 1, a PSOD is the trigger.”
Currently, there are two ways to get the host on ESXi 6.5 Update 1 (without PSOD):
- Update ONLY with the HPE image, as the driver is not included.
Note: This is the option that we use, but the Update 1 rollup still always shows up as pending in VUM, so this is not a real solution.
- Install the update but do not restart the host yet.
For Option 2:
Connect to the host via SSH and either:
- Uninstall the ixgbe driver. Check HERE how to.
- Update the driver to version 1.5.3. Download the ixgbe v1.5.3 driver from HERE.
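The two variants of option 2 can be sketched as a small dry-run script. This is only an illustration under assumptions: the VIB name net-ixgbe is taken from the `vib list` output quoted in the comments, the bundle path is a hypothetical placeholder, and the helper names are made up for this sketch. By default nothing is executed; the commands are only printed.

```shell
#!/bin/sh
# Sketch of "option 2": run one of these after installing the rollup
# but BEFORE the first reboot. DRY_RUN=1 (default) only prints commands.
DRY_RUN="${DRY_RUN:-1}"
# Assumption: VIB name as shown by "localcli software vib list".
VIB_NAME="${VIB_NAME:-net-ixgbe}"
# Hypothetical placeholder path to the downloaded 1.5.3 offline bundle.
BUNDLE="${BUNDLE:-/vmfs/volumes/datastore1/driver-offline-bundle.zip}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Variant A: uninstall the ixgbe driver VIB.
remove_ixgbe() {
    run esxcli software vib remove -n "$VIB_NAME"
}

# Variant B: install the 1.5.3 driver from an offline bundle.
install_driver_bundle() {
    run esxcli software vib install -d "$BUNDLE"
}
```

Sourcing the script and calling `remove_ixgbe` or `install_driver_bundle` prints the command that would run; setting `DRY_RUN=0` on the host executes it instead.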
VMware: “The development is already in the process of investigating the problem more closely and will contact the manufacturers in this regard to work out a fix”
Conclusion:
- All “vmklinux” drivers under 6.5 U1 can cause a PSOD during operation.
- The driver with version 1.4.1 (native) causes the PSOD on reboot.
- The only known driver without problems is 1.5.3.
Note: There is a third option: recreate a custom ESXi ISO with ixgbe version 1.5.3 (I did not test this, but I may do so in the next days).
So either we wait for VMware to fix this ixgbe driver, or we apply the workaround. My recommendation is to apply the workaround (option 2). The problem with option 2 is that if you have many ESXi hosts, performing these tasks can be hugely time consuming. If I have time in the next days, I will try to create a PowerCLI script to perform this task on all ESXi hosts.
We had the same problem on a DL360p Gen9 with dual 560SFP+ (Intel 82599) adapters.
The problem is that both the ixgbe and ixgben drivers are installed and the wrong one is loaded.
Step 1: Update the host to the HPE custom ISO 6.5 U1 (27/07/2017).
After the update, connect via SSH and check which driver HPE installed:
localcli software vib list | grep -i ixgb
net-ixgbe 4.5.1-1OEM.600.0.0.2494585 INT VMwareCertified 2017-09-12
If you install the rollup right now you will end up with a PSOD.
Step 2: Remove the sfcb-intel CIM provider:
/etc/init.d/sfcbd-watchdog stop
esxcli software vib remove -n=intelcim-provider
/etc/init.d/sfcbd-watchdog start
Step 3: Start remediation. The host will reboot.
Step 4: Connect via SSH and look at the installed drivers:
localcli software vib list | grep -i ixgb
net-ixgbe 4.5.1-1OEM.600.0.0.2494585 INT VMwareCertified 2017-09-12
ixgben 1.4.1-2vmw.650.1.26.5969303 VMW VMwareCertified 2017-09-12
==> Take a look at the supported driver: http://partnerweb.vmware.com/comp_guide2/detail.php?deviceCategory=io&productid=21839&releaseid=367&deviceCategory=io&details=1&VID=8086&DID=10fb&SVID=103c&SSID=17d3&page=1&display_interval=10&sortColumn=Partner&sortOrder=Asc
In our case, the ixgben driver was causing issues in the host events (vmkernel.log) because it was trying to enable flow control for transmit frames, which the driver does not support. Beware that no alarm is triggered in vCenter (nor in Veeam ONE).
every 15s: (unsupported) Device 10fb does not support flow control autoneg
Step 5: Enable the good driver and disable the wrong one:
esxcli system module set -e=true -m=ixgbe
esxcli system module set -e=false -m=ixgben
/sbin/auto-backup.sh
Step 6: Reboot the host.
Step 7 (optional): Reinstall the sfcb Intel CIM provider.
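As a rough aid for the checks in the steps above, the following sketch reads `localcli software vib list` output and reports which ixgbe-family VIBs are installed, and counts the flow control messages in a log file. The function names are made up for this illustration, and it assumes the VIB names and the log message text exactly as quoted above.

```shell
#!/bin/sh
# Reads "localcli software vib list" output on stdin and prints
# name and version of the ixgbe-family driver VIBs.
ixgb_vibs() {
    awk '$1 == "net-ixgbe" || $1 == "ixgben" { print $1, $2 }'
}

# Prints "both" when net-ixgbe and ixgben are both installed,
# otherwise the single VIB name found, or "none".
ixgb_state() {
    awk '
        $1 == "net-ixgbe" { legacy = 1 }
        $1 == "ixgben"    { native = 1 }
        END {
            state = "none"
            if (legacy) state = "net-ixgbe"
            if (native) state = "ixgben"
            if (legacy && native) state = "both"
            print state
        }'
}

# Counts the lines in a log file (first argument) that contain
# the flow control message quoted above.
flow_control_hits() {
    grep -c "does not support flow control autoneg" "$1"
}
```

On a host this could be used as `localcli software vib list | ixgb_state` and `flow_control_hits /var/log/vmkernel.log`. A result of "both" matches the situation described in step 4, where the choice of the loaded module matters.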
Hi Eric,
Thank you for your reply and thank you for sharing your process.
VMware's statement is that the ixgbe driver that avoids this issue is 1.5.3.
I did not test the sfcb part, but I think the issue is not associated with it.
Regarding enabling flow control, I remember there was a similar issue in 5.0 or 5.5. I don’t know if this is the same one.
The issue with all the workarounds is that if we apply the VUM vSphere 6.5 Update 1 rollup the problem returns, because it replaces the drivers. After we apply the update there is no way to stop the reboot.
So the only option is, as VMware told us, to apply the workaround and not apply the rollup.
In my last communication, VMware said: “The development is already in the process of investigating the problem more closely and will contact the manufacturers in this regard to work out a fix”
The resolution steps I posted are what VMware support provided to me.
Removing the sfcb-intelci CIM provider allows you to reboot after the rollup update (because it is this process that triggers the PSOD when attempting an action not supported by the active driver, as shown on your PSOD screenshot).
Then you can activate the ixgbe driver provided by HPE, in the case of the 6.5 U1 custom ISO: net-ixgbe 4.5.1-1OEM.600.0.0.2494585
Well, that is different from what they replied to me in the support ticket.
But I will test it, and if it works in our case I will update the article with that. Of course I will credit you for the solution 😉
Good Morning,
Did you hear anything new from VMware?
The problem still occurs with the HP ISO VMware ESXi 6.5.0 Update 1-5969303-HPE-650.U1.10.1.3.3-Oct2017 and the update to build 6765664.
Thanks and best regards
Hi Phil,
I did not get any update from VMware regarding this issue. And honestly I didn’t have much time to test some of the workarounds. For now we have not applied the upgrade to our production systems, but it is still on my list to do some tests.
On which HP servers did you install the latest HPE 6.5 ISO?
Thank You
Luciano Patrao
Hey folks,
I do not want to take credit for your groundwork but in my case a simpler way worked (relates to your option 2):
1) Install the ixgben-driver 1.5.3 before you do anything else (no need for uninstallation of the old one)
2) Reboot and verify operations
3) Apply patches, including rollup to ESXi 6.5 U1
The native driver will force VUM to omit the ixgbe-related parts of the updates.
I am currently in the verification phase to see if this brings any new errors.
localcli software vib list | grep -i "ixg"
ixgben 1.5.3-1OEM.600.0.0.2768847 INT VMwareCertified 2017-10-16
net-ixgbe 4.5.1-1OEM.600.0.0.2494585 INT VMwareCertified 2017-07-27
localcli system version get
VersionGet:
Product: VMware ESXi
Version: 6.5.0
Build: Releasebuild-6765664
Update: 1
Patch: 29
If I find time I’ll write a more detailed blog about it, until then I hope someone can use this info.
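To verify the build after patching, the build number can be pulled out of the `localcli system version get` output shown above. This is a minimal sketch that assumes the "Build: Releasebuild-NNNN" line format from that output; the function name is made up for this example.

```shell
#!/bin/sh
# Reads "localcli system version get" output on stdin and prints
# only the numeric build, e.g. from "Build: Releasebuild-6765664".
esxi_build() {
    sed -n 's/^ *Build: *Releasebuild-\([0-9][0-9]*\).*$/\1/p'
}
```

Usage on a host would be `localcli system version get | esxi_build`.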
Hi Dominik,
We are all here to share knowledge, so no problem; all information and examples are always welcome.
Regarding your example, that was one of my first tries and it did not work. After I applied the rollup I got the PSOD. So that workaround did not work for me.
Thanks for the information, I will try to update some more hosts over the next days to find a common denominator.
There is a new ISO released today (2017-11-03) which may help resolve this issue:
VMware-ESXi-6.5.0-Update1-6765664-HPE-650.U1.10.1.5.26-Oct2017.iso
Hi Maciek,
Thanks, I have noticed it.
I plan to try it next week to see if it really fixes the issue.
Thank You again for the update.
I just tried the new image and the host now runs for 1.5 minutes and then PSODs.
HP ML110 with a P400 controller, running ESXi off USB.
Hi Mike,
Thanks for your update.
I will only be able to test next weekend, since that is when we can have some downtime for some of our systems.