Last week we upgraded some of our HP blades (HPE BL460c) from ESXi 6.0 to ESXi 6.5. We used VMware Update Manager (VUM) to upgrade these ESXi hosts with the HPE ESXi image VMware-ESXi-6.5.0-Update1-5969303-HPE-650.U126.96.36.199.14-Jul2017.
The upgrade ran without any issues. After the upgrade we ran a scan in VUM for any updates or patches released after this ESXi release, and then the problems started. We had an HPE driver bundle (version hpe-driver-bundle-6188.8.131.52) and a critical VMware rollup, ESXi650-Update1 (both released on 27/07/2017), to apply. After applying those updates, we got this PSOD.
Looking at the PSOD details, we see “sfcb-intelci IP addr 0x43911b41c000” and also the line “DataAccess5Kbytes@(ixgben)”. From those details, I suspected that a network adapter driver was the root cause of this PSOD, not only because of the PSOD details, but also because the HPE driver bundle we were applying updates the ixgbe driver (for our Intel 82599 10 Gigabit Dual Port adapter on the blade backplane).
So I rolled back the ESXi installation (check HERE how to) and applied the updates again, this time without the HPE driver bundle, only the VMware Update 1 rollup.
But after applying the VMware Update 1 rollup I got the same issue again. So the root cause of this PSOD was not the HPE driver bundle, but the VMware update itself.
This time I needed to check the core dump file (check HERE how to extract it) and troubleshoot the driver that was the root cause of this PSOD. There I could identify that the problem was the ixgbe driver, but I could not understand why (these core dump files are always difficult to read).
Googling to see if anyone was having the same issue, I found some similar PSODs with the ESXi 6.5 Update 1 image, but none from applying only the 6.5 Update 1 rollup.
I discovered that these HPE servers have the same issue:
- HPE DL360p Gen8
- HPE DL380 Gen8
- HPE DL380 Gen9
- HPE DL380p Gen9
- HPE BL460c Gen8
- HPE BL460c Gen9
So the next step was to roll back ESXi again and open a VMware support ticket with the core dump file.
VMware's official statement is:
“The root cause is the Intel ixgbe driver, which is contained in the critical or non-critical standard baselines of the update manager. As soon as the ESXi with these Baselines AND the HPE image is upgraded to 6.5 Update 1, a PSOD is the trigger.”
Currently, there are two ways to get the host to ESXi 6.5 Update 1 (without a PSOD):
- Option 1: Update ONLY with the HPE image, as the driver is not included.
Note: This is the option that we used, but the Update 1 rollup still always shows up in VUM as an update to install. So this is not a real solution.
- Option 2: Install the update but do not restart the host yet.
For Option 2:
Connect to the host via SSH and either:
- Uninstall the ixgbe driver. Check HERE how to.
- Update the driver to version 1.5.3. Download the ixgbe v1.5.3 driver from HERE.
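As a rough illustration of the two SSH alternatives above, here is a minimal Python sketch that only builds the `esxcli` command lines you would type on the host. The VIB name `ixgben` and the bundle path are my assumptions for the example (check the real VIB name with `esxcli software vib list` first), and the function name is made up for this post.

```python
import shlex

def option2_command(action, vib_name="ixgben",
                    bundle_path="/tmp/ixgbe-1.5.3-offline_bundle.zip"):
    """Build the esxcli command for one of the Option 2 alternatives.

    vib_name and bundle_path are placeholders (assumptions), not values
    taken from the VMware statement.
    """
    if action == "list":
        # Show installed VIBs so the real driver name can be confirmed.
        return "esxcli software vib list"
    if action == "remove":
        # Alternative 1: uninstall the driver VIB by name.
        return "esxcli software vib remove -n " + shlex.quote(vib_name)
    if action == "update":
        # Alternative 2: install the newer driver from an offline bundle.
        return "esxcli software vib install -d " + shlex.quote(bundle_path)
    raise ValueError("unknown action: " + action)

if __name__ == "__main__":
    for step in ("list", "remove", "update"):
        print(option2_command(step))
```

Only one of `remove` or `update` is needed, and the host is restarted afterwards.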
VMware: “The development is already in the process of investigating the problem more closely and will contact the manufacturers in this regard to work out a fix”
- All “vmklinux” drivers under 6.5 U1 can cause a PSOD during operation.
- The driver with version 1.4.1 (native) causes the PSOD on reboot.
- The only known driver without problem is the 1.5.3.
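To check which hosts still carry a driver other than 1.5.3, one could parse the output of `esxcli software vib list`. The sketch below assumes the usual column layout (name first, version second); the sample text is invented for illustration and is not output from our hosts. Anything that is not exactly 1.5.3 is flagged as "suspect", since 1.5.3 is the only version reported as problem-free.

```python
import re

KNOWN_GOOD = (1, 5, 3)  # the only version reported without problems

def parse_vib_list(text):
    """Return {vib_name: version_string} from a whitespace-separated listing."""
    vibs = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, the header row and the dashed separator row.
        if len(parts) < 2 or parts[0] == "Name" or set(parts[0]) == {"-"}:
            continue
        vibs[parts[0]] = parts[1]
    return vibs

def leading_version(version_string):
    """Extract the leading dotted number, e.g. '1.4.1-xyz' -> (1, 4, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if not match:
        return None
    return tuple(int(p) for p in match.group(0).split("."))

def ixgbe_status(text):
    """Classify every VIB whose name contains 'ixgbe'."""
    status = {}
    for name, version in parse_vib_list(text).items():
        if "ixgbe" in name:
            good = leading_version(version) == KNOWN_GOOD
            status[name] = "known good" if good else "suspect"
    return status

# Invented sample input (assumption), only to show the expected shape.
SAMPLE = """\
Name      Version      Vendor  Acceptance Level  Install Date
--------  -----------  ------  ----------------  ------------
ixgben    1.4.1-demo   VMW     VMwareCertified   2017-08-01
net-igb   5.0.5-demo   VMW     VMwareCertified   2017-08-01
"""

if __name__ == "__main__":
    print(ixgbe_status(SAMPLE))
```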
Note: There is a 3rd option: recreate a custom ESXi ISO with ixgbe version 1.5.3 (I did not test this, but I may do so in the next few days).
So either we wait for a fix from VMware for this ixgbe driver, or we apply the workaround. My recommendation is to apply the workaround (option 2). The problem with option 2 is that if you have many ESXi hosts, performing these tasks can be hugely time-consuming. If I have time in the next few days, I will try to create a PowerCLI script to perform this task on all ESXi hosts.
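This is not the PowerCLI script mentioned above, just a small Python sketch of the same idea: run one command on many hosts over SSH. It assumes SSH is enabled on each host and that you log in as `root`; the host names are placeholders. By default it is a dry run that only returns the command lines, so nothing is changed until you set `dry_run=False`.

```python
import subprocess

def build_ssh_calls(hosts, remote_command, user="root"):
    """Return one ssh argument list per host."""
    return [["ssh", f"{user}@{host}", remote_command] for host in hosts]

def run_on_hosts(hosts, remote_command, dry_run=True):
    """Run (or, in dry-run mode, just describe) a command on each host."""
    results = {}
    for host, argv in zip(hosts, build_ssh_calls(hosts, remote_command)):
        if dry_run:
            # Only show what would be executed.
            results[host] = " ".join(argv)
        else:
            completed = subprocess.run(argv, capture_output=True, text=True)
            results[host] = completed.returncode
    return results

if __name__ == "__main__":
    # Placeholder host names (assumption).
    hosts = ["esx01.example.local", "esx02.example.local"]
    for host, planned in run_on_hosts(hosts, "esxcli software vib list").items():
        print(host, "->", planned)
```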