PSOD after applying the ESXi 6.5 Update 1 Rollup on HP DLs

Last week we upgraded some of our HP blades (HPE BL460c) from ESXi 6.0 to ESXi 6.5. We used VMware Update Manager (VUM) to upgrade these ESXi hosts with the HPE ESXi image VMware-ESXi-6.5.0-Update1-5969303-HPE-650.U1.10.1.0.14-Jul2017.

The upgrade ran without any issues. After the ESXi upgrade we ran a scan in VUM for any updates/patches released after this ESXi release, and then the problems started. We had an HPE driver bundle (version hpe-driver-bundle-650.9.6.5) and also a critical VMware Rollup, ESXi650-Update1 (both released on 27/07/2017), to apply. After applying those updates, we got this PSOD.

In the PSOD details we see “sfcb-intelci IP addr 0x43911b41c000” and, in another line, “DataAccess5Kbytes@(ixgben)”. Based on those details, I suspected that a network adapter driver was the root cause of this PSOD, not only because of the PSOD details, but also because the HPE driver bundle we were applying updates the ixgbe driver (for our Intel 82599 10 Gigabit Dual Port adapter on the blade backplane).

So I rolled back the ESXi installation (check HERE for how to do it) and applied the updates again, this time without the HPE driver bundle, only the VMware Update 1 Rollup.

But after applying the VMware Update 1 Rollup I got the same issue again. So the root cause of this PSOD was not the HPE driver bundle, but the VMware update itself.

This time I needed to check the core dump file (check HERE for how to extract it) and troubleshoot the driver that was the root cause of this PSOD. There I could identify that the problem was the ixgbe driver, but I could not understand why (these core dump files are always very difficult to read).

Googling to see if anyone else was having the same issue, I found some similar PSODs with the ESXi 6.5 Update 1 image, but none caused by applying only the 6.5 Update 1 Rollup.

I discovered that these HPE servers have the same issue:

  • HPE DL360p Gen8
  • HPE DL380 Gen8
  • HPE DL380 Gen9
  • HPE DL380p Gen9
  • HPE BL460c Gen8
  • HPE BL460c Gen9

So the next step was to roll back ESXi again and open a VMware support ticket using the core dump file.

VMware's official statement is:

“The root cause is the Intel ixgbe driver, which is contained in the critical or non-critical standard baselines of the update manager.  As soon as the ESXi with these Baselines AND the HPE image is upgraded to 6.5 Update 1, a PSOD is the trigger.”

Currently, there are two ways to get the host on ESXi 6.5 Update 1 (without PSOD):

  1. Update ONLY with the HPE image, as the driver is not included.
    Note: This is the option we use, but the Update 1 Rollup still always shows up as pending in VUM. So this is not a real solution.
  2. Install the update, but do not restart the host yet.

For Option 2, connect to the host via SSH and either:

  • Uninstall the ixgbe driver. Check HERE for how to do it.
  • Update the driver to version 1.5.3. Download ixgbe v1.5.3 from HERE.
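
As a rough sketch of Option 2, the two variants can be written down as esxcli command lines built from Python. This is only an illustration under assumptions: the VIB name net-ixgbe is taken from the comments below, the VIB file path is hypothetical, and the function name option2_commands is made up for this example. Nothing is executed here; the function only returns the command to run on the host over SSH.

```python
from typing import List, Optional


def option2_commands(vib_to_remove: Optional[str] = None,
                     vib_file: Optional[str] = None) -> List[List[str]]:
    """Return the esxcli command for ONE of the two Option 2 variants.

    vib_to_remove: name of the driver VIB to uninstall (assumption: net-ixgbe).
    vib_file:      absolute path on the host to the newer driver VIB
                   (hypothetical path, adjust to where you copied the file).
    """
    if (vib_to_remove is None) == (vib_file is None):
        raise ValueError("choose exactly one: remove a VIB or install a VIB file")
    if vib_to_remove is not None:
        # Variant 1: uninstall the driver VIB by name.
        return [["esxcli", "software", "vib", "remove", "-n", vib_to_remove]]
    # Variant 2: install the newer driver from a VIB file.
    return [["esxcli", "software", "vib", "install", "-v", vib_file]]


if __name__ == "__main__":
    # Dry run: print the commands instead of running them.
    for cmd in option2_commands(vib_to_remove="net-ixgbe"):
        print(" ".join(cmd))
    for cmd in option2_commands(vib_file="/tmp/ixgbe-1.5.3.vib"):  # hypothetical path
        print(" ".join(cmd))
```

Whichever variant is used, the host is rebooted only after the driver change, as described in Option 2.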

VMware: “The development is already in the process of investigating the problem more closely and will contact the manufacturers in this regard to work out a fix”

Conclusion:

  • All “vmklinux” drivers under 6.5 U1 can cause a PSOD during operation.
  • The driver with version 1.4.1 (native) causes the PSOD on reboot.
  • The only driver known to work without problems is 1.5.3.
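
To make the conclusion easier to check on a host, here is a minimal Python sketch that classifies the native ixgben driver from the text printed by "localcli software vib list" (sample lines of that output appear in the comments below). It assumes whitespace-separated columns with the VIB name first and the version second, and that the driver version is the part of the version string before the first dash. The function names and the labels "known-good" / "known-bad" are my own, not something from VMware.

```python
from typing import Dict, Optional

KNOWN_GOOD = "1.5.3"  # the only version reported to work without problems
KNOWN_BAD = "1.4.1"   # the native version reported to cause the PSOD


def parse_vib_line(line: str) -> Optional[Dict[str, str]]:
    """Split one 'vib list' line into name and version (None if too short)."""
    parts = line.split()
    if len(parts) < 2:
        return None
    return {"name": parts[0], "version": parts[1]}


def classify_ixgben(listing: str) -> str:
    """Return 'known-good', 'known-bad', 'unknown' or 'not-installed'."""
    for line in listing.splitlines():
        vib = parse_vib_line(line)
        if vib is None or vib["name"] != "ixgben":
            continue
        # '1.4.1-2vmw.650.1.26.5969303' -> '1.4.1'
        base = vib["version"].split("-", 1)[0]
        if base == KNOWN_GOOD:
            return "known-good"
        if base == KNOWN_BAD:
            return "known-bad"
        return "unknown"
    return "not-installed"


if __name__ == "__main__":
    sample = (
        "net-ixgbe 4.5.1-1OEM.600.0.0.2494585 INT VMwareCertified 2017-09-12\n"
        "ixgben 1.4.1-2vmw.650.1.26.5969303 VMW VMwareCertified 2017-09-12\n"
    )
    print(classify_ixgben(sample))
```

A host reporting "known-bad" would be a candidate for the workaround before it is rebooted.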

Note: There is a third option: recreate a custom ESXi ISO with ixgbe version 1.5.3 (I did not test this, but I may do so in the next few days).

So either we wait for VMware to deliver a fix for this ixgbe driver, or we apply the workaround. My recommendation is to apply the workaround (Option 2). The problem with Option 2 is that, if you have many ESXi hosts, performing these tasks can be hugely time-consuming. If I have time in the next few days, I will try to create a PowerCLI script to perform this task on all ESXi hosts.
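
Until such a script exists, the repetitive part can at least be prepared as a dry run. The sketch below is Python rather than PowerCLI and only builds the ssh command line per host, without running anything. The host names are hypothetical, the function name ssh_commands is made up, and SSH access as root on every host is an assumption.

```python
import shlex
from typing import Iterable, List


def ssh_commands(hosts: Iterable[str], remote_cmd: List[str],
                 user: str = "root") -> List[List[str]]:
    """Build one ssh argv per host that runs remote_cmd on that host."""
    # shlex.join (Python 3.8+) quotes the remote command safely as one string.
    remote = shlex.join(remote_cmd)
    return [["ssh", f"{user}@{host}", remote] for host in hosts]


if __name__ == "__main__":
    hosts = ["esx01.example.local", "esx02.example.local"]  # hypothetical names
    check = ["esxcli", "software", "vib", "list"]
    for argv in ssh_commands(hosts, check):
        # Dry run: print what would be executed.
        print(shlex.join(argv))
```

Printing the commands first makes it easy to review the list of hosts before anything is changed on them.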

©2017 ProVirtualzone. All Rights Reserved

By | 2018-09-20T18:21:50+02:00 September 18th, 2017|VMware Posts|13 Comments

About the Author:

I have over 20 years of experience in the IT industry. I have been working with virtualization for more than 15 years (mainly VMware). I recently obtained certifications including VCP DCV 2022, VCAP DCV Design 2023, and VCP Cloud 2023. Additionally, I hold VCP6.5-DCV and VMware vSAN Specialist, have been vExpert vSAN, vExpert NSX, and vExpert Cloud Provider for the last two years and vExpert for the last 7 years, and have an old MCP. My specialties are Virtualization, Storage, and Virtual Backup. I am a Solutions Architect in the areas of VMware, Cloud, and Backup / Storage. I am employed by ITQ, a VMware partner, as a Senior Consultant. I am also a blogger and owner of the blog ProVirtualzone.com

13 Comments

  1. Eric 21/09/2017 at 10:32

    So we had the same problem on DL360p Gen9 with dual 560SFP+ (intel 82599)

    The problem is related to both the ixgbe and ixgben drivers being installed and the wrong one being loaded.

    Step1: You update the host to the HPE Custom ISO 6.5U1 (27/07/2017).
    After the update you connect via SSH and check which driver is installed by HPE:

    localcli software vib list | grep -i ixgb
    net-ixgbe 4.5.1-1OEM.600.0.0.2494585 INT VMwareCertified 2017-09-12

    If you install the rollup right now you will end up with a PSOD.

    Step2: Remove the sfcb-intel CIM provider:
    /etc/init.d/sfcbd-watchdog stop
    esxcli software vib remove -n=intelcim-provider
    /etc/init.d/sfcbd-watchdog start

    3) Start remediation. The host will reboot.

    4) Connect via SSH and look at the installed drivers
    localcli software vib list | grep -i ixgb
    net-ixgbe 4.5.1-1OEM.600.0.0.2494585 INT VMwareCertified 2017-09-12
    ixgben 1.4.1-2vmw.650.1.26.5969303 VMW VMwareCertified 2017-09-12
    ==> take a look at the supported driver: http://partnerweb.vmware.com/comp_guide2/detail.php?deviceCategory=io&productid=21839&releaseid=367&deviceCategory=io&details=1&VID=8086&DID=10fb&SVID=103c&SSID=17d3&page=1&display_interval=10&sortColumn=Partner&sortOrder=Asc

    In our case, the ixgben driver was causing issues in the host events (vmkernel.log) because it was trying to enable Flow Control transmit frames, which is not supported by the driver. Beware that no alarm is triggered in vCenter (nor in Veeam ONE).
    every 15s: (unsupported) Device 10fb does not support flow control autoneg

    5) Enable the good driver, disable the wrong one
    esxcli system module set -e=true -m=ixgbe
    esxcli system module set -e=false -m=ixgben
    /sbin/auto-backup.sh
    6) reboot host

    7) Optional: reinstall the sfcb Intel CIM provider.

    • Luciano Patrao 21/09/2017 at 12:46

      Hi Eric,

      Thank you for your reply and thank you for sharing your process.

      VMware's statement is that the ixgbe driver that bypasses this issue is 1.5.3.
      I did not test the sfcb, but I think the issue is not associated with it.

      Regarding enabling Flow Control, I remember there was a similar issue in 5.0 or 5.5. I don't know if this is the same one.

      The problem with all the workarounds is that, if we apply the VUM vSphere 6.5 Update 1 Rollup, the issue returns because it replaces the drivers. After we apply the update there is no way to stop the reboot.
      So the only option is, as VMware told us, to apply the workaround and not apply the rollup.

      In my last communication, VMware said: “The development is already in the process of investigating the problem more closely and will contact the manufacturers in this regard to work out a fix”

  2. Eric 21/09/2017 at 16:09

    The resolution steps I posted are what VMware support provided to me.

    Removing the sfcb-intelci CIM provider allows you to reboot after the rollup update (because it is this process that triggers the PSOD when attempting an action not supported by the active driver, as shown on your PSOD screenshot).

    Then you can activate the ixgbe driver provided by HPE, in the case of the 6.5U1 custom ISO: net-ixgbe 4.5.1-1OEM.600.0.0.2494585

    • Luciano Patrao 21/09/2017 at 17:53

      Well, that is different from what they replied to me in the support ticket.

      But I will test it, and if it works in our case, I will update the article with it. Of course I will credit you by name for the solution 😉

  3. Phil 09/10/2017 at 08:48

    Good Morning,

    Did you hear anything new from VMware?
    The problem still occurs with the HP ISO VMware ESXi 6.5.0 Update 1-5969303-HPE-650.U1.10.1.3.3-Oct2017 and the update to build 6765664.
    Thanks and best regards

    • Luciano Patrao 11/10/2017 at 03:07

      Hi Phil,

      I did not get any update from VMware regarding this issue. And honestly, I didn't have much time to test some of the workarounds. For now we have not applied the update or upgraded our production systems. But it is still on my list to do some tests.
      On which HP servers did you install the latest HPE 6.5 ISO?

      Thank You

      Luciano Patrao

  4. Dominik Zorgnotti 16/10/2017 at 13:55

    Hey folks,

    I do not want to take credit for your groundwork but in my case a simpler way worked (relates to your option 2):

    1) Install the ixgben-driver 1.5.3 before you do anything else (no need for uninstallation of the old one)
    2) Reboot and verify operations
    3) Apply patches, including rollup to ESXi 6.5 U1

    The native driver will force VUM to omit the ixgbe stuff in the updates.
    I am currently in the verification phase to see if this brings any new errors.

    localcli software vib list | grep -i "ixg"
    ixgben 1.5.3-1OEM.600.0.0.2768847 INT VMwareCertified 2017-10-16
    net-ixgbe 4.5.1-1OEM.600.0.0.2494585 INT VMwareCertified 2017-07-27

    localcli system version get
    VersionGet:
    Product: VMware ESXi
    Version: 6.5.0
    Build: Releasebuild-6765664
    Update: 1
    Patch: 29

    If I find time I’ll write a more detailed blog about it, until then I hope someone can use this info.

    • Luciano Patrao 16/10/2017 at 14:21

      Hi Dominik,

      We are all here to share knowledge, so no problem; all information and examples are always welcome.

      Regarding your example, that was one of my first tries and it did not work. After I applied the rollup I got the PSOD. So that workaround did not work for me.

  5. Dominik Zorgnotti 16/10/2017 at 18:55

    Thanks for the information, I will try to update some more hosts over the next few days to find a common denominator.

  6. Maciek 03/11/2017 at 09:58

    There is a new ISO released today (2017-11-03) which may help resolve this issue:
    VMware-ESXi-6.5.0-Update1-6765664-HPE-650.U1.10.1.5.26-Oct2017.iso

    • Luciano Patrao 03/11/2017 at 14:10

      Hi Maciek,

      Thanks, I had noticed.

      I plan to test next week whether it really fixes the issue.

      Thank You again for the update.

  7. Mike 05/11/2017 at 19:47

    I just tried the new image and the host will now run for 1.5 minutes and then PSOD.

    HP ML110 w/ P400 controller running esxi off USB

    • Luciano Patrao 06/11/2017 at 22:27

      Hi Mike,

      Thanks for your update.

      I will only be able to test next weekend, since that is when we can have some downtime for some of our systems.
