
vSphere 7 Update 2 Loses Connection with SD Cards: A Workaround


If you are one of the unlucky ones hit by this issue, I feel your pain. If you upgraded from vSphere 6.7 to 7 Update 2, or installed vSphere 7 Update 2 on an SD card, you either already have this huge problem or will get it soon.

Note: This problem is not related to the new vSphere 7 partition layout, or to the /scratch or /coredump partitions when running on an SD card. The VMKernel.BOOT.allowCoreDumpOnUsb issue that we had in vSphere 7 U1 was fixed in U2 and is not related to this one.

For information about those, see KB2077516 and KB83376, or the What's New notes.

More information about this issue: KB83963 and KB83782.

Note: Some unofficial statements from VMware employees say that a new patch to fix this issue (and others) will be launched on the 15th of July. Let us hope so.

I moved all our scratch and coredump partitions to a storage datastore, so none of those partitions or logs run on the SD cards.
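For reference, that relocation can be sketched with standard ESXi shell commands. The datastore name and file names below are placeholders, not the ones from my environment:

```shell
# Run on each ESXi host (ESXi shell). Paths and names are examples only.

# Point /scratch at a folder on persistent storage; takes effect after a reboot.
vim-cmd hostsvc/advopt/update ScratchConfig.ConfiguredScratchLocation string \
  /vmfs/volumes/datastore1/.locker-esxi01

# Create and activate a coredump file on a datastore instead of the SD card.
esxcli system coredump file add --datastore datastore1 --file esxi01-coredump
esxcli system coredump file set --smart --enable true
```

Both settings survive reboots, so the SD card stops receiving scratch and coredump I/O permanently.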

What is the problem?

When vSphere 7 Update 2 runs on an SD card, it simply loses the connection to the card and the ESXi host freezes. Since there is no access to the SD card and the system partitions, the ESXi host hangs and it is impossible to do anything. All VMs continue to run, but you cannot power them down or up, and no migrations are possible.

The ESXi host then reaches 100% CPU, and all VMs take a huge performance hit. Since it is impossible to power off or migrate any VMs, the only option is to hard-reset the server and let HA restart the VMs on another ESXi host.

As everyone knows, doing this in a Production environment with hundreds of VMs has a huge impact on the company and its running systems and applications.

This is the first thing you see when your ESXi host has the issue.

[screenshot]

Checking the logs, you will see a lot of errors.

Particularly this one: NMP device "mpx.vmhba32:C0:T0:L0" state in doubt; meaning that the SD card is no longer available to the system.

Shortly after, you will lose the ESXi host.

VMware first stated that it was a vendor problem (directing me and others to troubleshoot in the wrong place). Wrong: this is a VMware issue, and it seems to be the vmkusb driver that triggers the problem.

This is one of the worst bugs I have seen in VMware in a long time, and also one of the worst issues I have had to troubleshoot to find a root cause. I have spent more hours on it than I can count.

Initially, I thought this was a problem with HPE (I only had this issue on HPE servers, since my Dell servers run vSphere on local SSDs), with the controller driver (VIBs) or even the iLO driver (which we had updated a week or so before).

I replaced them with these updated VIBs:

  • Broadcom-ELX-lpfc_12.8.329.1-1OEM.700.1.0.15843807_17657023.zip
  • fc-enablement-component_700.3.7.0.5-1_17477831.zip
  • hpessacli-component_5.10.45.1-7.0.0_17771110.zip
  • ilo-driver_700.10.7.0.6-1OEM.700.1.0.15843807_17481969.zip
  • sutComponent_700.2.8.0.20-0-signed_component-17782108.zip

I replaced some VIBs from HPE, applied new updates, and thought I was good and the problem was fixed. Wrong!

The problem is that the issue needs many hours to trigger. It can take 24/48h to return, so after you apply a solution or workaround, you have to wait to see whether the system breaks again or stays good.

In the forums, we see many people with this problem (on both HPE and Dell servers) and many systems down because of it. In our case, we have almost 40 servers affected. It is a huge problem with a huge impact on us.

VMware now states that this problem was not introduced in vSphere 7 Update 2 but also existed in U1. Well, I think that statement is not 100% true. Our systems ran without a problem on U1; only when we updated to U2 did the problems start. And reading the forums, everyone says the same: with U1, no one had this problem. It only started after applying U2.

We could roll back to U1 and be all good. But the problem is that when you apply new VIBs and updates, some of them change the build number, and then you can no longer roll back: you will not land on U1 but on the initial U2 build.

In your case, if you did not apply any new drivers or updates and you can still roll back to U1, do it. It will save you a lot of headaches. Check HERE how to do that.

So the only real solution, for now, is to reinstall all ESXi hosts with vSphere 7 U1. For environments with 100 or more ESXi hosts this is of course a huge problem; even for my 40 it is a problem. I have already reinstalled a couple to check the behavior, including one with a fresh vSphere 7 Update 2 install (since all of our ESXi hosts had been upgraded from 6.7).

So, let us talk about the workaround.

Disclaimer: Please perform these tasks carefully and always test before you do anything in your production environment.

The following workaround is the only way I have found to bypass this issue without reinstalling the ESXi host.

First, when an ESXi host is frozen and you cannot do anything, to be able to migrate the VMs without an outage, log in to the ESXi host console and run: esxcfg-rescan -d vmhba32 and then esxcfg-rescan -a vmhba32.

You will need to run the first command a couple of times until it finishes without an error.

Give it some minutes between each retry. Be patient and try again after 2/5m.

After all the errors are gone and the command finishes cleanly, you should see in the logs that "mpx.vmhba32:C0:T0:L0" was mounted in rw mode, and you should be able to do some work on the ESXi host again.
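Put together, the recovery steps above look like this. This is a sketch only: it assumes esxcfg-rescan returns a non-zero exit code when it fails, and that vmhba32 is your SD/USB adapter (check yours before running anything):

```shell
# Recover access to the SD card on a frozen host (run in the ESXi shell).
# The delete-rescan usually fails a few times before it succeeds,
# so retry it with a pause between attempts.
while ! esxcfg-rescan -d vmhba32; do
  echo "esxcfg-rescan -d failed; waiting 5 minutes before retrying..."
  sleep 300
done

# Once the delete-rescan succeeds, rescan to remount the device read-write.
esxcfg-rescan -a vmhba32
```

After this, check the logs for the "mounted in rw mode" message before doing anything else on the host.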

If you still have some issues, restart the management agents.
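For reference, the management agents can be restarted from the ESXi shell with the standard commands:

```shell
# Restart the host agent and the vCenter agent on the ESXi host.
/etc/init.d/hostd restart
/etc/init.d/vpxa restart
```

Expect the host to briefly disconnect from vCenter while the agents come back up.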

You should then see this error on the ESXi host.

[screenshot]

After this, you should be able to migrate your VMs to another ESXi host and reboot this one. Until it breaks again in 24/48h.

Once all your ESXi hosts are running without the issue and you see no quick stats error on the ESXi host's Monitor tab, you can apply a workaround proposed by VMware.

This is to enable the VMware Tools RAM disk setting, or create it if it doesn't exist (after the issue hits and the SD card is remounted, the setting can disappear).

You can check it by running: esxcli system settings advanced list -o /UserVars/ToolsRamdisk

The Int Value is 0 (disabled) by default, and we need to set it to 1 (enabled). You can do that with: esxcli system settings advanced set -o /UserVars/ToolsRamdisk -i 1

[screenshot]

But if you have many ESXi hosts (dozens or hundreds), doing this manually is not practical. So I created a small script that does it for you.

The script checks whether the RAM disk option exists; if not, it creates it. It then checks whether it is set to enabled and, if not, enables it. It also writes a small text file tracking what was changed and on which ESXi hosts.

Note: The actions are commented out. Run the script first and check the file and the information it collected. If you are OK with that, uncomment the actions and rerun it.
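My script itself is not reproduced in this post. As an illustration only, here is a minimal per-host ESXi shell sketch of the same logic; the tracking-file path is made up, and the esxcfg-advcfg creation command is the one a reader shared in the comments below:

```shell
#!/bin/sh
# Sketch: make sure /UserVars/ToolsRamdisk exists and is enabled on this host.
LOG=/tmp/toolsramdisk-changes.txt   # hypothetical tracking file

# Create the advanced option if it does not exist on this host.
if ! esxcli system settings advanced list -o /UserVars/ToolsRamdisk >/dev/null 2>&1; then
  esxcfg-advcfg -A ToolsRamdisk --add-desc "Use VMware Tools repository from /tools ramdisk" \
    --add-default "0" --add-type 'int' --add-min "0" --add-max "1"
  echo "$(hostname): ToolsRamdisk option created" >> "$LOG"
fi

# Enable it (set the Int Value to 1) if it is still disabled.
CUR=$(esxcli system settings advanced list -o /UserVars/ToolsRamdisk \
      | awk '/^ *Int Value/ {print $NF}')
if [ "$CUR" != "1" ]; then
  esxcli system settings advanced set -o /UserVars/ToolsRamdisk -i 1
  echo "$(hostname): ToolsRamdisk set to 1" >> "$LOG"
fi
```

Run it once per host (or push it out with your tooling of choice) and review the tracking file before trusting the result.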

But honestly, I cannot say 100% that this workaround works on all systems. I have applied it and see no issues for now, but I would not be surprised if the issues return after more than 24h.

So it is OK for now; let us hope this workaround works on all systems.

UPDATE 24/05/2021:

Just a quick update on this issue. After applying my workaround to all 30 ESXi hosts, one ESXi host had the issue again after 4 days. I need to double-check why this one, and whether I missed anything on it.

Also, I would like to note that I am not using the new vmkusb VIB driver that VMware support is providing (I have not seen any feedback that this driver fixes the issue).

I have systems where I applied the workaround running both driver versions.

ESXi hosts where I applied the latest VMware updates are using: 0.1-1vmw.702.0.0.17867351
ESXi hosts with no updates are using: 1-1vmw.702.0.0.17630552 (from Update 2 only)

The vmkusb vib driver provided by VMware support is: 0.1-2vmw.702.0.20.45179358

UPDATE 05/06/2021:

After a couple of weeks on this issue: the workaround fixes it, but only until someone tries to upgrade VMware Tools. That triggers the issue immediately.

I asked the Linux and Windows teams to upgrade VMware Tools on some VMs on a specific ESXi host, as a test, and 24h later, voilà, that server had the issue.

ESXi hosts will hit the issue 20/24h after a VMware Tools upgrade is attempted, even with ToolsRamdisk set properly as above. Then we again need to run esxcfg-rescan -d vmhba32 to fix the ESXi host, and then reboot.

More than a month after vSphere 7 Update 2 was launched, VMware has still not delivered a fix for this huge problem.

So from what I have seen so far, the issue lies in the vmkusb VIB driver, but also involves the ToolsRamdisk.

Latest UPDATE 05/07/2021:

As I have already stated in my comments on this blog post, these are the tasks that I notice can trigger the issue faster:

    • Upgrading VMware Tools in a VM.
    • Deploying an OVA/OVF appliance on these ESXi hosts.
    • Backing up VMs with our Veeam tool.
    • Using vCloud Director on these ESXi hosts: vCD tasks and processes trigger the issue faster.

If the above tasks are performed, we see the issue in less than 24h on one (or more) of the ESXi hosts in the cluster.

If these kinds of tasks are not performed, we get 1 or 2 occurrences per week in some clusters (2 or 3 in others).

Lastly, in a couple of vSAN clusters we rarely see this issue. Why that is, I have not yet been able to understand.

The only thing we can do, besides reinstalling all ESXi hosts with vSphere 7 Update 1, is wait for the fix from VMware. It was promised to be addressed in Update 3, but I think they will release a fix before then. I hope so.

I hope this blog post is useful for troubleshooting and working around this big problem with vSphere 7 Update 2.

Meanwhile, you can check a good blog post from David about vSphere 7 Update 2, all the changes to the partitions, and this issue as well.

Share this article if you think it is worth sharing. If you have any questions or comments, comment here or contact me on Twitter.

©2021 ProVirtualzone. All Rights Reserved
By Luciano Patrao | May 21st, 2021 (last updated 2021-07-05) | VMware, vSphere | 177 Comments

About the Author:

I have over 20 years of experience in the IT industry, working with virtualization for more than 10 years (mainly VMware). I am an MCP, VCP6.5-DCV, VMware vSAN Specialist, Veeam Vanguard 2018/2019, vExpert vSAN 2018/2019 and vExpert for the last 4 years. My specialties are virtualization, storage, and virtual backups. I work for Elits, a Swedish consulting company, and am allocated to a Swedish multinational networking and telecommunications company as a Tech Lead, acting as a Senior ICT Infrastructure Engineer. I am a blogger and the owner of the blog ProVirtualzone.com.

177 Comments

  1. Lou Corriero 21/05/2021 at 16:49 - Reply

    There is a patch for this if you contact support.

    • Luciano Patrao 21/05/2021 at 17:10 - Reply

      Sorry, I have been in contact with VMware support for some days now; for this particular issue there is no fix. They have a new vmkusb driver that we can install, but the issue still comes back.

    • Mike 29/07/2021 at 20:01 - Reply

      To be exact, this is what you’ll get from VMWare as of July 28, 2021.

      This is XXXX from VMware Technical Support and I will be assisting you in the support request #XXXXXXXXX.

      Kindly note that this issue is top priority for our engineering team and it will be fixed in the next release which is 7.02P03. Unfortunately, we don’t have the exact ETA but the engineering confirmed that it should be very soon.

      You can subscribe to this kb article to be notified of the release: https://kb.vmware.com/s/article/2143832.

      As for now until the release, to workaround this issue you can use the commands below if the issue happened again:
      esxcfg-rescan -d vmhba32 and then esxcfg-rescan -a vmhba32 (You will need to run the first command a couple of times until it finishes without an error.)

      Then perform a restart of the management agents by running the following commands:
      /etc/init.d/hostd restart
      /etc/init.d/vpxa restart

      Another workaround is that, if the previous version that was installed on the host was ESXi 7.0 U1 we can roll back to that version and the issue should go away.
      To roll back, please check this KB article: https://kb.vmware.com/s/article/1033604?lang=en_US

      Please let me know if you have any question and if any assistance is required from our end.

      • Luciano Patrao 29/07/2021 at 20:23 - Reply

        It makes me smile, this reply and the "solution" that VMware provided, almost a month after I wrote it here 😉

        • Mike Stone 03/08/2021 at 20:07 - Reply

          Also funny, in describing my issue I simply noted the url to your forum post. LOL

          • Luciano Patrao 03/08/2021 at 20:12

            Well, it is not the first time that VMware support has proposed my blog as a possible solution.
            At the beginning of the year they replied to one of my support tickets with a blog post that I had written. I had to remind them that I wrote it 🙂

            A couple of years ago they did the same for an issue I had with vCenter: I had found a workaround, and they replied that there was no solution yet but there was a workaround, and sent me my own blog post 🙂

            It is OK by me, as long as it helps people with issues. But they should have their own solutions, I think 😉

  2. Lou Corriero 21/05/2021 at 16:51 - Reply

    Hi Luciano, we are actually a VMware Cloud Partner Provider and this hit us really hard. I would like to know if we can connect to discuss this issue?

  3. Lou Corriero 21/05/2021 at 16:54 - Reply

    Cause:
    As of 7.0 Update 1, the format of the ESX-OSData boot data partition has been changed. Instead of using FAT it is using a new format called VMFS-L. This new format allows much more and faster I/O to the partition.

    This ESX-OSData partition is where frequent data is written and combines the product locker and scratch log partitions which were used in previous versions of ESXi.

    This partition is more commonly seen as the /scratch partition.
    • The level of read and write traffic is overwhelming and corrupting many less capable SD cards.
    • The current versions of ESXi OS 7.0 Update 1/Update 2 are no longer throttling I/O to local boot drives.
    • The advanced setting to throttle I/O to local boot drives has been removed in 7.0 U1/U2
    Please feel free to get back to me if you have any questions.

    • Luciano Patrao 21/05/2021 at 17:15 - Reply

      This is not the same issue. That issue was fixed in U2.

      This issue happens in new installations and in upgrades, as long as you use U2. If you are using U1 you will never see this problem. The /scratch and /coredump situation is a different issue that can be fixed by moving the partitions and using the allowCoreDumpOnUsb boot option.

  4. Lou Corriero 21/05/2021 at 17:17 - Reply

    They did that to us as well and we finally were put in contact with the storage team and they provided new bootbank vmkusb drivers for U1 and U2 – this throttled the IO and resolved our issue; however, we did have to replace a lot of SD cards that were corrupted. Also, the patches had to be installed fresh on some “broken” hosts.

    • Luciano Patrao 21/05/2021 at 17:25 - Reply

      On some of the broken ones, I reinstalled with U1: no issues with the SD cards, no card corruption, and no issues with vmkusb (on DL360 G9 and BL430 G9). Only if I use U2.

      So for me this is not an SD corruption issue, because the SD card is good (it works fine with U1); it is the driver. Yes, the SD card can get corrupted in the process, but that is not the main issue here.

      With the workaround I explain here, I don't see any issues so far. So I will wait until Monday to be 100% sure that this worked.

  5. Sergey U 24/05/2021 at 13:43 - Reply

    Big thanks for your work and workaround!
    I have been struggling with this problem for a long time on Dell servers, and a server can run for at least 2-3 days before the problem occurs. For example, a server froze today with 30 days of uptime. I spent a lot of time updating all the drivers/firmwares and so on.
    I use a Dell custom image, build 17630552

    Recently VMware released 7.0U2a.
    Have you tried this image? HPE Custom Image for ESXi 7.0 U2 Install CD 2021-05-18

    I’m waiting for when it comes out for Dell

    • Luciano Patrao 24/05/2021 at 17:23 - Reply

      Hi Sergey,

      Thanks for your message.

      Yes, I did a fresh install from that VMware-ESXi-7.0.2-17867351-HPE-702.0.0.10.7.0.52-May2021 image. No issues on that server. But I also applied the workaround on it, so honestly I cannot say whether the image alone fixes the issue. Since I am tired of this issue, I will not make any more changes that could break my servers again. The workaround is working, so until I have issues again, or VMware publishes an official update, I will not touch my servers any further.

      There is also a new vmkusb version that VMware provides when you open a support ticket, VMW_bootbank_vmkusb_0.1-2vmw.702.0.20.45179358, which they say will fix it. But I have not tried it and, honestly, I will not for now. I will wait for some feedback, or for an official update with that new VIB driver.

      If you need any help, you can send me an email and I will try to help you.

  6. Oliver Antwerpen 24/05/2021 at 22:51 - Reply

    I am facing the issue in HPE Custom ISO fpr 7 Update 2a and Synergy SSP2021.05.01. We have opened a Case at HPE and I think we will downgrade to 7 Update1 latest until a fix is published.

    • Luciano Patrao 24/05/2021 at 23:03 - Reply

      Hi Oliver,

      If you still can, then yes, that should be your first option. Better than applying workarounds or drivers that are not yet 100% tested.

  7. David Onken 25/05/2021 at 16:19 - Reply

    Support provided an older vmkusb .vib (701.0.0.44485813) and this has resolved our issues (along with the ramdisk changes mentioned). Obviously we’re not proceeding further with 7.0.2 until this is completely addressed by VMware.

    • Luciano Patrao 25/05/2021 at 16:50 - Reply

      Hi David,

      Thanks for the update and for sharing the information. All information is very important in this issue.
      Did you apply this old version on your U2? If yes, how long have they been working without any issues?

      Thanks again for the update.

      LP

      • David Onken 25/05/2021 at 17:02 - Reply

        Yes, we applied it to U2. Not ideal situation but two weeks now and no issues. PS: We have HPE hardware.

  8. Iwik 31/05/2021 at 09:22 - Reply

    What is strange is that we have seen this issue on HPE running 6.7. It started two patches ago. It is not so frequent; it appeared after 20-30 days and on 4 servers. We opened a VMware support case and the verdict was hardware failure. I don't think so.

    • Luciano Patrao 02/06/2021 at 08:55 - Reply

      Hi,

      I remember there was a similar problem back in 6.5, I think (I believe I wrote something about it). We have a lot of 6.7 hosts and did not see this issue on any of them, and I have not seen anyone in the forums having it with 6.7.
      So I don't think your issues are related to this particular one.

  9. Javier Flores 01/06/2021 at 10:35 - Reply

    Luciano, thank you very, very much for putting all this information online. It saved us from a big outage today.

  10. Kris 02/06/2021 at 17:43 - Reply

    We have a Cisco UCS and updated to 7.0 U2 last week and the issue came up yesterday. Already had the scratch disks moved so trying the Ramdisk workaround. Thank you for posting this information. We were able to keep our production workload online even when the host showed disconnected and recovered. Any update on a new driver?

    • Luciano Patrao 03/06/2021 at 14:46 - Reply

      Hi Kris,

      Until now I don’t have any updates. Still waiting for VMware to launch vSphere 7 update 2b (I hope). Since I have heard that they plan to launch update 3 in August and I hope they do not wait until then to launch a fix for this that long.

  11. Jean-Sebastien D'Amours 02/06/2021 at 22:26 - Reply

    I had that issue on May 21st with a brand-new Dell R740, freshly installed with ESXi 7.0 U2a 24h earlier. The problem occurred during the offline migration of a VM. I did not understand what was happening; I thought I was unlucky and had a faulty PERC controller, bad disks, or some other hardware problem. vSphere completed the migration of the VM and removed it from the source server. Once done, I tried to start it, but everything seemed frozen. The VM finally started, but when I shut it down, nothing happened. I tried to force-stop it, but the options became greyed out. I tried to browse the datastore, and the cursor just kept spinning with nothing showing. I tried to restart the host via the vSphere Client: nothing happened. I tried to restart from the ESXi console, and a message appeared saying I didn't have the rights to do that... Finally I held the power button until it powered off, and restarted it. At that moment, I was not sure my new server was any better than the old one. I had doubts!

    Once things returned to normal, I found my migrated VM "orphaned". I tried to reintegrate it into vCenter, and it disappeared from vCenter and from the datastore. I had to restore it from my backups.

    Finally, I completed the migration without any problems over the weekend. On Tuesday the 25th, Veeam was performing backups and the whole process froze. I looked into it to understand what was happening (I have been using that software since 2014 without any issue), and it turned out the problem was back. Searching on Google, I found your blog with a description of the problem I had. I was reassured to learn it was not a hardware problem, and to find the cause and the workaround command lines.

    Thank you very much for that post.

    • Luciano Patrao 03/06/2021 at 14:50 - Reply

      Hi Jean,

      I am glad to help.

      Regarding the orphaned VM: it should still exist in the datastore. If you browse the datastore looking for that VM's folder/files, you should find them and can add the VM back to the inventory (removing the orphaned entry first). Since you restored from backup, you should double-check your datastores so you don't have trash files (orphaned VMs that are no longer used) consuming space in your storage. Veeam ONE has a good feature for finding these trash files.

  12. Jeff Creek 03/06/2021 at 14:48 - Reply

    FYI –
    esxcli system settings advanced list -o /UserVars/ToolsRamdisk
    Unable to find option ToolsRamdisk

    Dell hardware
    VMware-VMvisor-Installer-7.0.0.update02-17867351.x86_64-DellEMC_Customized-A03.iso

    So, rolling back to U1 may be my only option.

    • Luciano Patrao 03/06/2021 at 14:54 - Reply

      Hi Jeff,

      If you read the script, that is what it does: it checks whether the option exists and, if not, creates ToolsRamdisk. But like I said, if you have the option to roll back, I would do it.

      • Jeff Creek 03/06/2021 at 15:10 - Reply

        Hi Luciano,

        This is preprod. So, I plan on going back to 7u1.

        Thanks!

  13. Jeff Creek 03/06/2021 at 14:50 - Reply

    Looks like you have to add it yourself.

    https://kb.vmware.com/s/article/83782

    • Luciano Patrao 03/06/2021 at 14:56 - Reply

      In my infrastructure only one ESXi host did not have ToolsRamdisk; on the rest it was already there, with no need to add it.

  14. Jone Merakus 03/06/2021 at 20:47 - Reply

    Thanks for this! We encountered the problem yesterday after updating this weekend to 7 U2.

    What exactly does creating the RAM Disk do however? I’m a little confused about *why* that resolves the issue.

    • Luciano Patrao 03/06/2021 at 22:14 - Reply

      Hi Jone,

      ToolsRamdisk doesn't fix this particular issue by itself; it fixes one of the problems we encounter when using vSphere 7 Update 2.
      It is part of the set of changes and tasks we need to perform to work around the SD card issue.

      About the RAM disk: in short, you can find an explanation of what it is and what it does HERE.

  15. Jone Merakus 04/06/2021 at 19:37 - Reply

    Thank you for the explanation. As a point of clarification, do you have to reboot the host after setting the RAM disk?

    Thanks!

  16. Enrique Alonso 07/06/2021 at 10:39 - Reply

    This post is pure gold! thank you very much!
    Thanks to it I was able to bring our hosts back to life and migrate the VMs without problems.
    I had a support case with vmware and the lack of interest and effort they put into it was remarkable.

    • Luciano Patrao 07/06/2021 at 22:06 - Reply

      Thanks for the reply.

      Yes, I agree. I still do not fully understand VMware's position on this huge issue.

  17. Fred Vertuel 07/06/2021 at 15:09 - Reply

    Hello Luciano,

    Same issue on 18 Dell Poweredge r440, with Dell custom image 7.0.2a…

    The ToolsRamdisk option didn't fix the issue for at least 5 servers (I activated the option 2 days ago). I have installed the vmkusb VIB update provided by VMware and mentioned in your post. The servers with the updated VIB have been working for 2 days now, but I got warnings on those specific servers... My case is still open with VMware... Wait and see.

    Thanks for your post; it allowed me to get the faulty hosts back to life, and at least reboot them properly.

    • Luciano Patrao 07/06/2021 at 22:07 - Reply

      Mine is also open and now I receive an email that since there were no updates, it went to archive 🙁

  18. Patrick Long 08/06/2021 at 06:10 - Reply

    Luciano – this is incredibly well-written and great information that really helped me. I’m in a bit of a dilemma as I can’t roll back. I upgraded my HPE Synergy hosts to U2 (image HPE-Custom-Syn-AddOn_702.0.0.10.7.5-14) to escape an issue I was having with Synergy nodes on prior image HPE-Custom-Syn-AddOn_700.0.0.10.5.6-19 where under I/O load hosts were showing very high KAVG values – even a single vm of minimal I/O was showing >1 KAVG which should never happen. Upgrading resolved the KAVG issue but just tonight I have had my first host lose connectivity to micro-SD card boot device – which I resuscitated using your information above. I have a support call with VMware tomorrow and will advise here if I get any additional information or confirmation of the /UserVars/ToolsRamdisk workaround.

    • Luciano Patrao 09/06/2021 at 08:37 - Reply

      Hi Patrick,

      Yes, vSphere 7 Update 1 had some issues, also with the boot partitions, which is why VMware decided to change the partition structure in Update 2. A big change, and it should have been done differently, not in an update.
      What I know so far is that customers started seeing issues after vSphere 7 Update 1c. In my case, I was running that version on almost 100 ESXi hosts without any problems.

      So if you can't roll back, you have only two options here: continue with Update 2 and keep fixing the issues from time to time until there is a proper fix, or install all ESXi hosts from scratch with Update 1 or Update 1c.

      Unfortunately, that is the only choice we have at the moment.

      Thanks for your update.

  19. fvertuel 08/06/2021 at 13:01 - Reply

    Have you finally tried installing the updated VIB? If yes, what was the result on your side? Thx

    • Luciano Patrao 09/06/2021 at 08:38 - Reply

      No, I decided not to put more stress and risk on my infrastructure. And since I have been very busy, I have not even tested it on a test server.

  20. Georgi Petkov 08/06/2021 at 16:28 - Reply

    Please add this command:
    esxcfg-advcfg -A ToolsRamdisk --add-desc "Use VMware Tools repository from /tools ramdisk" --add-default "0" --add-type 'int' --add-min "0" --add-max "1"

    🙂 thanks for this you’ve saved my life 🙂

    • Luciano Patrao 09/06/2021 at 08:42 - Reply

      Hi Georgi,

      That is what my script does: it adds ToolsRamdisk if it doesn't exist on the ESXi hosts running on SD cards.

  21. Adrian James 08/06/2021 at 19:37 - Reply

    Same issue here on Cisco UCS B200 blade servers booting from SD card. Thanks for your rescan workaround; it helps recover the host enough to evacuate it. I rolled all of prod back to U1, but 4 hosts had the same U2 update on both bootbanks, so they will need a rebuild back to U1. My ticket with VMware is also going nowhere.

    • Luciano Patrao 09/06/2021 at 08:40 - Reply

      Good that you can roll back; unfortunately, I don't have that luck with all my servers.

  22. Patrick Long 09/06/2021 at 20:21 - Reply

    I am implementing the /UserVars/ToolsRamdisk recommendations now because I can't roll back to 7.0 U1 (I installed Nimble NCM after upgrading, so now both bootbanks are 7.0 U2a). It is interesting that KB 2149257, which describes this process, does NOT even mention ESXi 7.x in the "Related Versions:" section, only ESXi 6.5 and 6.7. I am interested if anyone knows whether the read operations to the SD card to access VMware Tools could still be an issue for installations that have redirected productLocker to a shared storage location, as we have (seen in ls -n : productLocker -> "/vmfs/volumes//SharedLocker"), or whether that method is an equally effective alternative to loading VMware Tools into a ramdisk in terms of reducing I/O to the SD boot media. Another question I have is about the new DRS mechanism using the vCLS VMs (over which you as the vSphere admin have NO control over initial deployment location): perhaps these system-generated DRS VMs are ending up on the SD card boot media somehow? See https://www.yellow-bricks.com/2020/10/09/vmware-vsphere-clustering-services-vcls-considerations-questions-and-answers/ . I find these VMs littered throughout my cluster's datastores, even those meant for short-term existence like mounted SAN snapshot datastores. Super frustrating that there is no option to limit the scope of these deployments via a "Use datastores only from the specified list", similar to the HA heartbeat datastores selection policy.

    • Luciano Patrao 10/06/2021 at 16:33 - Reply

      Hi Patrick,

      For ToolsRamdisk and moving the /scratch partition to a datastore: unfortunately, yes, we need both. All my ESXi hosts have their locker set in a folder on a datastore, and I still needed ToolsRamdisk implemented.
      Like I said in my article, if you are using SD cards for your ESXi installation, you should always move the .locker to a datastore so that you don't have issues (regardless of the Update 2 issue).

      As for SD cards vs the DRS vCLS VMs: how could those VMs end up on SD cards? That could only happen if you created a datastore in the free space of the SD card (something you should never, never do). vCLS VMs run on shared storage, and only use a local datastore if you do not have a shared one.

  23. Patrick Long 11/06/2021 at 18:09 - Reply

    Agreed about the vCLS VMs: they could not be on the SD card unless there is a datastore there, which should never be done. My only point is that this new "feature" for DRS is not terribly well documented, so who knows what hooks, if any, there could be into a host's boot device, heartbeating it for host liveness, etc. Probably none, but I'd like to know for sure.

    On the other issue, the /scratch partition: yes, I also agree the symbolic link for scratch should ALWAYS point to a datastore backed by high-endurance media so that .locker is not on the USB or SD boot media, and I do this as well. But what I was talking about instead was the symbolic link for productLocker, which I believe is where the host looks for its local VMware Tools bits to compare against the running Tools version on the VMs it hosts. By default, in a vanilla ESXi installation, this points to the VMware Tools on the boot media. Per KB 2129825, I have changed this symlink to a folder on a datastore accessible by all my hosts, so that when a new VMware Tools is released I simply upload the new Tools bits once to the shared folder and restart the management agents on my hosts (or move VMs around to new hosts), and instantly every VM's Tools status is compared against the new version in the shared location, NOT whatever Tools version came with the ESXi build on the host the guest VM lives on. In this way, I can have hosts of various versions in the same cluster (during a round of ESXi upgrades, for example) or even entire clusters of different versions (like a cluster of Gen8 running ESXi 6.5), and no matter which host a VM moves to, its Tools version is always compared against the version in the shared location, not the local version on the host. This way VMs do not switch their VMware Tools Version Status from "Current" to "Upgrade Available" as they move between different hosts or clusters, if such a move becomes necessary.

    I’m not sure how common it is to do what I described above, but where it IS relevant to this discussion of lowering I/O to the boot device to prevent disconnection when vmTools upgrade is invoked (as you have found) is that I suspect having the vmTools on a shared datastore location is already *functionally equivalent* and just as good of a remediation as enabling /UserVars/ToolsRamdisk, since my hosts should not be hitting the local tools bits on SD for either vm tools upgrades or for tools version comparisons against running vms. I have asked GSS for their opinion on this. I will enable UserVars/ToolsRamdisk on a host and see if the productLocker symlink changes to something other than my shared datastore location…

    • Luciano Patrao 11/06/2021 at 18:32 - Reply

      Good point, Patrick, but I have never used that kind of configuration, so I cannot answer that question.
      First, it is too much manual work; second, we do not mix versions in clusters, nor should we, since that is not best practice.

      And honestly, I have never touched the symlink. If I had some time, I could test that solution to see whether I get any results when upgrading VMware Tools, but at the moment I don’t have time for that.

      I would like to finish by thanking you for your great contribution on this subject and for providing very good information.

    • Leo Kurz 20/06/2021 at 12:52 - Reply

      Any news on activating the ToolsRamdisk after redirecting the ProductLocker to a shared disk?
      I have no experience with ESXi 7, but I have installed ESXi on SD/USB devices for many years, and there’s no way of replacing all boot devices in every server just to update to version 7. From what I understand so far, redirecting the scratch partition (KB1033696) and redirecting the ProductLocker (KB2129825) to a shared (SAN) device should solve the problem. I used both until 6.7, but I’m not totally aware of the implications with the new partition layout. Perhaps someone could help/clarify:
      – Would redirecting both links with adv. settings to a capable shared disk/LUN solve the problem?
      – Would the RAMdisk be still necessary?
      – Up to now, redirecting scratch also redirected log and coredump. Is this still valid?
      – From what I understand, when you set the advanced parameter “/UserVars/ProductLockerLocation” and reboot, the redirection of the symlink is not necessary
      As I use scripts to assist deployments, the above changes would not be a major effort and would solve the problem in a supported (KB) way w/o any workarounds, downgrades or special patches from support.
      Any ideas/input?
      __Leo

      • Luciano Patrao 21/06/2021 at 17:29 - Reply

        Hi Leo,

        Lot of questions, let me try to answer.

        First, best practice says that you should always have the locker/coredump on a datastore (if you are using SD cards). That predates this U2 issue. So you should always set that.
        — Would the RAMdisk be still necessary
        Yes; in the VMware KB I don’t see that it is one or the other.

        I have on my list to create a second blog post about this issue with some updates, and also scripts that will make all these changes automatically. But time is always an issue, and I have also been a bit ill, which is why I have not replied quickly or written any big updates in the last 2 weeks.

  24. Perttu 14/06/2021 at 13:47 - Reply

    Many thanks for this blog post. It helped us a lot!

  25. Philipp Menzi 14/06/2021 at 15:59 - Reply

    I have the same problem at two customers. One has HPE hardware with SD card (upgraded from 6.7.x to the newest 7.x version), the other Cisco hardware (also upgraded from 6.7.x to the newest 7.x version). We have the same problem at both customers! No new patches are available to fix it. Thanks for your blog; I will try your fix, and hopefully VMware will release a fix soon!

    • Luciano Patrao 15/06/2021 at 11:36 - Reply

      There is no official date for update 3 with the fix. But a VMware employee wrote in the VMware Communities forum that it will be launched in July.

  26. Wal Dimer 15/06/2021 at 02:26 - Reply

    Thanks so much Luciano, we were tearing our hair out with this one.

    Just finished an argument with the VMware support team, who couldn’t understand why I thought two SD cards in two different server generations wouldn’t just start breaking within a week of each other by coincidence, and that there might be more to it. Then I found your and others’ research.

    I too cannot understand why VMware haven’t dropped everything to fix this or generate a solid workaround. Must be too busy deprecating AD auth.

    • Luciano Patrao 15/06/2021 at 11:40 - Reply

      Yes, in my opinion, not the best support handling of this issue, no.
      But I am glad that at least I am able to help some people with the workaround so they can have their systems back while we wait for the promised fix in July.

  27. Johannes Weidacher 17/06/2021 at 15:43 - Reply

    Thank you for your post and also your PowerShell script, but there is a small error:
    If ($Setting.Value -eq $false)  {

    should be
    If ($Setting.Value -eq 0)  {

    • Luciano Patrao 21/06/2021 at 17:24 - Reply

      Hi Johannes,

      Thanks for the comment. But since 0 is false and 1 is true, the comparison will evaluate to false or true either way. So you can use both; it will work.

  28. Luke 18/06/2021 at 17:42 - Reply

    We have encountered another novel error with v7.0.2: it is unable to correctly remove snapshots and consolidate disks. Every time a snapshot is created and removed, a disk consolidation warning appears and the machine must be turned off (!) to successfully consolidate. This is on (3) DL380 Gen9 servers booting from microSD, with an MSA2040 SAN. After all the problems we have read about and the SD issues, we reverted back to 6.7 and will sit and wait patiently, as we don’t really want to beta test this for VMware on production servers.

    • Luciano Patrao 21/06/2021 at 17:21 - Reply

      Hi Luke,

      When I applied U2 I got a couple of those, but since this is normal from time to time, I didn’t pay much attention to it as a U2-related issue. And I did not get any more after those initial ones.
      So I cannot state that this is an issue in U2.

    • Lukas Lang 29/06/2021 at 10:25 - Reply

      We had the consolidation issue after upgrading from 6.7 EP15 to U2a. A fresh install of ESXi directly on Update 2a resolved the issue (along with the bug where boot hangs on vmw_satp_alua loaded “successfully”).

      • Luciano Patrao 29/06/2021 at 12:33 - Reply

        Hi Lukas,

        Always monitor those U2a installations. We have at least 10x new installs with U2a and still have issues.

  29. David Pasek 22/06/2021 at 15:23 - Reply

    Hi Luciano,
    first of all, thanks for your blog post. Very informative and useful, mainly because of the workaround, which works for my customer experiencing the same issue.

    Disclaimer: I work for VMware as TAM

    You mention in your post that you are not using the new vmkusb VIB driver VMware support is providing. Is there any reason behind your decision?

    We are trying to get the new vmkusb VIB driver and validate whether it resolves the issue.

    David.

    • Luciano Patrao 22/06/2021 at 17:39 - Reply

      Hi David,

      First thanks for dropping by and for your message.

      I will give you a couple of reasons. First, my support ticket was not handled properly, nor 100% honestly. Even when I showed what the problem was, and that it was not related to a vendor or other issues, support kept wrongly pointing me down different paths, wasting time troubleshooting when the issue was completely different.
      Secondly, I have wasted so many hours on this issue – troubleshooting, testing, fixing, and finding a workaround so that at least we have a way to put servers in production without a huge outage in our running VMs – that I do not want to use a vmkusb version that is still not 100% certain to fix, or even reduce, this issue.

      Even though we have dozens of ESXi hosts, I do not have any where I can test this properly (out of production), so honestly, I will not spend many hours on this again trying to fix it, when at least I have a minimally stable environment (the issue still appears from time to time on a couple of servers, but it is manageable).

      Talking to some other customers using the new vmkusb, it did not fix the issue 100%.

      • David Pasek 22/06/2021 at 17:48 - Reply

        Thanks for the explanation. It makes perfect sense, and thanks again for your hard work, because I feel your pain.

        Anyway, I will work with my customer and VMware GSS to fully understand the root cause and the fix, because it is a really annoying issue. I’ll keep you updated.

        David.

  30. David Pasek 22/06/2021 at 15:29 - Reply

    Btw, recently I wrote the blog post “vSphere 7 – ESXi boot media partition layout changes”, which has a section about various known problems you can observe when using USB or SD media.

    I’ve referenced your blog post and your workaround. I believe you don’t mind.

    My blog post is available at
    https://www.vcdx200.com/2021/06/vsphere-7-esxi-boot-media-parition.html

    • Luciano Patrao 22/06/2021 at 17:41 - Reply

      Of course, you can. It is all about sharing content and helping. I will also include a link to your blog post in my original post.

  31. Jarrad 24/06/2021 at 03:28 - Reply

    We just got ESX 7.0.2 build 18049868 from support that contains vmkusb 0.1-4vmw
    They said public release will be U3 ~15th Jul.
    Haven’t had a chance to deploy yet due to change control lead times, so no idea if it’ll fix the problem.

  32. floritto 24/06/2021 at 10:31 - Reply

    Hi

    Thank you for this post Luciano, it helped us a lot.

    After we got no help from VMware for several weeks on this issue, other than “wait for the next release”, we escalated through management.

    Now it turns out there is a hotpatch available for this. We only learned about this after escalating, support did not tell us about it.

    The hotpatch has to be approved on a per-customer basis. We didn’t get it yet, so I can’t tell if it really fixes the problem. Just wanted to let you know there might be something available that solves this bug. Ask VMware support about it.

    • Luciano Patrao 24/06/2021 at 14:04 - Reply

      Hi floritto,

      Yes, I know they are providing some new beta releases to some customers. But honestly, I am not implementing any of those untested versions. And for sure I will not use my production environment as a beta test for VMware.

      But thanks for sharing.

  33. Patrick Long 24/06/2021 at 18:42 - Reply

    For Jarrad and anyone else with knowledge or who actually has their hands on the new ESXi build 18049868 with new vmkusb 0.1-4vmw due for public release mid-July – is there any indication of exactly HOW the issue was addressed? Does it resume the I/O throttling to USB present in prior releases or deal with the issue in some other more complex way? Were you given anything in terms of release notes or a fixlist that defines exactly HOW this problem was mitigated in the new build/driver?

    • Luciano Patrao 25/06/2021 at 15:43 - Reply

      Hi Patrick,

      Those questions, of course, need to be asked of VMware 😉

      But we hope we get more official information soon.

  34. Lukas Lang 29/06/2021 at 10:39 - Reply

    Thanks for the great post and the updates regarding this annoying issue. We have a test host installed on SD (BL460c Gen10) running about 35 VMs without any issues for 11 days now. We have redirected productLocker for many years, since there was no practical way to manage and centralize VMware Tools otherwise. Like others said before, redirecting the location of the Tools seems to reduce the I/O load heavily. Redirecting /scratch and coredump should also reduce unnecessary load. Hope they will get a fix and proper documentation on this, since our upgrade project is frozen now. In my 10-year VMware career I never experienced such a big issue, and the initial U2 release is now almost 4 (!) months old. We used to have PSODs with faulty drivers, which were fixable – not SD cards getting stuck and shot into nirvana.

    • Luciano Patrao 29/06/2021 at 12:34 - Reply

      We have in one environment 10x BL460c Gen9 and 10x Gen10; we have issues at least 2-3 times a week.

      And yes, I agree that this is one of the worst bugs I have seen since I started working with VMware. And that is since v2/2.5 😉

  35. robert 01/07/2021 at 23:54 - Reply

    I’m facing this issue as well and have a support ticket open with VMware. So far they have only provided the two rescan commands and the restart commands – a workaround, they say – but after reading this article I’m guessing those only get it out of the unresponsive state and don’t fix the problem. Just wanted to post here so I could subscribe!

    • Luciano Patrao 02/07/2021 at 23:41 - Reply

      Yes, that workaround was copied from here 😉

      No issue – it is not the first time VMware support has provided one of my blog posts as a solution. I have even opened tickets with VMware and been offered a solution that I wrote on my own blog 🙂

      PS: You don’t need to comment to subscribe. But glad that you did; all sharing is important.

  36. Matt 05/07/2021 at 12:23 - Reply

    FYI: vmware support gave me this kb and a note that a patch is expected this month: https://kb.vmware.com/s/article/83963

    • Luciano Patrao 05/07/2021 at 15:24 - Reply

      Thanks Matt, I will update the blog post with that KB.
      The information that I have is that it is to be launched around the 15th of July. Let us hope so.

  37. Alex W. 05/07/2021 at 16:10 - Reply

    Hi,

    is there any update on this? Can you share an HPE case number for reference? We have the same issues on a few ESXi hosts. Can you explain where exactly the problem comes from? We have redirected the locker folder to a local store, but the problem still comes back after 24-48 hours. If I redirect all these folders to local stores, where is the “heavy load condition”?

    Alex

    • Luciano Patrao 05/07/2021 at 22:32 - Reply

      Hi Alex,

      I don’t have any HPE ticket open, only with VMware, since it is not a vendor issue but a VMware issue.
      Changing the locker to a datastore is best practice when using SD cards, regardless of this bug.

      The latest VMware KB explains a bit about the issue.

      Also, like I stated in my blog post, upgrading VMware Tools, deploying OVA appliances, and also backups (I noticed and tested this just a week ago) can trigger the issue faster.

      I think it is because of the read/write activity on the SD cards (explained a bit in the KB).

      At the moment, after the changes that I explain in my blog post, I hit the issue 2 or 3 times per week across about 50 ESXi hosts.
      I also have a couple of vSAN clusters using BL460c Gen9 blades, with 7 servers each, and the issue is rarely triggered there. I still do not understand why (all have the same settings that I explain here).

      • Marco Corleone 06/07/2021 at 10:02 - Reply

        We have the same issue on two of three hosts (brand-new Dell servers from June 21). The trigger here is a vSphere shutdown task before backup with Veeam. The issue does not happen every time, but when it does, the shutdown task had just been started.

  38. Adam Tyler 13/07/2021 at 17:19 - Reply

    This is unreal. I have this problem too, and spent hours of my life reloading ESXi and removing VIBs I thought might be the root cause. Completely unacceptable that VMware still distributes this broken build of ESXi. I’m on VMware ESXi, 7.0.2, 17867351.

    Regards,

    Adam Tyler

    • Luciano Patrao 13/07/2021 at 22:26 - Reply

      Hi Adam,

      Yes, I feel your pain. And also, I don’t understand why VMware is still providing this version with this big issue.
      Honestly, I think they are trying to do what they have done since the beginning: blame vendors, and claim the issue in vSphere 7.0.2a is just a consequence of a vendor fault, not directly theirs. That is the only logic here, and it is also what they have said to many customers when replying to support tickets (like mine).

  39. Adam Tyler 13/07/2021 at 23:03 - Reply

    Do we know what previous build of ESXi is not impacted by this issue? I’m looking at downgrading at this point. Going to be painful, but….

    I’m on build: VMware ESXi, 7.0.2, 17867351
    it is definitely broken.

    Currently installed vmkusb vib:
    vmkusb 0.1-1vmw.702.0.0.17867351

    • Luciano Patrao 13/07/2021 at 23:16 - Reply

      It is in the blog post 😉

      Version: VMware ESXi, 7.0.1.x

      Until ESXi 7.0 Update 1c I did not see any issues.

  40. Adam Tyler 13/07/2021 at 23:30 - Reply

    Man, they have so many versions. I feel like their unpaid beta tester at this point.
    So you are saying “VMware-ESXi-7.0U1c-17325551-depot” and newer, bad?

    So release “VMware-ESXi-7.0U1b-17168206-depot” and older are good?

    Wonder how security patching figures into all of this. Like, to patch the latest exploit in the 7.x branch, do you need to be running the latest build of 7U2, or do they release security patches for the 7U1 branch?

    Regards,
    Adam Tyler

    • Luciano Patrao 15/07/2021 at 19:16 - Reply

      Apply the latest 7.0 U1 ISO using Lifecycle Manager (previously VUM) and you are OK. If you apply all the automatic patches instead of doing a manual upgrade with an ISO, you will end up on U2a.
      And then wait for U3.

  41. Rob 15/07/2021 at 15:51 - Reply

    Its the 15th! Any word on the patch release?

    • Luciano Patrao 15/07/2021 at 19:18 - Reply

      Yes, it is, and unfortunately nothing is on the horizon 🙁

  42. Jason 15/07/2021 at 17:10 - Reply

    Great article!!!! I kept running into this issue while snapshots were being taken. I thought it was an issue with my backup software. You saved me a lot of hassle going through VMware support. FYI, Dell has told us that they will not support vSphere 7 on SD cards. That probably explains the vendor finger-pointing that VMware seems to be doing, and their “lack” of enthusiasm for fixing this. Looks like both sides feel that customers should not be using SD cards, which is odd. We’ve used them for years without an issue.

    • Luciano Patrao 15/07/2021 at 19:21 - Reply

      Yes, unfortunately it is finger-pointing now between VMware and the vendors. But Dell cannot just say they do not support SD cards for vSphere 7 when they sold servers with them, and for that purpose. Yes, it is not their fault, but still, they cannot say that at this stage.
      If you can, and if your backup tool has this feature, use storage snapshots and not VM snapshots. At least you will reduce how often the issue appears.

      • Jason 23/07/2021 at 16:13 - Reply

        Okay, wanted to give an update. Our host ran successfully for 8 days after applying the workaround. Backups ran normally. No issues. Then, we started to see “Aborted Disk Commands” on the USB device from our RMM monitoring. I looked into the host. I could still login. But when I went to logs, the screen would lock up for over 5 minutes. Even when on console window on the host via the iDRAC, looking at the logs menu would lock up the screen. When the menu became responsive again, I was able to enable SSH. I ran the “esxcfg-rescan -d vmhba32” and “esxcfg-rescan -d vmhba32” commands. I did get errors and was able to clear them. VMWare logs started to display again. “NMP: nmp_DeviceStartLoop:740: NMP Device “mpx.vmhba32:C0:T0:L0″ is blocked. Not starting I/O from device.”. So, it looks like this fix is temporary. Found an article on VNinja that shows someone who came across the 8 day issue, same as me. “https://vninja.net/2021/06/01/esxi-7.0-sd-card-issue-temporary-workaround/”. Looks like we HAVE to wait for that VMWare patch, which we will be applying ASAP.

        • Luciano Patrao 24/07/2021 at 02:15 - Reply

          As I say in my blog post, my workaround is just that: a workaround to reactivate the ESXi host so you can move VMs and reboot the server. It is not a fix, not even a temporary one.
          I am tracking the issues on my ESXi hosts, and they are random. It can take 48/72h or a week for an issue to appear, but never on the same ESXi host within at least 2 weeks.
          So yes, we are all anxiously waiting for the VMware fix for this issue.

  43. Jonny 15/07/2021 at 20:21 - Reply

    Patch is postponed until late August …

  44. JD 16/07/2021 at 19:01 - Reply

    Got an update from our TAM. ESXi 7.0 U3 has been pushed out to Sept. 21, 2021 based on the beta testing results. Still waiting on the issue/resolution for bootbank errors and the SD card getting unregistered.

    • Luciano Patrao 16/07/2021 at 22:13 - Reply

      Different dates, different rumors. My inside connections say the end of August.
      Either way, it is too long, too long…

  45. Adam Tyler 16/07/2021 at 23:09 - Reply

    This sucks. At this point it looks like downgrading OR installing a single traditional SATA/SAS disk in each host is the only option. I mean, other than the workaround posted in this article.

    Can anyone explain to me how security patches work with vSphere? For example, if I downgrade to a build of vSphere 7U1 that doesn’t have this SD card/USB problem, am I choosing to run a vulnerable ESXi build?

    It’s my understanding that the last ESXi 7 release that didn’t break SD cards/USB was ESXi-7.0U1b-17168206-standard (build 17168206). Is that accurate?

    • Luciano Patrao 19/07/2021 at 15:42 - Reply

      Yes, you can downgrade to that version, since it is the latest I tested that was working without this bug.

      Check the vSphere 7 U2a release notes for what was fixed. Some security fixes, yes, but nothing special.

      Afterwards, if you want to apply the security patches, pay close attention to which updates you are applying. Apply just the security patches, not the update patches, etc., or you will go back to U2a again.

  46. kamil 20/07/2021 at 06:29 - Reply

    they didn’t release the patch, but they published part of your workaround 🙂
    https://kb.vmware.com/s/article/83963?lang=en_US

  47. Fish 23/07/2021 at 18:46 - Reply

    Depending on the scale of your 7.0.2 deployments, you might all consider putting in place nightly reboots of your clusters. Luckily we only have one cluster at this revision. We are supposedly on a pre-release of the patch with VMware, or whatever support calls it. They were saying mid-August for official release. I don’t know if that means a single small patch or U3. Thanks for posting this workaround; it is helpful!

  48. Adam Tyler 23/07/2021 at 23:40 - Reply

    So this command has been saving my bacon lately.
    esxcfg-rescan -d vmhba32

    I really don’t want to downgrade and rebuild hosts. Is there no cron mechanism in ESXi? It seems like it would be a pretty easy fix to just run this command every couple of hours with a local cron job, if supported. The PowerShell method is probably better if you have hundreds of hosts, but that isn’t me.

    Something like this maybe?
    https://vswitchzero.com/2021/02/17/scheduling-tasks-in-esxi-using-cron/

    Regards,
    Adam Tyler

    • Luciano Patrao 24/07/2021 at 02:19 - Reply

      That is not a feasible option, since the commands can and should only be run when you have the issue. Running them without having the issue will not prevent the issue from appearing.
      And even when you do have the issue, you should reboot the ESXi host immediately after you recover the SD card.

  49. adam@tylerlife.us 24/07/2021 at 02:57 - Reply

    Well, I’m going to give it a shot. My cron file looks like this on all of my hosts now, running the rescan every hour.

    #min hour day mon dow command
    1 1 * * * /sbin/tmpwatch.py
    1 * * * * /sbin/auto-backup.sh
    0 * * * * /usr/lib/vmware/vmksummary/log-heartbeat.py
    */5 * * * * /bin/hostd-probe.sh ++group=host/vim/vmvisor/hostd-probe/stats/sh
    00 1 * * * localcli storage core device purge
    */10 * * * * /bin/crx-cli gc
    0 */1 * * * esxcfg-rescan -d vmhba32 && esxcli system syslog mark --message="Running esxcfg-rescan"

    Will let you know how it goes. Had some unplanned downtime this AM with two hosts going offline from vCenter’s perspective.

  50. Brandon M 26/07/2021 at 18:38 - Reply

    We had this issue after updating a UCS cluster (8 blades, each with its own SD card) to 7.0.2, and were told by VMware to apply the ramdisk workaround, but after a couple of weeks we had another host go down. So I can confirm that the workaround is NOT going to last forever, and we really need a fix from VMware. My idea is to move away from SD cards and set up ESXi boot from SAN. Has anyone else tried this method?

    • Luciano Patrao 27/07/2021 at 11:35 - Reply

      The workaround has two steps. One is to recover your ESXi host so you can reboot it and clear the issue (until it breaks again). The second one, the RAM disk, is meant to fix the issue, or at least make it happen less often. In some cases I have (2 or 3), it has fixed it so far. But it is not 100% reliable; it works maybe 70-80% of the time.

      But as stated in the blog post, none of these workarounds will fix the issue permanently.

    • Luciano Patrao 27/07/2021 at 16:48 - Reply

      Sorry did not reply to your boot from SAN question.

      We have some Dell servers where we use boot from SAN. It works fine, but I only use it if the storage has high availability (LUN replication, etc.),
      because if you lose your storage, your whole environment goes down.

      It depends on the type of HA you want in your environment. But it works fine, with no issues.

  51. Luke 27/07/2021 at 12:10 - Reply

    Hi Brandon, regarding your question about booting off the SAN: yes, we did it several years ago with the first-generation blades (I believe BL460 and BL490), using QLogic iSCSI mezzanines and setting the controllers up with boot LUNs on a P2000 G3 SAN. This worked fine and still would, of course. I set up 8 dedicated LUNs (one per blade, as we had a C3000) with a 32GB partition each. Never had an issue, even if the cold boot sequence is not too fast, but that’s understandable. At the moment we are transitioning to M.2 SATA SSDs installed on the riser on Gen10 servers and ditching the SD. For older Gen9 and Gen8 machines the solution is installing boot disk(s) or opting for a SATA SSD workaround, which we need to test.
    It is quite obvious to me, from VMware’s reluctance to release a quick fix for this issue that is not vendor related, that the “smaller” fish are not of much interest any longer; it is all big corporate projects and installs they focus on, at least that’s my take. This raises a big red flag.

  52. David Pasek 27/07/2021 at 22:49 - Reply

    Hi Brandon. Especially Cisco UCS blade servers were designed by the Cisco UCS designers (Silvano Gai and his team) to leverage Boot from SAN as the preferred boot method. Such a method allows “Cisco UCS stateless computing”. Cisco UCS Service Profiles (logical server specifications) and Boot from Fibre Channel SAN were my typical recommendation, design decision, and implementation when I worked for Cisco Advanced Services as a UCS consultant. That was 10 years ago, but this is still in use by some customers I’m in touch with to this day, with huge success, because it is the biggest UCS advantage over other server vendors. To be honest, it was neither my nor Dell’s typical recommendation on Dell servers when I worked for Dell Consulting Services, as it is not as native as on Cisco UCS, but it is still a design option. Now I work for VMware, and we are hardware agnostic; however, ESXi 7 and newer will require an ESX-OSData storage device, ideally 240GB+. In your particular case, if you have 2TB of free storage space (8x240GB) on your FC storage, I would definitely consider Cisco UCS with Boot from SAN and leverage ~250GB FC LUNs as boot devices. This will quickly solve your current challenge with the VMKUSB driver. Btw, if you leverage UCS Service Profiles, such a design would enable you to rip & replace any physical UCS server by just reassigning the Service (Server) Profile, and you are done in a few minutes, as it can preserve HBA WWNs, NIC MAC addresses, hardware UUID, etc. A UCS Service Profile also contains firmware and hardware settings, but that is off-topic. In terms of the ESXi 7 boot device, I have expressed my design thoughts in my blog post at https://www.vcdx200.com/2021/06/vsphere-7-esxi-boot-media-parition.html Hope you will find this helpful.

    • Luciano Patrao 29/07/2021 at 15:16 - Reply

      Good reply David.

      Thanks again for contributing to the discussion and with good input to this blog post.

  53. Ross 28/07/2021 at 11:01 - Reply

    One host got this issue again today, but this time even vmhba32 is gone. esxcfg-rescan gives “Error: Invalid adapter specified or unable to get adapter ‘vmhba32′”. The host still seems responsive and the VMs seem to be running fine. Is a reboot the only option?

    • Luciano Patrao 29/07/2021 at 15:15 - Reply

      You will always get errors. You need to wait some minutes and try again. After you run the command 3/4 times, the error will disappear. Afterwards you can vMotion the VMs and reboot the server. Yes, a reboot is the only option to fix the problem and get your ESXi host back in operation.

  54. Florian 29/07/2021 at 15:21 - Reply

    Similar problem here. HPE standalone host on Update 2(a), booting from an internal USB drive.
    After a few days, suddenly no Veeam backup (application error, could not initiate NFS filestream from datastore), no manual snapshot creation on the host itself (stuck at 0%). VM status not reported correctly (guest shut down, VM still shown as active).
    Logfiles inaccessible using SSH; the session keeps hanging when accessing/listing the filesystem.

    Current workaround: manual shutdown of every single vm, reset host, cold boot.

    As this happens every 2-3 weeks we’ll try to get along with periodic reboots until Update 3. Hope this will be fixed soon! Total mess.

    • Luciano Patrao 29/07/2021 at 15:48 - Reply

      If you run the workaround I specify here, esxcfg-rescan -d vmhba32, there is no need to shut down the VMs.
      After you have your ESXi host back, you can put it in maintenance mode and all VMs will vMotion to a working ESXi host; then you can reboot the host.

      Sometimes, when we do not run esxcfg-rescan -d vmhba32 for a long time and the host has had the issue for more than +\- 12h, more VMs start getting stuck, and some can even become invalid or orphaned. But after you fix the host, they will go back to normal.

      If you don’t want to power off the VMs (in my case most of my production VMs cannot afford a power-off outside the maintenance window), you need to be patient and try and retry the command to recover the host without having to power off the VMs.
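
      The try-and-retry recovery described here can be sketched as a small POSIX shell loop. This is only an illustration: the retry_rescan helper, its name, and its defaults are my own invention, and the placeholder command stands in for esxcfg-rescan -d vmhba32, which would of course have to run on the affected ESXi host.

```shell
#!/bin/sh
# Hedged sketch of the try-and-retry recovery loop (illustrative only).
# On an affected ESXi host the real command is: esxcfg-rescan -d vmhba32
retry_rescan() {
    cmd=$1
    attempts=${2:-5}   # how many tries before giving up (assumed default)
    delay=${3:-60}     # seconds to wait between tries (assumed default)
    i=1
    while [ "$i" -le "$attempts" ]; do
        if $cmd; then
            # the rescan finally ran without errors
            echo "rescan succeeded on attempt $i"
            return 0
        fi
        echo "attempt $i failed; waiting ${delay}s before retrying" >&2
        sleep "$delay"
        i=$((i + 1))
    done
    echo "rescan still failing after $attempts attempts" >&2
    return 1
}

# Example with a placeholder command; on a real host pass the rescan command.
retry_rescan true 3 1
```

      Once the rescan finally succeeds, the host can be put into maintenance mode, the VMs vMotioned away, and the host rebooted, exactly as described above.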

  55. Robert 29/07/2021 at 23:31 - Reply

    I’ve never had to reboot my hosts after running the 4 commands. They run for days or weeks after that until I need to run the commands again. Veeam seemed to trigger the problem a couple times but hasn’t lately.

  56. Patrick Long 29/07/2021 at 23:49 - Reply

    Is it my imagination, or did https://kb.vmware.com/s/article/83963 previously include the esxcfg-rescan and mgmt agent restarts, almost verbatim from this blog post, in the Workaround section… but this information has since been removed from the KB? I have been following so many pages regarding this issue that they are all getting blurry now 😉 I continue to encounter this issue sporadically (despite having implemented, years ago, all known mitigations to reduce I/O to the SD boot device), but luckily I only upgraded a small portion of my environment to 7.0 U2a, so it is manageable for me in the short term. The workarounds from this blog post have worked every time.

    • Luciano Patrao 30/07/2021 at 17:10 - Reply

      Yes, it was. It seems it was removed.

      So many strange things about this bug, this support, and how VMware is handling it.

  57. Markus 02/08/2021 at 13:43 - Reply

    Please find the link to the Dell EMC PowerEdge servers / SD card compatibility matrix with VMware vSphere ESXi 7, if this helps. The problem when checking: there is no way to read the exact type of SD card from the system without opening it physically, as it is part of a special USB design. You/we have to open every single host and check.

    https://www.dell.com/support/manuals/de-de/vmware-esxi-7.x/vmware_7.0.x_vsphere_compmatrix_pub/Dell-EMC-PowerEdge-serversSD-card-compatibility-matrix?guid=guid-89b7699f-9dbe-4efd-a325-d4cdf9cfd927&lang=en-us

    Please let me know if somebody finds a way to check it online; I would highly appreciate that.

    Thanks, Markus

    • Luciano Patrao 02/08/2021 at 14:39 - Reply

      I don’t know of any way to check it without physically looking at the SD card in the server. In the BIOS, you don’t get much information.
      For us it is easy, because we keep track of all the SD cards in our servers.

    • Andy 04/08/2021 at 16:16 - Reply

      If you access the Dell Support page with the service tag of your PowerEdge server, you can view the entire configuration of the system (Quick links > View product specs). When you expand the row referring to the SD cards (not the IDSDM card reader; there is a separate line for the SD cards themselves!), e.g. 16GB microSDHC/SDXC Card, you will find the part number of the SD cards installed in your server, e.g.:

      FH2KP ASSY,FSD,SDIG,16G,UHS,IDSDM,KN 2

      This part number is referenced in the (confidential) VMware SD card compatibility matrix PDF that is making the rounds on the internet.

    • Andy 05/08/2021 at 09:55 - Reply

      Go to the Dell Support site and enter the Service Tag of your PowerEdge server. On the right under “Quick Links” you can view the system’s configuration as shipped from factory. There you should find an entry like “16GB microSDHC/SDXC Card”. Expand it and you have the part number of the installed cards that can be cross-referenced to the compatibility matrix.

  58. Adam Tyler 02/08/2021 at 19:00 - Reply

    Can someone explain to me why it matters what kind of SD card is used? I realize that some SD cards are better than others: faster, or able to handle more writes, etc. But I was under the impression that this bug is related to the USB controller behind the SD card. All SD cards should work, as far as I understand. If you use a crappy one, yes, it is going to be slow and may fail to write, but it won’t go offline constantly like we are seeing.

    Am I off base here?

    • Luciano Patrao 03/08/2021 at 20:01 - Reply

      It is the same as with any hardware: it needs to be on the HCL. Many devices are not supported but still work with VMware. With SD cards in particular, it is about the quality and how many reads/writes and I/O they can handle.

      But I agree, that is not the root cause of this bug.

  59. ultrium 02/08/2021 at 21:57 - Reply

    Same error here with Bull hardware (Bull Sequana S400). Opened a ticket with VMware, but already expecting the same answer as the others. The workaround saved us from having downtime on 200 VMs. Thanks!