The vCenter crash was triggered by a storage network switch outage that affected all the iSCSI LUNs in my home lab. The outage was the result of changes I made to the switch’s VLAN capacity, which required a reboot. I approved the reboot request, forgetting that my VMs were still running.
The switch came back, but some of the LUNs were not reconnected or recognized by the hosts. So I needed to reboot the two ESXi hosts to recover the iSCSI LUNs while some VMs were running. Unfortunately, one of those VMs was the vCenter.
After the hosts were back, had recovered all iSCSI LUNs, and recognized all VMs, I powered on vCenter and it was full of problems. The VM could not correctly identify its virtual switch (a Standard Switch) and complained that the switch needs to be ephemeral (which, as we know, is the only vDS port binding type we should use for the vCenter network). The problem is that this was not a vDS but a Standard Switch.
Somehow the crash had confused vCenter. I had other management networks that I could add to the vCenter, but none of them could be added. Even when checking the network in the vCenter config GUI (using the VM console), there was no IP and no gateway. Everything was a mess with that network.
Since I didn’t have many options, I decided to remove the network interface from the vCenter VM and add a new one. After I did that, vCenter recognized it as eth0, but there were still network issues. I created a new Standard port group on the ESXi host and added it to the vCenter. After that, I could power on the vCenter and it had a network.
But as expected, the vCenter services did not start properly, and I began to see a lot of issues with vCenter and its services (not starting, or starting and then stopping, etc.).
When opening vCenter in the browser, I got “no healthy upstream” and nothing happened. So it was time to check the logs to see what had happened inside the vCenter.
First, I checked the vpxd.log located in /var/log/vmware/vpxd, and that is when I started to see a lot of “duplicate key value violates unique constraint” errors.
Some examples of errors in the vpxd.log:
-->
error vpxd[07060] [Originator@6876 sub=Default opID=HB-host-4585@1685-44c6606d] An unrecoverable problem has occurred, stopping the VMware VirtualCenter service. Error: Error[VdbODBCError] (-1) "ODBC error: (23505) - ERROR: duplicate key value violates unique constraint "pk_vpx_entity"
--> DETAIL: Key (id)=(38661) already exists.;
--> Error while executing the query" is returned when executing SQL statement "INSERT INTO VPX_ENTITY (ID,NAME,TYPE_ID,PARENT_ID) VALUES (?,?,?,?)"
panic vpxd[07060] [Originator@6876 sub=Default opID=HB-host-4585@1685-44c6606d]
So I have some duplicate keys in the vCenter DB.
The next step is to check the vPostgres logs in /storage/log/vmware/vpostgres/.
Check the latest logs:
root@nested-vcenter-06 [ /storage/log/vmware/vpostgres ]# ls -alt postgresql*.log
-rw------- 1 vpostgres vpgmongrp 10362308 Feb 5 02:13 postgresql-05.log
-rw------- 1 vpostgres vpgmongrp 63139899 Feb 5 00:59 postgresql-04.log
-rw------- 1 vpostgres vpgmongrp     3733 Feb 5 00:06 postgresql-20.log
-rw------- 1 vpostgres vpgmongrp   607140 Feb 5 00:05 postgresql-19.log
Now search for anything related to duplicate keys.
Logs 04 and 05 had a lot of entries regarding the problem:
root@nested-vcenter-06 [ /storage/log/vmware/vpostgres ]# cat postgresql-19.log | grep -i 'duplicate key'
root@nested-vcenter-06 [ /storage/log/vmware/vpostgres ]# cat postgresql-20.log | grep -i 'duplicate key'
root@nested-vcenter-06 [ /storage/log/vmware/vpostgres ]# cat postgresql-04.log | grep -i 'duplicate key'
63dee6dc.1ccb 7117680 VCDB vc ERROR: duplicate key value violates unique constraint "pk_vpx_entity"
63dee737.2f51 7120794 VCDB vc ERROR: duplicate key value violates unique constraint "pk_vpx_entity"
63dee782.3f66 7123098 VCDB vc ERROR: duplicate key value violates unique constraint "pk_vpx_entity"
root@nested-vcenter-06 [ /storage/log/vmware/vpostgres ]# cat postgresql-05.log | grep -i 'duplicate key'
63def560.193d 7125356 VCDB vc ERROR: duplicate key value violates unique constraint "pk_vpx_entity"
63def7d6.9308 7127265 VCDB vc ERROR: duplicate key value violates unique constraint "pk_vpx_entity"
63def994.1bda 7128563 VCDB vc ERROR: duplicate key value violates unique constraint "pk_vpx_entity"

root@nested-vcenter-06 [ /storage/log/vmware/vpostgres ]# cat postgresql-05.log | grep -i 'already exists'
63def4b7.87e 0 [unknown] archiver ERROR: replication slot "vpg_archiver" already exists
63def52e.116d 0 VCDB vc ERROR: relation "gin_kv_value" already exists
63def560.193d 7125356 VCDB vc DETAIL: Key (id)=(38661) already exists.
63def7d6.9308 7127265 VCDB vc DETAIL: Key (id)=(38661) already exists.
63def8d9.9b9 0 [unknown] archiver ERROR: replication slot "vpg_archiver" already exists
63def96d.156e 0 VCDB vc ERROR: relation "gin_kv_value" already exists
63def994.1bda 7128563 VCDB vc DETAIL: Key (id)=(38661) already exists.
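The separate grep passes above could be condensed into one small helper. This is only a sketch: the function name summarize_duplicate_keys is hypothetical (it is not a VMware tool), and it assumes the log lines look like the ones shown in this post.

```shell
#!/usr/bin/env bash
# summarize_duplicate_keys: hypothetical helper, not shipped with vCenter.
# Assumes log lines formatted like the vPostgres excerpts in this post.
summarize_duplicate_keys() {
    # Count "duplicate key" errors per violated constraint.
    grep -h 'duplicate key value violates unique constraint' "$@" \
        | sed -E 's/.*unique constraint "([^"]+)".*/\1/' \
        | sort | uniq -c \
        | awk '{print "constraint " $2 ": " $1 " error(s)"}'

    # List each distinct key reported in the DETAIL lines, e.g. id=38661.
    grep -h 'DETAIL: *Key (' "$@" \
        | sed -E 's/.*Key \(([^)]+)\)=\(([^)]+)\).*/\1=\2/' \
        | sort -u \
        | awk '{print "key " $0}'
}

# Possible usage on the appliance:
#   summarize_duplicate_keys /storage/log/vmware/vpostgres/postgresql-*.log
```

The output names each violated constraint with its error count and each distinct key, which is enough to know which ID to look up in the database next.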
So I need to double-check the vPostgres DB and fix any duplicate keys that exist for this ID, 38661.
Note: Before logging in to the vPostgres DB, stop the vpxd service with: service-control --stop vmware-vpxd
Connect to vCenter vPostgres DB.
root@nested-vcenter-06 [ /storage/log/vmware/vpostgres ] # /opt/vmware/vpostgres/current/bin/psql -U postgres -d VCDB
Then search for the ID in VPX_ENTITY
VCDB=# select * FROM VPX_ENTITY where id='38661';
  id   |  name   | type_id | parent_id
-------+---------+---------+-----------
 38661 | vcenter |      19 |       107
(1 row)
Found one entry, and it says “vcenter”, which seems to be the network I created after the crash.
Check the parent_id to make sure this was a network.
VCDB=# select * FROM VPX_ENTITY where parent_id='107';
  id   |               name               | type_id | parent_id
-------+----------------------------------+---------+-----------
   152 | Groups vDS 1                     |      15 |       107
 32695 | HCX vMotion PortGroup            |      15 |       107
  1041 | VMs Network                      |      19 |       107
   569 | vDS-Nested                       |      15 |       107
   115 | Nested Network                   |      19 |       107
   147 | Storage iSCSI                    |      19 |       107
   150 | Main-vDS                         |      14 |       107
   151 | Main-vDS-DVUplinks-150           |      15 |       107
   265 | vDS-VLAN10-DVUplinks-264         |      15 |       107
   264 | vDS-Storage                      |      14 |       107
 32715 | HCX Extend Netwo-DVUplinks-32714 |      15 |       107
  2868 | HyperV-Cluster                   |      15 |       107
  2869 | HyperV-Network                   |      15 |       107
 32717 | HCX Extended Network             |      15 |       107
 37674 | VM Network                       |      19 |       107
 38661 | vcenter                          |      19 |       107
 38662 | Network Management               |      19 |       107
 37675 | MA-VMW-Management                |      19 |       107
  4564 | vDS Storage VCF Network          |      15 |       107
  4562 | vDS Nested iSCSI 01              |      15 |       107
  4561 | vDS Nested iSCSI 02              |      15 |       107
  4563 | vDS Temp Nested                  |      15 |       107
  4565 | vDS Nested Storage               |      15 |       107
  3045 | NetApp O&M                       |      15 |       107
 37669 | none                             |      19 |       107
 37676 | MA-VMW-VMotion                   |      19 |       107
  3046 | NetApp Cluster                   |      15 |       107
 32696 | HCX Replication PortGroup        |      15 |       107
 32697 | HCX Management PortGroup         |      15 |       107
 32698 | HCX Uplink PortGroup             |      15 |       107
 32714 | HCX Extend Network               |      14 |       107
 32716 | HCX Extended Network - 5         |      15 |       107
  9661 | Management                       |      19 |       107
  9662 | Management Network               |      19 |       107
 33700 | Extend Network HCX               |      15 |       107
 10675 | vSW nested iSCSI                 |      19 |       107
(36 rows)
All of these entries are networks, so the problem is in the network named “vcenter”.
To double-check, I also queried type_id 19, which is the type this network uses.
VCDB=# select * FROM VPX_ENTITY where type_id='19';
  id   |        name        | type_id | parent_id
-------+--------------------+---------+-----------
  1041 | VMs Network        |      19 |       107
   115 | Nested Network     |      19 |       107
   147 | Storage iSCSI      |      19 |       107
 37674 | VM Network         |      19 |       107
 38661 | vcenter            |      19 |       107
 38662 | Network Management |      19 |       107
 37675 | MA-VMW-Management  |      19 |       107
 37669 | none               |      19 |       107
 37676 | MA-VMW-VMotion     |      19 |       107
  9661 | Management         |      19 |       107
  9662 | Management Network |      19 |       107
 10675 | vSW nested iSCSI   |      19 |       107
(12 rows)
So it is confirmed: this is the network I created, and it is a Standard Switch port group.
What I need to do next is delete the duplicate key.
Important Note: Never touch the vCenter DB without first creating a snapshot of the vCenter VM or, if you prefer, a backup of the DB (check the VMware KB for this).
VCDB=# delete FROM VPX_ENTITY where id='38661';
DELETE 1
VCDB=#
After deleting the entry, we can start the service again with service-control --start vmware-vpxd, but I have often seen that this is not enough, so it is better to reboot the vCenter.
After the vCenter was powered on, the vCenter service still didn’t want to start. So I checked the vpxd logs again to see if there was anything else to fix, and I found this:
-->
error vpxd[31049] [Originator@6876 sub=Default opID=HB-host-4585@7751-44c6606d] An unrecoverable problem has occurred, stopping the VMware VirtualCenter service. Error: Error[VdbODBCError] (-1) "ODBC error: (23505) - ERROR: duplicate key value violates unique constraint "pk_n_vm_config_info"
--> DETAIL: Key (id)=(38663) already exists.;
--> Error while executing the query" is returned when executing SQL statement "INSERT INTO VPX_NON_ORM_VM_CONFIG_INFO (ID,CHANGE_VERSION,CHANGE_TRACKING_ENABLED,CPU_HOT_ADD_ENABLED,CPU_HOT_REMOVE_ENABLED,MEM_HOT_ADD_ENABLED,HARDWARE_NUM,HARDWARE_MEMORY,HARDWARE_CORES,VIRTUAL_ICH7M_PRESENT,VIRTUAL_SMC_PRESENT,TOOLS_BEFORE_GUEST_STANDBY_FLG,TOOLS_BEFORE_GUESTSHUTDOWN_FLG,TOOLS_TOOLS_UPGRADE_POLICY,TOOLS_AFTER_RESUME_FLG,TOOLS_AFTER_POWER_ON_FLG,TOOLS_SYNC_TIME_WITH_HOST_FLG,TOOLS_TOOLS_VERSION,TOOLS_LASTINSTALL_COUNTER,GUEST_FULL_NAME,INSTANCE_UUID,UUID,ANNOTATION,VERSION,TEMPLATE_FLG,M"
panic vpxd[31049] [Originator@6876 sub=Default opID=HB-host-4585@7751-44c6606d]
-->
--> Panic: Unrecoverable VmRootError. Panic!
--> Backtrace:
--> [backtrace begin] product: VMware VirtualCenter, version: 7.0.3, build: build-20395099, tag: vpxd, cpu: x86_64, os: linux, buildType: release
And in the vPostgres logs, I have this:
error vpxd[31049] [Originator@6876 sub=Default opID=HB-host-4585@7751-44c6606d] [VdbStatement] SQLError was thrown: "ODBC error: (23505) - ERROR: duplicate key value violates unique constraint "pk_n_vm_config_info"
--> DETAIL: Key (id)=(38663) already exists.;
--> Error while executing the query" is returned when executing SQL statement "INSERT INTO VPX_NON_ORM_VM_CONFIG_INFO (ID,CHANGE_VERSION,CHANGE_TRACKING_ENABLED,CPU_HOT_ADD_ENABLED,CPU_HOT_REMOVE_ENABLED,MEM_HOT_ADD_ENABLED,HARDWARE_NUM,HARDWARE_MEMORY,HARDWARE_CORES,VIRTUAL_ICH7M_PRESENT,VIRTUAL_SMC_PRESENT,TOOLS_BEFORE_GUEST_STANDBY_FLG,TOOLS_BEFORE_GUESTSHUTDOWN_FLG,TOOLS_TOOLS_UPGRADE_POLICY,TOOLS_AFTER_RESUME_FLG,TOOLS_AFTER_POWER_ON_FLG,TOOLS_SYNC_TIME_WITH_HOST_FLG,TOOLS_TOOLS_VERSION,TOOLS_LASTINSTALL_COUNTER,GUEST_FULL_NAME,INSTANCE_UUID,UUID,ANNOTATION,VERSION,TEMPLATE_FLG,M"
warning vpxd[31049] [Originator@6876 sub=VpxProfiler opID=HB-host-4585@7751-44c6606d] DoHostSync:host-4585 [ProcessChanges] took 3019 ms
warning vpxd[31049] [Originator@6876 sub=VpxProfiler opID=HB-host-4585@7751-44c6606d] DoHostSync:host-4585 [DoHostSyncTime] took 5982 ms
warning vpxd[31049] [Originator@6876 sub=VpxProfiler opID=HB-host-4585@7751-44c6606d] InvtHostSyncLRO::StartWork [HostSyncTime] took 5982 ms
panic vpxd[31049] [Originator@6876 sub=vpxCommon opID=HB-host-4585@7751-44c6606d] Unrecoverable VmRootError: 00007ff19c92ba40
Backtrace:
So I have more duplicate keys, this time in “pk_n_vm_config_info”. This one is different because it is related to VMs.
I connected to the vPostgres DB again and checked the id 38663.
Note: Again, don’t forget to create a backup or a snapshot and stop the vpxd service.
VCDB=# select * FROM VPX_NON_ORM_VM_CONFIG_INFO where id='38663';
  id   |                   name                    | type_id | parent_id
-------+-------------------------------------------+---------+-----------
 38663 | vCLS-3a33c35d-7443-4dec-87a3-75c09efed368 |       0 |      4048
(1 row)
Strange, this is one of the DRS vCLS VMs. Checking the parent_id (4048), I see that two entries exist, so one is duplicated (the 38663).
VCDB=# select * FROM VPX_ENTITY where parent_id='4048';
  id   |                   name                    | type_id | parent_id
-------+-------------------------------------------+---------+-----------
 38663 | vCLS-3a33c35d-7443-4dec-87a3-75c09efed368 |       0 |      4048
 38661 | vCLS-c53556b2-5f74-475e-8423-bf3769b90e39 |       0 |      4048
(2 rows)
So I need to check the vpx_vm_text regarding this VM id.
VCDB=# select vm_id from vpx_vm_text where id=38663;
 vm_id
-------
 32684
 32684
 32684
 32684
 32684
(5 rows)
I found 5 rows here, so I need to delete everything related to this VM id.
Here, honestly, I didn’t know for sure which tables I needed to clean up for this VM id, because it is not just vpx_vm_text. There are entries in other tables too. Then I found a very useful blog post from Kabir, a colleague at my company, ITQ. He had a similar issue and listed all the tables that need to be cleaned. Since I had a backup, I gave it a try.
delete from VPX_COMPUTE_RESOURCE_DAS_VM where VM_ID='32684';
delete from VPX_COMPUTE_RESOURCE_DRS_VM where VM_ID='32684';
delete from VPX_COMPUTE_RESOURCE_ORC_VM where VM_ID='32684';
delete from VPX_VM_SGXINFO where VM_ID='32684';
delete from VPX_GUEST_DISK where VM_ID='32684';
delete from VPX_VM_VIRTUAL_DEVICE where ID='32684';
delete from VPX_VM_DS_SPACE where VM_ID='32684';
delete from VPX_NON_ORM_VM_CONFIG_INFO where ID='32684';
delete from VPX_NORM_VM_FLE_FILE_INFO where VM_ID='32684';
delete from VPX_VDEVICE_BACKING_REL where VM_ID='32684';
delete from VPX_VIRTUAL_DISK_IOFILTERS where VM_ID='32684';
delete from VPX_VM_STATIC_OVERHEAD_MAP where VM_ID='32684';
delete from VPX_VM_TEXT where VM_ID='32684';
delete from VPX_VM where ID='32684';
I then checked whether any entries with VM_ID='32684' still existed, and there were none. So all is clean for the vCLS VM.
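The delete statements and the follow-up check could also be generated by a small script instead of typed by hand. The sketch below is not what I ran: the function names gen_cleanup_sql and gen_verify_sql are hypothetical, the table list is copied from the statements above, and wrapping the deletes in BEGIN/COMMIT is my own assumption about a safer way to apply them.

```shell
#!/usr/bin/env bash
# Table:column pairs copied from the delete statements in this post.
# The order is preserved so that VPX_VM is cleaned last.
VM_TABLES="
VPX_COMPUTE_RESOURCE_DAS_VM:VM_ID
VPX_COMPUTE_RESOURCE_DRS_VM:VM_ID
VPX_COMPUTE_RESOURCE_ORC_VM:VM_ID
VPX_VM_SGXINFO:VM_ID
VPX_GUEST_DISK:VM_ID
VPX_VM_VIRTUAL_DEVICE:ID
VPX_VM_DS_SPACE:VM_ID
VPX_NON_ORM_VM_CONFIG_INFO:ID
VPX_NORM_VM_FLE_FILE_INFO:VM_ID
VPX_VDEVICE_BACKING_REL:VM_ID
VPX_VIRTUAL_DISK_IOFILTERS:VM_ID
VPX_VM_STATIC_OVERHEAD_MAP:VM_ID
VPX_VM_TEXT:VM_ID
VPX_VM:ID
"

# Reject anything that is not a plain number before it reaches the SQL text.
require_numeric_id() {
    case "$1" in
        ''|*[!0-9]*) echo "VM id must be numeric: '$1'" >&2; return 1 ;;
    esac
}

# Hypothetical helper: print the deletes for one VM id as one transaction.
gen_cleanup_sql() {
    local vm_id="$1" entry
    require_numeric_id "$vm_id" || return 1
    echo "BEGIN;"
    for entry in $VM_TABLES; do
        echo "delete from ${entry%%:*} where ${entry##*:}='${vm_id}';"
    done
    echo "COMMIT;"
}

# Hypothetical helper: print one count query per table to confirm the cleanup.
gen_verify_sql() {
    local vm_id="$1" entry
    require_numeric_id "$vm_id" || return 1
    for entry in $VM_TABLES; do
        echo "select '${entry%%:*}' as table_name, count(*) from ${entry%%:*} where ${entry##*:}='${vm_id}';"
    done
}
```

One possible way to use it is to write the output to a file with gen_cleanup_sql 32684 > /tmp/cleanup.sql and run it with psql -U postgres -d VCDB -v ON_ERROR_STOP=1 -f /tmp/cleanup.sql. With ON_ERROR_STOP set, psql stops at the first failing statement, so the open transaction is never committed and no partial cleanup is left behind.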
Note: Thanks Kabir for the tips. This is why the #vcommunity and sharing knowledge is important.
Reboot the vCenter again and wait to see if all is good.
After the vCenter was rebooted, I checked the vpxd and vPostgres logs, and there were no more duplicate key errors. So this problem was solved, but I still could not get vCenter working or log in. The vCenter service was running and the login page was shown, but when I tried to log in, it just kept thinking, thinking, and nothing happened.
So I went to the vpxd logs again and saw some issues regarding certificates and bad passwords. A strange error, but I had changed (or rather added, since they were empty) the IP and gateway manually and not through VAMI (VMware recommends making this change through VAMI, or we can have some issues), so maybe I had also messed up the certificates. I decided to recreate all of them.
How to recreate vCenter certificates? Use the certificate manager that is here: /usr/lib/vmware-vmca/bin/certificate-manager
Here you have the option to recreate one or all certificates. Since I don’t know what happened, I decided to recreate all certificates using option 8.
This is straightforward, just use the defaults and add your vCenter IP and FQDN for the hostname and VMCA. You can check VMware KB for how to do it.
It takes some minutes to finish (depending on the size of your vCenter), and then it reboots automatically.
After the reboot, I logged in again, and… VOILÀ!!! Finally, I have the vCenter back.
There were some minor issues regarding the networks that I needed to fix, but nothing special, and then vCenter was fully functional.
As we can see, by troubleshooting vCenter logs, we can easily find the root cause of the problem. In this case, since it includes vPostgres DB with duplicate keys, it can be tricky.
Share this article if you think it is worth sharing. If you have any questions or comments, comment here, or contact me on Twitter.