Today, I had a storage issue, and all my VMs crashed. One of those VMs was my VMware Cloud Director (vCD). After resolving the storage problem and rebooting the VM, I noticed that vCD did not start properly. Checking the services, I found that some had failed with errors, preventing vCD from functioning. In this post, VMware Cloud Director Failure After Storage Crash – Troubleshooting and Fix, I explain step by step what happened and how I fixed it.
Investigating the Issue
Step 1: Checking vCD Services
The first thing I did was check the status of vCD services using:
systemctl status vmware-vcd
The output showed that vCD failed to start, but the error was unclear. To investigate further, I checked the logs:
tail -f /opt/vmware/vcloud-director/logs/cell.log
Logs Indicated Issues with vPostgres (vpostgres) Service
The logs showed that vCD could not connect to its database:
Application Initialization: 'Legacy Cell Application' 68% complete. Subsystem 'com.vmware.vcloud.vapp-lifecycle' started
Application Initialization: 'Legacy Cell Application' 72% complete. Subsystem 'com.vmware.vcloud.content-library' started
Application startup event: Subsystem 'CarePackage Cell Application' startup initiated.
Error starting application: Unable to connect to database
This led me to check if the vPostgres (vpostgres) service was running.
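When cell.log is long, it can help to pull out only the database-related errors before moving on. This is a minimal sketch, assuming the error text matches the line shown above; `find_db_errors` is a hypothetical helper name, not something shipped with vCD.

```shell
# Print numbered cell.log lines that mention a database connection failure.
# The pattern is taken from the excerpt above; other vCD versions may word it differently.
find_db_errors() {
    grep -n -i 'unable to connect to database' "$1"
}
```

It would be called as `find_db_errors /opt/vmware/vcloud-director/logs/cell.log`; no output means the pattern was not found.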
Step 2: Checking vPostgres Service
systemctl status vpostgres
The output showed:
* vpostgres.service - VMware Postgres database server
Active: failed (Result: exit-code)
Process: 20688 ExecStart=/opt/vmware/vpostgres/current/bin/pg_ctl -s -D ${VMWARE_POSTGRES_DATA} -w -t ${VMWARE_POSTGRES_PGCTL_TIMEOUT} start (code=exited, status=1/FAILURE)
Feb 26 13:40:28 vCD-10.4.vmwarehome.lab postgres[20688]: 2025-02-26 12:40:28.547 UTC [20690] LOG: redirecting log output to logging collector process
Feb 26 13:40:28 vCD-10.4.vmwarehome.lab postgres[20688]: 2025-02-26 12:40:28.547 UTC [20690] HINT: Future log output will appear in directory "log".
Feb 26 13:40:28 vCD-10.4.vmwarehome.lab postgres[20688]: pg_ctl: could not start server
Feb 26 13:40:28 vCD-10.4.vmwarehome.lab postgres[20688]: Examine the log output.
Feb 26 13:40:28 vCD-10.4.vmwarehome.lab systemd[1]: vpostgres.service: Failed with result 'exit-code'.
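Since vCD depends on the database, checking both units in one pass gives a quick overview. The sketch below relies only on `systemctl is-active --quiet`; the function name `check_units` is made up for illustration.

```shell
# Report the state of each unit passed as an argument.
# Returns non-zero if at least one unit is not active.
check_units() {
    rc=0
    for unit in "$@"; do
        if systemctl is-active --quiet "$unit"; then
            echo "$unit: active"
        else
            echo "$unit: NOT active"
            rc=1
        fi
    done
    return "$rc"
}
```

With the unit names from this post, the call would be `check_units vpostgres vmware-vcd`, listing the database first because vCD cannot start without it.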
Step 3: Checking Postgres Logs
Since the service wasn’t starting, I checked the database logs for more details:
tail -n 50 /var/vmware/vpostgres/10/pgdata/log/postgresql-26.log
I found the following:
2025-02-26 12:27:10.653 UTC [828] LOG: invalid record length at 4/DD0355A0: wanted 24, got 0
2025-02-26 12:27:10.822 UTC [828] LOG: request to flush past end of generated WAL; request 4/DDBBF540, currpos 4/DD0355A0
2025-02-26 12:27:10.822 UTC [828] CONTEXT: writing block 0 of relation base/16385/22842_vm
2025-02-26 12:27:10.822 UTC [828] FATAL: xlog flush request 4/DDBBF540 is not satisfied --- flushed only to 4/DD0355A0
2025-02-26 12:27:10.825 UTC [798] LOG: startup process (PID 828) exited with exit code 1
This indicated WAL (Write-Ahead Log) corruption, likely caused by the storage crash.
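To spot these messages without reading the whole log, a small filter can help. This is only a sketch: the patterns are copied from the excerpt above and are not a complete list of WAL-related errors, and `scan_wal_errors` is a hypothetical name.

```shell
# List FATAL/PANIC lines and the WAL-related messages seen in this incident.
# Patterns come from the log excerpt above and are not exhaustive.
scan_wal_errors() {
    grep -E 'FATAL|PANIC|invalid record length|flush past end of generated WAL' "$1"
}
```

For example: `scan_wal_errors /var/vmware/vpostgres/10/pgdata/log/postgresql-26.log`.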
Attempted Fixes
Fix 1: Restarting the Service (Did Not Work)
Tried a simple restart:
systemctl restart vpostgres
Still failed.
Fix 2: Removing Stale Lock Files (Did Not Work)
Removed a possibly stale PID file and restarted:
rm -f /var/vmware/vpostgres/10/pgdata/postmaster.pid
systemctl restart vpostgres
Still no success.
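Deleting postmaster.pid is only safe when no postgres process is still using the data directory. The sketch below checks that first, using the fact that the first line of postmaster.pid holds the server's PID. The helper name `pid_file_is_stale` is hypothetical, and it assumes it runs as root so that `kill -0` can probe a process owned by another user.

```shell
# Return 0 if the PID recorded in the file no longer belongs to a running process,
# 1 if that process is still alive, 2 if the file is missing or has no numeric PID.
pid_file_is_stale() {
    [ -r "$1" ] || return 2
    pid=$(head -n 1 "$1")
    case "$pid" in
        ''|*[!0-9]*) return 2 ;;
    esac
    # kill -0 sends no signal; it only tests whether the process exists.
    if kill -0 "$pid" 2>/dev/null; then
        return 1
    fi
    return 0
}
```

A cautious removal would then look like `pid_file_is_stale /var/vmware/vpostgres/10/pgdata/postmaster.pid && rm -f /var/vmware/vpostgres/10/pgdata/postmaster.pid`.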
Fix 3: Forcing Recovery by Clearing WAL Logs (Worked)
- Created a backup of WAL logs:
cp -r /var/vmware/vpostgres/10/pgdata/pg_wal /var/vmware/vpostgres/10/pgdata/pg_wal_backup
- Removed corrupted WAL logs:
rm -rf /var/vmware/vpostgres/10/pgdata/pg_wal/*
- Ensured correct ownership and permissions:
chown -R postgres:users /var/vmware/vpostgres/10/pgdata/pg_wal
- Restarted vPostgres service:
systemctl restart vpostgres
This time, vPostgres started successfully.
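Because the `rm -rf` step cannot be undone, it is worth confirming that the WAL backup is complete before deleting anything. The following sketch copies a directory to a timestamped location and compares the copy with the source. `backup_and_verify` is a hypothetical helper, and `cp -a` assumes GNU or BusyBox `cp`.

```shell
# Copy a directory to <dest_root>/<name>_<timestamp> and verify the copy.
# Prints the backup path on success; returns non-zero if the copy or comparison fails.
backup_and_verify() {
    src="$1"
    dest="$2/$(basename "$src")_$(date +%Y%m%d%H%M%S)"
    # -a preserves ownership, permissions and timestamps.
    cp -a "$src" "$dest" || return 1
    # diff -r exits non-zero if any file differs or is missing.
    diff -r "$src" "$dest" >/dev/null || return 1
    echo "$dest"
}
```

For instance, `backup_and_verify /var/vmware/vpostgres/10/pgdata/pg_wal /tmp/wal_backup` would work once the destination directory exists and has enough free space; the `/tmp/wal_backup` path is only an example.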
Final Step: Restarting vCD
With vPostgres running, I restarted the VMware Cloud Director service:
systemctl restart vmware-vcd
Then checked the logs:
tail -f /opt/vmware/vcloud-director/logs/cell.log
After a few minutes, vCD was fully operational.
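Instead of watching `tail -f`, the startup progress can be read out of cell.log directly. This sketch assumes the "NN% complete" line format shown in the excerpt earlier in this post; `latest_init_percent` is a hypothetical name, and `grep -o` assumes GNU, BSD or BusyBox grep.

```shell
# Print the most recent "NN% complete" value found in the given log file.
# Prints nothing if no such line exists.
latest_init_percent() {
    grep -o '[0-9][0-9]*% complete' "$1" | tail -n 1 | cut -d'%' -f1
}
```

Running `latest_init_percent /opt/vmware/vcloud-director/logs/cell.log` repeatedly shows whether initialization is still advancing or has stalled.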
Conclusion
The root cause of the issue was the abrupt storage failure, which caused VMware Cloud Director (vCD) and its embedded PostgreSQL database (vPostgres) to crash unexpectedly. When storage failures occur, running VMs and services often stop without a proper shutdown, leading to inconsistencies in data, corrupted transaction logs, and lingering lock files.
In this case, the PostgreSQL database detected an improper shutdown and attempted an automatic recovery. However, it failed due to an invalid WAL (Write-Ahead Log) record, causing the database startup process to exit with errors. The logs clearly indicated issues with flushing WAL records beyond the last known valid transaction, which prevented vPostgres from fully recovering.
This type of failure can happen in several scenarios, including:
- Power outages or unplanned reboots that interrupt active transactions.
- Storage connectivity issues where the database loses access to its data mid-operation.
- Disk corruption or filesystem errors affecting critical database files.
- High I/O latency or overload leading to service crashes due to resource exhaustion.
To resolve the issue, I first removed the stale lock file (postmaster.pid), which did not help on its own, and then cleared out the corrupted WAL segments so that vPostgres could start again. Restarting the service alone wasn’t enough; direct intervention was required to fix the underlying issue before vCD could start normally.
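For comparison, PostgreSQL itself ships `pg_resetwal`, a last-resort tool for a server that cannot start because of WAL damage. It was not used in this incident, and transactions that never reached the data files are lost either way. The sketch below only previews what the tool would do, using its `-n` (dry-run) option. It assumes the binary sits next to `pg_ctl` in `/opt/vmware/vpostgres/current/bin`, which I have not verified on the appliance, and that it is run as the postgres OS user with vpostgres stopped. `preview_wal_reset` is a hypothetical name.

```shell
# Preview what pg_resetwal would change, without modifying anything.
# Arguments: path to the pg_resetwal binary, path to the data directory.
preview_wal_reset() {
    bin="$1"
    datadir="$2"
    if [ ! -x "$bin" ]; then
        echo "pg_resetwal not found at $bin" >&2
        return 1
    fi
    # -n (--dry-run) prints the values extracted from pg_control and exits without writing.
    "$bin" -n -D "$datadir"
}
```

An example call, under the path assumption above: `preview_wal_reset /opt/vmware/vpostgres/current/bin/pg_resetwal /var/vmware/vpostgres/10/pgdata`.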
The key takeaway is that unexpected storage failures can lead to database corruption, requiring manual recovery steps to restore services. Monitoring storage health, ensuring backups are in place, and having a clear recovery plan can help mitigate similar incidents in the future. While vPostgres has built-in recovery mechanisms, they are not always sufficient when logs or transactions are unrecoverable, making manual intervention necessary.
- Cause of Issue: Storage crash led to WAL corruption in vPostgres, preventing vCD from starting.
- Symptoms: vCD logs showed database connectivity issues, and vPostgres logs indicated WAL corruption.
- Fixes Attempted:
  - Restarting the service: No success.
  - Removing stale PID files: No success.
  - Clearing WAL logs and forcing recovery: Success.
- Final Outcome: Successfully restored vPostgres and vCD.
This was an interesting issue, and I hope this troubleshooting process helps others facing a similar problem.
Share this article if you think it is worth sharing. If you have any questions or comments, leave them here or contact me on Twitter (yes, for me it is still Twitter, not X).