For the past two days, I’ve been troubleshooting a frustrating issue with my NSX v4.2 cluster. Both nodes refused to start, showing only error code 101 in the browser with no service details (unknown state). Connectivity between nodes was inconsistent, and the Corfu service failed, making troubleshooting even more challenging.
What followed was a deep dive into debugging, multiple failed attempts, and plenty of trial and error before I finally restored full cluster functionality. In this post, “Troubleshooting NSX Manager Cluster Issues: Corfu, Connectivity, and Service Failures,” I’ll document the entire process: how I identified the issue, the step-by-step troubleshooting, what didn’t work, what did, and my conclusions about why it happened.
The Initial Problem
It all started with one of my NSX Manager nodes, 192.168.1.85, misbehaving. Initially, I focused on troubleshooting 85 without realizing that 86 was also affected. The NSX GUI reported degraded clustering, and the logs showed alarms related to Corfu, monitoring services, and minimum capacity thresholds.
I started by checking 85 first since it was the one showing issues.
Note: The following steps were performed in my own test lab environment and worked successfully in that setup. However, if you’re operating in a production environment, proceed with caution. These actions should only be carried out with guidance from VMware Support to avoid potential risks or disruptions.
Step 1: Checking Connectivity
To confirm that both nodes were communicating, I ran:
ping -c 4 192.168.1.86
This showed that 85 could reach 86 with no packet loss. However, trying to check Corfu connectivity using nc (Netcat) failed.
nc -zv 192.168.1.86 9000
nc: connect to 192.168.1.86 port 9000 (tcp) failed: Connection refused
Then, I checked whether 85 was listening on port 9000.
netstat -tulnp | grep 9000
Output:
(No output)
This meant Corfu was not running or not listening properly on 85.
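The ping/nc/netstat checks above can be wrapped into a small helper for repeated use. This is my own sketch for a lab setup, not an NSX tool: `port_open` is a hypothetical helper name, and it uses bash’s built-in `/dev/tcp` so it works even on hosts where nc isn’t installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: exits 0 only if host:port accepts a TCP connection.
# Uses bash's built-in /dev/tcp redirection, so no nc dependency.
port_open() {
  local host=$1 port=$2
  timeout 2 bash -c ">/dev/tcp/${host}/${port}" 2>/dev/null
}

# Check the Corfu port on both manager nodes (lab IPs from this post).
for node in 192.168.1.85 192.168.1.86; do
  if port_open "$node" 9000; then
    echo "$node:9000 reachable"
  else
    echo "$node:9000 NOT reachable"
  fi
done
```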
To confirm, I checked the Corfu service status:
service corfu-server status
Output:
● corfu-server.service - Corfu Infrastructure Server
Active: active (running) since Mon 2025-03-03 18:21:10 UTC; 3min 5s ago
...
Mar 03 18:21:32 nsx-vcd-veeam corfu-server[270994]: ERROR | CorfuServer | Failed starting server
Mar 03 18:21:36 nsx-vcd-veeam corfu-server[270994]: ERROR | ClientHandshakeHandler | fireHandshakeFailed: Handshake Failed. Close Channel.
Mar 03 18:21:40 nsx-vcd-veeam corfu-server[270994]: ERROR | ClientHandshakeHandler | Handshake Failed. Close Channel.
This confirmed that Corfu was running but not communicating correctly.
Step 2: Investigating Corfu on 85
Since Corfu is critical for NSX’s distributed database synchronization, I needed to check whether it was running properly on 85. The logs showed handshake failures, meaning Corfu couldn’t establish proper communication with its peer.
I restarted Corfu:
service corfu-server restart
Checking status again:
service corfu-server status
Errors persisted:
Mar 03 18:22:12 nsx-vcd-veeam corfu-server[272004]: ERROR | CorfuServer | Failed starting server
Mar 03 18:22:14 nsx-vcd-veeam corfu-server[272004]: WARN | Segment | closeSegmentHandlers: Segment /config/corfu/log/203.log is trimmed
Since restarting didn’t work, I decided to clear Corfu logs and reset the database.
Step 3: Cleaning Corfu Data on 85
Since Corfu logs showed errors related to segment handling and trim failures, I forcefully cleared them:
rm -rf /var/log/corfu/*
rm -rf /config/corfu/*
service corfu-server restart
This time, Corfu started properly on 85, but the connection was still failing when I tested the communication with 86.
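Wiping the log and config directories with rm -rf is destructive, so in hindsight a safer variant would archive the state first. Below is a sketch of that idea (my own helper, not a VMware-provided tool); the paths match the ones used above but are parameterized so they can be adapted.

```shell
#!/usr/bin/env bash
# Sketch: back up Corfu state before wiping it, so the reset is reversible.
# My own helper, not a VMware tool; adjust paths for your deployment.
set -euo pipefail

reset_corfu() {
  local log_dir=$1 data_dir=$2 backup_dir=$3
  mkdir -p "$backup_dir"
  # Copy (not move) so a partial failure leaves the originals intact.
  if [ -d "$log_dir" ]; then cp -a "$log_dir" "$backup_dir/log"; fi
  if [ -d "$data_dir" ]; then cp -a "$data_dir" "$backup_dir/config"; fi
  # The :? expansion guards against an empty variable becoming "rm -rf /*".
  rm -rf "${log_dir:?}"/* "${data_dir:?}"/*
  echo "Corfu state archived to $backup_dir and cleared."
}

# Usage on a manager node (then restart the service):
#   reset_corfu /var/log/corfu /config/corfu "/tmp/corfu-backup-$(date +%s)"
#   service corfu-server restart
```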
Step 4: Investigating Corfu on 86
After confirming that 85 was working, I turned my attention to 86. Checking the Corfu status showed the same errors. The service was running, but it wasn’t listening on its own IP and port.
netstat -tulnp | grep 9000
Output:
(No output)
Checking logs:
grep -iE "error|fail|warn" /var/log/corfu/corfu.9000.log | tail -n 20
Errors:
ERROR | CorfuServer | Failed starting server
ERROR | ClientHandshakeHandler | fireHandshakeFailed: Handshake Failed. Close Channel.
This meant 86 was running Corfu but wasn’t properly binding to the network.
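Since I ended up running that same log scan on both nodes, it’s worth turning into a tiny helper. The function name is my own invention, and the usage line shows the log path from this post; it does nothing NSX-specific, just the grep/tail pipeline from above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show the last N error/fail/warn lines of a log file.
corfu_errors() {
  local logfile=$1 lines=${2:-20}
  grep -iE "error|fail|warn" "$logfile" | tail -n "$lines"
}

# e.g. corfu_errors /var/log/corfu/corfu.9000.log 20
```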
Step 5: Cleaning Corfu Data on 86
Since clearing Corfu logs and data worked on 85, I applied the same fix on 86.
rm -rf /var/log/corfu/*
rm -rf /config/corfu/*
service corfu-server restart
This time, Corfu correctly started and was now listening on 192.168.1.86:9000.
To verify, I ran:
netstat -tulnp | grep 9000
Output:
tcp 0 0 192.168.1.86:9000 0.0.0.0:* LISTEN 97076/java
Now Corfu was listening on 86!
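A side note: netstat worked here, but on newer Linux appliances it is often absent in favor of ss from iproute2. The same listener check can be written as a small function (my own sketch, using ss’s built-in port filter):

```shell
#!/usr/bin/env bash
# Hypothetical helper: exits 0 if something is listening on the given TCP
# port. Uses ss, the modern replacement for netstat.
listening_on() {
  local port=$1
  # ss filter selects listeners by source port; skip the header line,
  # then succeed only if any matching socket was printed.
  ss -tln "sport = :${port}" | tail -n +2 | grep -q .
}

# e.g. listening_on 9000 && echo "Corfu port is listening"
```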
Step 6: Testing Cross-Node Connectivity
With Corfu running on both nodes, I tested connectivity again:
nc -zv 192.168.1.86 9000
Connection to 192.168.1.86 9000 port [tcp/*] succeeded!
Success! Now that 85 could reach 86, it was time to check the NSX Manager cluster status.
get cluster status
At first, it was still DEGRADED, but then services started coming back up automatically.
Step 7: Monitoring NSX Cluster Recovery
I kept monitoring get cluster status and saw that services were slowly transitioning from DEGRADED to STABLE. However, the MONITORING service was still down.
Group Type: MONITORING
Group Status: DEGRADED
Members:
9d90eb58-2367-4bbd-9aaa-b3cc74d08dac 192.168.1.86 DOWN
4eff1642-f0f3-a5e8-b9ac-37dffe169c95 192.168.1.85 UP
I restarted it manually:
restart service monitoring
Once that was done, all services were finally UP on both 85 and 86, and the NSX Manager cluster was fully operational.
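The recovery in Step 7 took a while, and re-running the nc check by hand got tedious. A small polling loop (again my own sketch, reusing the bash /dev/tcp trick) can wait until the Corfu port on a peer node answers, or give up after a set number of attempts:

```shell
#!/usr/bin/env bash
# Sketch: poll until host:port accepts TCP connections, or give up.
# Handy while waiting for cluster services to come back after a restart.
wait_for_port() {
  local host=$1 port=$2 attempts=${3:-30} delay=${4:-5}
  local i
  for ((i = 1; i <= attempts; i++)); do
    if timeout 2 bash -c ">/dev/tcp/${host}/${port}" 2>/dev/null; then
      echo "$host:$port is up (attempt $i)"
      return 0
    fi
    sleep "$delay"
  done
  echo "$host:$port still down after $attempts attempts" >&2
  return 1
}

# e.g. wait_for_port 192.168.1.86 9000 60 5   # poll for up to ~5 minutes
```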
Why Did This Happen?
From my troubleshooting, I suspect the root cause was Corfu log corruption or a stale database on both 85 and 86.
- NSX relies on Corfu for distributed state synchronization, and if Corfu isn’t running properly, NSX services fail.
- Corfu was running on both nodes but wasn’t listening on the correct ports, which prevented them from syncing.
- Clearing logs and the database forced a reset that allowed Corfu to start properly and bind to the correct network interfaces.
- Once Corfu ran properly on both nodes, services could reconnect and recover automatically.
Lessons Learned
- Always check Corfu first when NSX Manager clustering issues arise.
- Just because the service is “running” doesn’t mean it’s working. Always check if it’s listening on the correct port.
- Netcat (nc) is a great tool for verifying connectivity.
- If restarting doesn’t help, clearing logs and the database might be necessary.
- Fixing one node at a time is the right approach. Address the node that should be working first before fixing the unreachable node.
This was another interesting issue, and I hope this troubleshooting process helps others facing a similar problem.
Share this article if you think it’s worth sharing. If you have any questions or comments, leave a comment here or contact me on Twitter (yes, for me it’s still Twitter, not X).