
Troubleshooting NSX Manager Cluster Issues: Corfu, Connectivity, and Service Failures

For the past two days, I’ve been troubleshooting a frustrating issue with my NSX v4.2 cluster. Both nodes refused to start, showing only error code 101 in the browser with no service details (unknown state). Connectivity between nodes was inconsistent, and the Corfu service failed, making troubleshooting even more challenging.

What followed was a deep dive into debugging, multiple failed attempts, and much trial and error before I finally restored full cluster functionality. In this post, “Troubleshooting NSX Manager Cluster Issues: Corfu, Connectivity, and Service Failures,” I’ll document the entire process: how I identified the issue, the step-by-step troubleshooting, what didn’t work, what did, and my conclusions about why it happened.

The Initial Problem

It all started with one of my NSX Manager nodes, 192.168.1.85, showing some issues. Initially, I was focused on troubleshooting problems on 85 without realizing that 86 was also affected. The NSX GUI reported degraded clustering, and logs showed alarms related to Corfu, monitoring services, and minimum capacity thresholds.

I started by checking 85 first since it was the one showing issues.

Note: The following steps were performed in my own test lab environment and worked successfully in that setup. However, if you’re operating in a production environment, proceed with caution. These actions should only be carried out with guidance from VMware Support to avoid potential risks or disruptions.

Step 1: Checking Connectivity

To confirm that both nodes were communicating, I ran:

ping -c 4 192.168.1.86

This showed that 85 could reach 86 with no packet loss. However, trying to check Corfu connectivity using nc (Netcat) failed.

nc -zv 192.168.1.86 9000
nc: connect to 192.168.1.86 port 9000 (tcp) failed: Connection refused

Then, I checked whether 85 was listening on port 9000.

netstat -tulnp | grep 9000

Output:

(No output)

This meant Corfu was not running or not listening properly on 85.
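Since this listener check comes up again on the second node, it can be wrapped in a tiny helper. This is just a sketch of my own; `check_listening` is not an NSX tool, and it simply filters `netstat` output for a LISTEN socket on the given port:

```shell
# check_listening PORT: read `netstat -tulnp` output on stdin and
# succeed only if some socket is LISTENing on PORT.
check_listening() {
  grep -E ":$1[[:space:]].*LISTEN" >/dev/null
}

# On the node itself:
#   netstat -tulnp | check_listening 9000 && echo listening || echo "not listening"
```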

To confirm, I checked the Corfu service status:

service corfu-server status

Output:

● corfu-server.service - Corfu Infrastructure Server
     Active: active (running) since Mon 2025-03-03 18:21:10 UTC; 3min 5s ago
...
Mar 03 18:21:32 nsx-vcd-veeam corfu-server[270994]: ERROR | CorfuServer | Failed starting server
Mar 03 18:21:36 nsx-vcd-veeam corfu-server[270994]: ERROR | ClientHandshakeHandler | fireHandshakeFailed: Handshake Failed. Close Channel.
Mar 03 18:21:40 nsx-vcd-veeam corfu-server[270994]: ERROR | ClientHandshakeHandler | Handshake Failed. Close Channel.

This confirmed that Corfu was running but not communicating correctly.

Step 2: Investigating Corfu on 85

Since Corfu is critical for NSX’s distributed database synchronization, I needed to check whether it ran properly on 85. The logs showed handshake failures, meaning Corfu couldn’t establish proper communication.
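Because the same error/fail/warn triage of the Corfu log keeps coming up during this kind of troubleshooting, a small filter function saves retyping. A sketch of my own; the default log path matches the one used later in this post:

```shell
# corfu_errors [FILE]: show the last 20 error/fail/warn lines from a
# Corfu log (defaults to the path used on these appliances).
corfu_errors() {
  grep -iE "error|fail|warn" "${1:-/var/log/corfu/corfu.9000.log}" | tail -n 20
}
```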

I restarted Corfu:

service corfu-server restart

Checking status again:

service corfu-server status

Errors persisted:

Mar 03 18:22:12 nsx-vcd-veeam corfu-server[272004]: ERROR | CorfuServer | Failed starting server
Mar 03 18:22:14 nsx-vcd-veeam corfu-server[272004]: WARN  | Segment | closeSegmentHandlers: Segment /config/corfu/log/203.log is trimmed

Since restarting didn’t work, I decided to clear Corfu logs and reset the database.

Step 3: Cleaning Corfu Data on 85

Since Corfu logs showed errors related to segment handling and trim failures, I forcefully cleared them:

rm -rf /var/log/corfu/*
rm -rf /config/corfu/*
service corfu-server restart

This time, Corfu started properly on 85, but the connection was still failing when I tested the communication with 86.
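In hindsight, deleting state outright with `rm -rf` is risky even in a lab. A safer variant moves the directories aside first, so there is something to roll back to if the reset makes things worse. This is my own sketch, not an NSX procedure: `backup_and_reset` and the backup path are invented names, and the recreated directory may need its original ownership restored (e.g. with `chown`) before corfu-server will use it.

```shell
# backup_and_reset SRC DEST: move SRC under DEST for safekeeping,
# then recreate SRC as an empty directory.
backup_and_reset() {
  src=$1; dest=$2
  mkdir -p "$dest"
  [ -d "$src" ] && mv "$src" "$dest/"
  mkdir -p "$src"
}

# Usage on the node (then restart the service):
#   ts=$(date +%Y%m%d-%H%M%S)
#   backup_and_reset /var/log/corfu /root/corfu-backup-$ts/log
#   backup_and_reset /config/corfu  /root/corfu-backup-$ts/config
#   service corfu-server restart
```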

Step 4: Investigating Corfu on 86

After confirming that 85 was working, I turned my attention to 86. Checking the Corfu status showed the same errors. The service was running, but it wasn’t listening on its own IP and port.

netstat -tulnp | grep 9000

Output:

(No output)

Checking logs:

cat /var/log/corfu/corfu.9000.log | grep -i "error\|fail\|warn" | tail -n 20

Errors:

ERROR | CorfuServer | Failed starting server
ERROR | ClientHandshakeHandler | fireHandshakeFailed: Handshake Failed. Close Channel.

This meant 86 was running Corfu but wasn’t properly binding to the network.

Step 5: Cleaning Corfu Data on 86

Since clearing Corfu logs and data worked on 85, I applied the same fix on 86.

rm -rf /var/log/corfu/*
rm -rf /config/corfu/*
service corfu-server restart

This time, Corfu correctly started and was now listening on 192.168.1.86:9000.

To verify, I ran:

netstat -tulnp | grep 9000

Output:

tcp        0      0 192.168.1.86:9000       0.0.0.0:*               LISTEN      97076/java

Now Corfu was listening on 86!

Step 6: Testing Cross-Node Connectivity

With Corfu running on both nodes, I tested connectivity again:

nc -zv 192.168.1.86 9000
Connection to 192.168.1.86 9000 port [tcp/*] succeeded!

Success! Now that 85 could reach 86, it was time to check the NSX Manager cluster status.

get cluster status

At first, it was still DEGRADED, but then services started coming back up automatically.
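Rather than eyeballing the full `get cluster status` output on every pass, the degraded groups and DOWN members can be extracted with a small filter. A sketch of my own; the field layout matches the output shown in this post, and I'm assuming the NSX CLI can be invoked from the root shell via `nsxcli` as on my appliances:

```shell
# degraded_groups: read `get cluster status` output on stdin and print
# any group whose status is not STABLE, plus members reported DOWN.
degraded_groups() {
  awk '/Group Type:/        { type = $3 }
       /Group Status:/      { if ($3 != "STABLE") print type, $3 }
       /DOWN[[:space:]]*$/  { print "  member DOWN:", $2 }'
}

# On the manager:
#   nsxcli -c "get cluster status" | degraded_groups
```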


Step 7: Monitoring NSX Cluster Recovery

I kept monitoring get cluster status and saw that services were slowly transitioning from DEGRADED to STABLE. However, the MONITORING service was still down.

Group Type: MONITORING
Group Status: DEGRADED

Members:
    9d90eb58-2367-4bbd-9aaa-b3cc74d08dac       192.168.1.86      DOWN
    4eff1642-f0f3-a5e8-b9ac-37dffe169c95       192.168.1.85      UP

I restarted it manually:

restart service monitoring

Once that was done, all services were finally UP on both 85 and 86, and the NSX Manager cluster was fully operational.

Why Did This Happen?

From my troubleshooting, I suspect the root cause was Corfu log corruption or a stale database on both 85 and 86.

  • NSX relies on Corfu for distributed state synchronization, and if Corfu isn’t running properly, NSX services fail.
  • Corfu was running on both nodes but wasn’t listening on the correct ports, which prevented them from syncing.
  • Clearing logs and the database forced a reset that allowed Corfu to start properly and bind to the correct network interfaces.
  • Once Corfu ran properly on both nodes, services could reconnect and recover automatically.

Lessons Learned

  • Always check Corfu first when NSX Manager clustering issues arise.
  • Just because the service is “running” doesn’t mean it’s working. Always check if it’s listening on the correct port.
  • Netcat (nc) is a great tool for verifying connectivity.
  • If restarting doesn’t help, clearing logs and the database might be necessary.
  • Fixing one node at a time is the right approach. Address the node that should be working first before fixing the unreachable node.
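The “running but not listening” trap above is worth turning into a habit: always combine the three probes (service status, local listener, peer reachability) into a single verdict instead of trusting any one of them. A minimal sketch, with names of my own invention:

```shell
# corfu_triage SERVICE LISTENING PEER: each argument is "yes" or "no",
# as determined by `service corfu-server status`, the netstat listener
# check, and the nc peer check. Prints a one-line verdict.
corfu_triage() {
  if [ "$1" = yes ] && [ "$2" = yes ] && [ "$3" = yes ]; then
    echo "OK: Corfu healthy"
  elif [ "$1" = yes ] && [ "$2" = no ]; then
    echo "SUSPECT: service running but not listening"
  else
    echo "FAIL: service=$1 listening=$2 peer=$3"
  fi
}
```

The middle branch is exactly the failure mode in this post: systemd reported the service active while nothing was bound to port 9000.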

This was another interesting issue, and I hope this troubleshooting process helps others facing a similar problem.

Share this article if you think it is worth sharing. If you have any questions or comments, comment here or contact me on Twitter (yes, for me it’s still Twitter, not X).

©2025 ProVirtualzone. All Rights Reserved
March 3rd, 2025 | NSX, VMware Posts

About the Author:

I have over 20 years of experience in the IT industry and have been working with virtualization for more than 15 years (mainly VMware). I recently obtained certifications including VCP DCV 2022, VCAP DCV Design 2023, and VCP Cloud 2023. Additionally, I hold VCP6.5-DCV and VMware vSAN Specialist, have been a vExpert vSAN, vExpert NSX, and vExpert Cloud Provider for the last two years and a vExpert for the last 7 years, and I am an old MCP. My specialties are virtualization, storage, and virtual backup. I am a Solutions Architect in the VMware, Cloud, and Backup/Storage areas, employed by ITQ, a VMware partner, as a Senior Consultant. I am also a blogger, owner of the blog ProVirtualzone.com, and recently a book author.
