A couple of weeks ago, we had a strange issue with one of our NSX-T clusters: the log partition was full, and we were not able to log in.
I am not the first-line support for this system, so the support team hit the problem first: they could not log in to the NSX-T console or through SSH, though the GUI sometimes still worked. They thought it was a user/password problem and tried to reset the password a couple of times. That was the wrong decision; they should not have done that.
This did not work, and it triggered more issues within the NSX-T nodes. To make things worse, there were no NSX-T backups configured in this environment.
Note: When resetting the root/admin password on an NSX-T cluster node, do not forget to touch the file “/config/vmware/nsx-node-api/reset_cluster_credentials”. If you do not, about 5 minutes later the system will roll the password back to the old one (since the node cannot sync the change with the other nodes, the NSX-T cluster needs to be told the password was changed intentionally, by touching this file).
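For reference, a minimal sketch of that reset sequence from a root shell on the node (assuming the admin account is the one being reset; the touch step is the one described in the note above):

passwd admin                                                  # set the new password
touch /config/vmware/nsx-node-api/reset_cluster_credentials   # mark the change as intentional so it is not rolled back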
Before the whole NSX-T cluster went down, it was possible to see the source of the issue in the GUI: the /var/log partition was at 100%, and with all the dump files generated (a lot of bad logins, reboots, etc.), /var/dump was also full.
The environment is an NSX-T 2.4.1 Cluster with three nodes.
NSX-T Error
It started with one node degraded (when the log partition was still at 96/97%), and usage then quickly climbed to 100%.
Then all nodes in the cluster went to error, and after a while, all were down.
When trying to connect to the GUI, all services were down or in an unknown status.
Note: For security reasons, I have changed the IPs and hostnames in the images.
Since I could not log in to the console or through SSH, I needed to find a way to delete these logs and try to recover the nodes.
Run Ubuntu on the NSX node.
The solution I found was to boot the node from an Ubuntu ISO, the same method used to reset the NSX-T root/admin password; you can download the ISO from HERE.
By rebooting the NSX-T node and booting with Ubuntu, we can access the partitions, delete some log files, and try to recover the NSX-T cluster.
Upload the ISO to a datastore, attach it to your NSX-T VM, and reboot the VM from the Ubuntu ISO.
Select your language and then “Rescue a broken system.”
Ubuntu will then complain that it cannot configure the network (DHCP or static). Just continue and select the option “Do not configure the network at this time.” We do not need a network to perform these tasks.
Next, do:
- Hostname: Leave the default “ubuntu”.
- Configure the clock: Select your time zone.
Next, select the partition you want to use as the root device. In this case we need to work on /var/log/, which is the /dev/nsx/var+log/ device, so we can select that partition directly.
Alternatively, we could use the option “Do not use a root file system” and then mount /var/log/ manually. But since I do not need anything other than /dev/nsx/var+log/, I simply select that partition, and it gets mounted under /target/.
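If you do go the “Do not use a root file system” route, the manual equivalent from the rescue shell would be roughly the following (a sketch; the device path is the one the installer lists for the NSX-T log volume, and the mount point may need to be created first):

mkdir -p /target
mount /dev/nsx/var+log /target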
Next, select “Execute a shell in the installer environment” so that we can work in the partition we need.
As we can see in the next image, Ubuntu boots with the /dev/nsx/var+log/ volume mounted on /target.
Check and delete logs.
Now we have access to the partition. We confirm that it is 100% full and check which logs are filling it.
Going through the logs, I noticed that two of them were huge and were the ones that triggered the issue: syslog.1 and auth.log.1.
Those two logs accounted for 80% of the partition usage, so after I deleted them, usage dropped to 21%.
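For clarity, these are roughly the commands involved in the rescue shell (a sketch, assuming the log partition is mounted on /target as above and the usual busybox applets are available):

df -h /target                             # confirm the partition is at 100%
du -ak /target | sort -rn | head -n 10    # list the biggest files/directories (in KB)
rm /target/syslog.1 /target/auth.log.1    # delete the two offenders
df -h /target                             # usage should drop (to 21% in my case)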
After rebooting the NSX-T node, I was able to log in (first I had to reset the password again, since I did not know what it had been changed to). I also needed to clean up the /var/dump/ partition before recovering the nodes.
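Since I could log in at this point, /var/dump/ could be cleaned directly from the node's root shell, along these lines (the file names will vary, so review the list before deleting anything):

df -h /var/dump            # check how full the partition is
ls -lh /var/dump           # review the dump files
rm /var/dump/<old-files>   # delete the ones that are no longer needed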
Checking the logs first, I could see a lot of sync errors and problems in the CBM (Cluster Boot Manager).
nsxt-manager-NSXT-01 NSX 18573 - [nsx@6876 comp="nsx-manager" errorCode="MP2101" subcomp="manager"] Request GET http://localhost:7441/api/v1/cluster-manager/config HTTP/1.1 failed, return code is 503
nsxt-manager-NSXT-01 NSX 18573 - [nsx@6876 comp="nsx-manager" errorCode="MP2123" subcomp="manager"] Cluster config retrieved from cluster manager is empty.
Most of these errors were there because the cluster was down and the nodes were not accessible, so I could not find the real issue that triggered the problem. My guess is that the logs had been growing for many days until they reached 100% of the partition.
There was no useful information left in syslog or cbm.log to understand what was wrong and why this happened; since I had deleted the previous logs, there was not much to analyze.
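As a side note for next time: when a node is reachable, these logs can be reviewed from the NSX CLI without dropping to a root shell, for example (assuming the get log-file command available in the 2.4.x CLI):

NSXT-01> get log-file syslog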
Fix NSX-T Cluster.
After performing the above tasks on all NSX nodes in the cluster, I rebooted each one and logged in to the console.
Checking the cluster status, everything was still in UNKNOWN status and down.
root@NSXT-02:~# su admin
NSX CLI (Manager, Policy, Controller 2.4.1.0.0.13716579). Press ? for command list or enter: help
NSXT-02> get cluster status
Cluster Id: 2925c404-5f61-4572-a2da-643f5e56a0ff

Group Type: DATASTORE
Group Status: STABLE
Members:
    UUID                                   FQDN      IP              STATUS
    eae91642-d50a-8157-2ffe-c9945250467b   NSXT-01   192.168.1.150   UP
    ea7dd8b9-a78a-4626-be3b-30114a776ab3   NSXT-03   192.168.1.152   UP
    1eb5cfb4-ba2d-4dca-8d04-f13baf35c9eb   NSXT-02   192.168.1.151   UP

Group Type: CLUSTER_BOOT_MANAGER
Group Status: STABLE
Members:
    UUID                                   FQDN      IP              STATUS
    ea7dd8b9-a78a-4626-be3b-30114a776ab3   NSXT-03   192.168.1.152   UP
    eae91642-d50a-8157-2ffe-c9945250467b   NSXT-01   192.168.1.150   UP
    1eb5cfb4-ba2d-4dca-8d04-f13baf35c9eb   NSXT-02   192.168.1.151   UP

Group Type: CONTROLLER
Group Status: DEGRADED
Members:
    UUID                                   FQDN      IP              STATUS
    c39205ee-e51a-468e-b7c1-3daeb3f6f4c0   NSXT-02   192.168.1.151   DOWN
    fbc9ce78-d1a8-43de-a3ad-98ac66c5c45b   NSXT-01   192.168.1.150   UP
    2c05edd4-0d28-4f55-8bf2-d99e7fc8e6d1   NSXT-03   192.168.1.152   DOWN

Group Type: MANAGER
Group Status: UNAVAILABLE
Members:
    UUID                                   FQDN      IP              STATUS
    ea7dd8b9-a78a-4626-be3b-30114a776ab3   NSXT-03   192.168.1.152   DOWN
    eae91642-d50a-8157-2ffe-c9945250467b   NSXT-01   192.168.1.150   DOWN
    1eb5cfb4-ba2d-4dca-8d04-f13baf35c9eb   NSXT-02   192.168.1.151   DOWN

Group Type: POLICY
Group Status: UNAVAILABLE
Members:
    UUID                                   FQDN      IP              STATUS
    ea7dd8b9-a78a-4626-be3b-30114a776ab3   NSXT-03   192.168.1.152   DOWN
    eae91642-d50a-8157-2ffe-c9945250467b   NSXT-01   192.168.1.150   DOWN
    1eb5cfb4-ba2d-4dca-8d04-f13baf35c9eb   NSXT-02   192.168.1.151   DOWN

Group Type: HTTPS
Group Status: UNAVAILABLE
Members:
    UUID                                   FQDN      IP              STATUS
    ea7dd8b9-a78a-4626-be3b-30114a776ab3   NSXT-03   192.168.1.152   DOWN
    eae91642-d50a-8157-2ffe-c9945250467b   NSXT-01   192.168.1.150   DOWN
    1eb5cfb4-ba2d-4dca-8d04-f13baf35c9eb   NSXT-02   192.168.1.151   DOWN
To try to fix the NSX nodes, I powered off nodes 01 and 03, rebooted node 02, and tried to recover that one alone first.
After some minutes, node 02 was working, and its status in the GUI was green. So I powered on 01 and 03 and waited a while so that they could sync. After 30 minutes, 01 was working, but 03 was still down.
I was able to log in to node 03, but every command in the NSX admin console returned: “% The get cluster config operation cannot be processed currently, please try again later.”
After several attempts to recover node 03, we decided it was best to detach it from the cluster and deploy a new one.
How to detach a node.
root@NSXT-01:~# su admin
NSX CLI (Manager, Policy, Controller 2.4.1.0.0.13716579). Press ? for command list or enter: help
NSXT-01> get managers
- 192.168.1.152   Standby
- 192.168.1.151   Connected
- 192.168.1.150   Connected

NSXT-01> get nodes
UUID                                   Type   Display Name
eae91642-d50a-8157-2ffe-c9945250467b   mgr    NSXT-01
ea7dd8b9-a78a-4626-be3b-30114a776ab3   mgr    NSXT-03
1eb5cfb4-ba2d-4dca-8d04-f13baf35c9eb   mgr    NSXT-02

NSXT-01> detach node ea7dd8b9-a78a-4626-be3b-30114a776ab3
Node has been detached. Detached node must be deleted permanently.
After deploying a new NSX node, the cluster has been running at 100%, with no issues since.
Final Note:
After I fixed the log partition issue, I got help and good tips from Abdullah Abdullah on how to fix the cluster, which made this “NSX-T log partition full and not able to login” blog post possible. Once again, I would like to thank him for his support and time. Please check Abdullah’s blog, doOdzZZ’sNotes; there is great content about NSX there.
I hope my first NSX-T blog post, “NSX-T log partition full and not able to login,” was useful.
Note: Share this article if you think it is worth sharing.