NSX-T log partition full and not able to login

A couple of weeks ago, we had a strange issue with one of our NSX-T clusters: the log partition was full and nobody was able to log in.

I am not the first line of support for this system. The first-line team was not able to log in to the NSX-T console or through SSH, although the GUI sometimes worked. They thought it was a user/password problem and tried to reset the password a couple of times. That was the wrong decision; they should not have done that.

This did not work and also triggered more issues within the NSX-T nodes, and there were no NSX-T backups configured in this environment.

Note: When resetting the root/admin password for an NSX-T Cluster, do not forget to touch the file “/config/vmware/nsx-node-api/reset_cluster_credentials”. If you do not, the system rolls the password back to the old one after 5 minutes (the node cannot sync with the other nodes, and touching this file is how the NSX-T Cluster knows that the password was changed).
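
As a minimal sketch of that step, the marker file can be created with touch. The function name and the optional prefix argument are hypothetical and only there for illustration: on a running node the prefix stays empty, and from a rescue environment it would be wherever the node's file system is mounted.

```shell
# Create the marker file that tells the node the password change is intentional.
# $1 is an optional, hypothetical prefix: empty on a running node, or the
# mount point of the node's file system when working from a rescue ISO.
mark_credentials_reset() {
    root="${1:-}"
    flag="$root/config/vmware/nsx-node-api/reset_cluster_credentials"
    # touch creates the file if it is missing; ls confirms it is there.
    touch "$flag" && ls -l "$flag"
}

# Example on a running node:
#   mark_credentials_reset
```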

Before all NSX-T nodes went down, it was possible to see the source of the issue in the GUI: the /var/log partition was at 100%. Then, with some dump files (a lot of bad logins, reboots, etc.), /var/dump also filled up.

The environment is an NSX-T 2.4.1 Cluster with three nodes.

NSX-T Error

It started with one node degraded (when logs were still at 96/97%), and then usage quickly went to 100%.

Then all nodes in the cluster went to error, and after a while, all were down.

When I tried to connect to the GUI, all services were down and showed an unknown status.

Note: For security reasons, I have changed IP and hostnames in the images.

Since I was not able to log in to the console or through SSH, I needed to find a way to delete these logs and try to recover the nodes.

Run Ubuntu on the NSX node.

The solution I found was to use the Ubuntu ISO that we use to reset the NSX-T root/admin password; you can download it from HERE.

By rebooting the NSX-T node and booting it with Ubuntu, we can access the partitions, delete some log files, and try to recover the NSX-T Cluster.

Add the ISO to a Datastore, attach it to your NSX-T VM, and reboot the VM with the Ubuntu ISO.

Select your language and then “Rescue a broken system.”

Ubuntu will then report that it cannot configure a network (DHCP or static). Just continue and select the option “Do not configure network at this time.” We will not need a network to perform these tasks.

Next, do:

  • Hostname: Leave the default “ubuntu”.
  • Configure the clock: Select your time zone.

Next, select the device to use as the root file system. In this case we need to work in /var/log/, which is /dev/nsx/var+log/, so we can directly select the partition we need.

We could also use the option “Do not use a root file system” and then mount /var/log/ manually. But since I do not need to work on anything other than /dev/nsx/var+log/, I select just that partition, which will be mounted in /target/.

Next, select “Execute a shell in the installer environment” so that we can work in the partition we need.

As we can see in the next image, Ubuntu boots with /dev/nsx/var+log/ mounted in /target.

Check and delete logs.

Now that we have access to the partition, we confirm that it is 100% full and check which logs are filling it.

Checking the logs, I noticed that two were very big and were the ones that triggered the issue: syslog.1 and auth.log.1.

Those two logs accounted for 80% of the partition usage. After I deleted them, usage dropped to 21%.
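
The check-and-delete step can be sketched with the shell functions below. This is a minimal sketch under assumptions, not necessarily the exact commands used at the time: the function names are hypothetical, the log partition is assumed to be mounted on /target as in the rescue steps, and only common tools (df, find, du, sort, head, rm) are used. A stripped-down rescue shell may not support every option shown.

```shell
# Show how full the file system holding the given directory is.
partition_usage() {
    df -h "$1"
}

# List the N largest regular files under a directory
# (size in KB, biggest first). N defaults to 10.
largest_files() {
    dir="$1"
    count="${2:-10}"
    find "$dir" -type f -exec du -k {} \; | sort -rn | head -n "$count"
}

# Delete the named files inside a directory, then show the usage again.
delete_logs() {
    dir="$1"
    shift
    for name in "$@"; do
        rm -f -- "$dir/$name"
    done
    df -h "$dir"
}

# Example, assuming the log partition is mounted on /target:
#   partition_usage /target
#   largest_files /target 5
#   delete_logs /target syslog.1 auth.log.1
```

Keep in mind that deleting rotated logs also deletes the history you may later want for root-cause analysis, so copying them elsewhere first is worth considering if there is space to do so.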

After rebooting the NSX-T node, I was able to log in (first I needed to reset the password again, since I did not know which password had been set). I also needed to fix the /var/dump/ partition before trying to recover the nodes.

But before that, I checked the logs and could see a lot of sync errors and problems in CBM.

Most of these issues were there because the Cluster was down and the nodes were not accessible, so I could not find any real issue that triggered this problem. I think the logs simply kept growing over many days until they reached 100% of the partition.

There was no useful information in syslog.log or cbm.log to understand what was wrong and why this happened. Unfortunately, since I deleted the previous logs, I did not get much information.

Fix NSX-T Cluster.

After doing the above tasks on all NSX nodes in the Cluster to fix each one, I rebooted them and logged in to the console.

Checking the Cluster status, all nodes still showed UNKNOWN status and were down.

So to try to fix the NSX nodes, I powered off 01 and 03, rebooted 02, and tried to recover this one only.

After some minutes, 02 was working, and its status in the GUI was green. So I powered on 01 and 03 and waited a while so that they could sync. After 30 minutes, 01 was working, but 03 was still down.

I was able to log in to node 03, but for each command in the NSX admin console I got: “% The get cluster config operation cannot be processed currently, please try again later”

After several attempts to recover node 03, it was decided that it was best to detach this node from the Cluster and deploy a new one.

How to detach a node.

After deploying a new NSX node, the Cluster was running 100%, and there have been no issues until now.

Final Note:

To write this NSX-T log partition full and not able to login blog post, and after I fixed the log partition issue, I got help and good tips from Abdullah Abdullah on how to fix the Cluster. Once again, I would like to thank him for his support and time. Please check Abdullah's blog, doOdzZZ’sNotes. There is great content about NSX.

I hope my first NSX-T blog post, “NSX-T log partition full and not able to login”, was useful.

Note: Share this article if you think it is worth sharing.

©2020 ProVirtualzone. All Rights Reserved
2020-05-02T21:26:56+02:00 | April 22nd, 2020 | NSX, VMware

About the Author:

I have over 20 years’ experience in the IT industry and have been working with Virtualization for more than 10 years (mainly VMware). I am an MCP, VCP6.5-DCV, VMware vSAN Specialist, Veeam Vanguard 2018/2019, vExpert vSAN 2018/2019 and vExpert for the last 4 years. My specialties are Virtualization, Storage, and Virtual Backups. I work for Elits, a Swedish consulting company, and am allocated to a Swedish multinational networking and telecommunications company as a Tech Lead, acting as a Senior ICT Infrastructure Engineer. I am a blogger and owner of the blog ProVirtualzone.com
