NSX-T log partition full and not able to login

A couple of weeks ago, we had a strange issue with one of our NSX-T clusters: the log partition was full, and nobody was able to log in.

Since I am not the first line of support for this system, the issue first went to the first-line team. They could not log in to the NSX-T console or through ssh, although the GUI sometimes worked. They thought it was a user/password problem and tried to reset the password a couple of times. That was the wrong decision; they should not have done that.

This did not work and triggered more issues within the NSX-T nodes, and there were no NSX-T Backups set in this environment.

Note: When resetting the root/admin password for an NSX-T Cluster, do not forget that you need to touch the file “/config/vmware/nsx-node-api/reset_cluster_credentials”. If you do not, the system will roll the password back to the old one after 5 minutes (the node cannot sync with the other nodes, and the NSX-T Cluster learns that the password was changed from this file being touched).
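
A minimal sketch of the “touch” step from the note, assuming Python 3 is available where you run it and that an empty file at this path is all that is needed. The path is the one quoted above; the function name touch_marker is hypothetical. From a rescue environment, the file system sits under a mount point, so the path would have to be adjusted accordingly.

```python
from pathlib import Path

# Path quoted in the note above. When the node is booted from a rescue ISO,
# the file system sits under a mount point, so the caller passes the real path.
MARKER = Path("/config/vmware/nsx-node-api/reset_cluster_credentials")


def touch_marker(marker: Path = MARKER) -> Path:
    """Create an empty marker file, or refresh its timestamp if it exists."""
    marker.touch(exist_ok=True)
    return marker
```

Calling touch_marker() with no argument uses the path from the note; passing another Path lets you try it safely on a test directory first.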

Before all of NSX-T went down, it was possible to see the source of the issue in the GUI: the /var/log partition was at 100%. Then, with some dump files (from a lot of bad logins, reboots, etc.), /var/dump filled up as well.

The environment is an NSX-T 2.4.1 Cluster with three nodes.

NSX-T Error

It started with one node degraded (when logs were still at 96/97%), and usage then quickly climbed to 100%.

Then all nodes in the cluster went to error, and after a while, all were down.


When trying to connect to the GUI, all services were down and showed an unknown status.


Note: For security reasons, I have changed IP and hostnames in the images.

Since I could not log in to the console or through ssh, I needed to find a way to delete these logs and try to recover the nodes.

Run Ubuntu on the NSX node.

The solution I found was to use Ubuntu to reset the NSX-T root/admin password; you can download it from HERE.

By rebooting the NSX-T node and booting it with Ubuntu, we can access the partitions, delete some log files, and try to recover the NSX-T Cluster.

Add the ISO to a Datastore, attach it to your NSX-T VM, and reboot the VM from the Ubuntu ISO.

Select your language and then “Rescue a broken system.”


Next, Ubuntu will report that it cannot configure a network (DHCP or static). Just continue and select the option “Do not configure network at this time.” We do not need a network to perform these tasks.


Next, do:

  • Hostname: Leave the default “ubuntu”.
  • Configure the clock: Select your time zone.

Next, select the device to use as the root file system. In this case we need to work in /var/log/, which is /dev/nsx/var+log/, so we can select that partition directly.

Alternatively, we could use the option “Do not use a root file system” and then mount /var/log/ manually. But since I do not need anything other than /dev/nsx/var+log/, I select just that partition, which will be mounted in /target/.

Next, select “Execute a shell in the installer environment” so that we can work in the partition we need.

As the next image shows, Ubuntu boots with /dev/nsx/var+log/ mounted in /target.

Check and delete logs.

Now we have access to the partition. We can confirm that it is 100% full and check which logs are filling it.
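
As a rough illustration of the “how full is it” check, here is a minimal Python sketch. It assumes Python 3 is available, which is probably not the case inside the installer rescue shell (there, plain df does the same job), so treat it as something to run on a normal system. The function name usage_percent is hypothetical.

```python
import shutil


def usage_percent(mount_point: str) -> float:
    """Return the used share of the file system holding mount_point, in percent."""
    usage = shutil.disk_usage(mount_point)
    if usage.total == 0:
        return 0.0  # pseudo file systems can report a zero size
    return 100.0 * usage.used / usage.total
```

For the partition in this post, the call would be usage_percent("/target"). The figure can differ slightly from what df prints, because df leaves the blocks reserved for root out of its percentage.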

Checking the logs, I noticed that two were very large and were the ones that triggered the issue: syslog.1 and auth.log.1.
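
Finding the big files can be sketched as below. This is only an illustration under the same assumption as before (Python 3 available, which the rescue shell may not offer); largest_files is a hypothetical helper, not part of any NSX-T tooling.

```python
import os
from typing import List, Tuple


def largest_files(root: str, top: int = 10) -> List[Tuple[int, str]]:
    """Return up to `top` (size_in_bytes, path) pairs under root, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat does not follow symlinks, so a link is not counted as its target.
                sizes.append((os.lstat(path).st_size, path))
            except OSError:
                continue  # file vanished or is unreadable; skip it
    sizes.sort(reverse=True)
    return sizes[:top]
```

Running largest_files("/target", top=5) on the mounted log partition would put files like syslog.1 and auth.log.1 at the top of the list if they dominate the usage, as they did here.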

Those two logs accounted for 80% of the partition usage. After I deleted them, usage dropped to 21%.

After rebooting the NSX-T node, I was able to log in (first I had to reset the password again, since I did not know what it had been changed to). I also needed to clean up the /var/dump/ partition before recovering the nodes.

Before going further, I checked the logs and could see a lot of sync errors and problems in the CBM.

But most of these issues were because the Cluster was down and the nodes were not accessible, so I could not find any real issue that triggered this problem. I think the problem had been feeding the logs for many days until they reached 100% of the partition.

Since I had deleted the older logs, there was no useful information left in syslog.log or cbm.log to explain what went wrong and why it happened.
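
One way to keep a little evidence before deleting a huge log is to tally a few message types first. The sketch below is only an illustration: the patterns are the usual OpenSSH/PAM wording and are an assumption about what an NSX-T auth.log contains, and the names PATTERNS, tally and tally_file are hypothetical. It also assumes Python 3 is available where the file can be read.

```python
from collections import Counter
from typing import Iterable, Tuple

# Substrings to count. "Failed password" is the usual OpenSSH wording;
# whether an NSX-T node logs exactly these messages is an assumption.
PATTERNS: Tuple[str, ...] = ("Failed password", "authentication failure")


def tally(lines: Iterable[str], patterns: Tuple[str, ...] = PATTERNS) -> Counter:
    """Count how many lines contain each pattern."""
    counts: Counter = Counter()
    for line in lines:
        for pattern in patterns:
            if pattern in line:
                counts[pattern] += 1
    return counts


def tally_file(path: str) -> Counter:
    """Tally the patterns in a log file, reading it line by line."""
    # errors="replace" keeps going if the log contains undecodable bytes.
    with open(path, "r", errors="replace") as handle:
        return tally(handle)
```

Because the file is read line by line, this works on very large logs without loading them into memory. Saving the resulting counts somewhere off the node before deleting the file would have left at least a hint of what was filling it.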

Fix NSX-T Cluster.

After doing the above tasks on all NSX nodes in the Cluster to fix each one, I rebooted them and logged in to the console.

Checking the Cluster status, all nodes were still down with UNKNOWN status.

To try to fix the NSX nodes, I powered off 01 and 03, rebooted 02, and tried to recover this one only.

After some minutes, 02 was working, and its status in the GUI was green. So I powered on 01 and 03 and waited a while so that they could sync. After 30 minutes, 01 was working, but 03 was still down.

I was able to log in to node 03, but for each command in the NSX admin console, I got: “% The get cluster config operation cannot be processed currently, please try again later.”

After several attempts to recover node 03, it was decided that it was best to detach this node from the Cluster and deploy a new one.

How to detach a node.

After deploying a new NSX node, the Cluster was running at 100% and has had no issues since.

Final Note:

To write this blog post, and after I fixed the log partition issue, I got help and good tips from Abdullah Abdullah on how to fix the Cluster. Once again, I would like to thank him for his support and time. Please check Abdullah’s blog doOdzZZ’sNotes. There is great content about NSX.

I hope my first NSX-T blog post, “NSX-T log partition full and not able to login,” was useful.

Note: Share this article if you think it is worth sharing.

©2020 ProVirtualzone. All Rights Reserved
April 22nd, 2020 | NSX, VMware Posts

About the Author:

I have over 20 years of experience in the IT industry. I have been working with Virtualization for more than 15 years (mainly VMware). I recently obtained certifications, including VCP DCV 2022, VCAP DCV Design 2023, and VCP Cloud 2023. Additionally, I have VCP6.5-DCV, VMware vSAN Specialist, vExpert vSAN, vExpert NSX, vExpert Cloud Provider for the last two years, vExpert for the last 7 years, and an old MCP. My specialties are Virtualization, Storage, and Virtual Backup. I am a Solutions Architect in the areas of VMware, Cloud, and Backup / Storage. I am employed by ITQ, a VMware partner, as a Senior Consultant. I am also a blogger, the owner of the blog ProVirtualzone.com, and recently a book author.
