I recently set up a Proxmox VE cluster in my lab using VMware as the base. My goal was to test Veeam migrations and understand how Proxmox handles features such as failover and storage. I did most of the work through the command line because it gave me more control and helped me learn what’s happening behind the scenes. During setup, I encountered issues with repository updates and network configuration, but I found ways to resolve them. I also set up NFS storage and tested how the cluster handles failover when a node goes offline. In this post, I’ll share each step I took, the problems I faced, and how I fixed them.
Since numerous articles are already available on how to install Proxmox, I won’t cover that aspect here. It’s relatively straightforward, and countless step-by-step instructions are available online. So, I’ll start with my three Proxmox VE nodes, which are already up and running.
This blog post provides a detailed, step-by-step process, focusing on command-line interface (CLI) commands, supplemented by graphical user interface (GUI) steps, to configure a three-node Proxmox cluster.
Environment Overview
Setup:
- Three Proxmox VE 8.x nodes (proxmox01, proxmox02, proxmox03), installed as VMs in vSphere 8.x.
- Purpose: Testing Veeam migrations and failover scenarios.
- Networking:
- vmbr0: Management and cluster communication (192.168.1.x).
- vmbr1: Dedicated NFS storage traffic (192.168.10.x).
- Storage: Synology NAS exporting /volume2/Proxmox_NFS over NFS.
- Nested Environment: VMware settings tweaked to support nested virtualization and bonded interfaces.
Why two networks? Separating management and storage traffic prevents contention during high I/O operations and provides better security through network isolation.
Nested Environment Considerations
Since Proxmox was nested in VMware, I installed open-vm-tools for better integration:
```
apt install open-vm-tools -y
systemctl enable --now open-vm-tools
systemctl status open-vm-tools
```
Why Nested and open-vm-tools?
Running Proxmox VE inside VMware vSphere requires open-vm-tools to optimize integration with the hypervisor. This provides:
- Time synchronization: Ensures accurate clock alignment across nodes.
- Graceful shutdown/reboot: Allows clean operations from vSphere.
- IP reporting: Improves visibility in VMware.
- Performance enhancements: Optimizes interaction with virtual hardware.
Important: open-vm-tools is only needed in nested environments and should not be installed on bare-metal Proxmox setups, where there is no VMware layer for it to talk to.
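If you reuse the same setup steps across nested and bare-metal nodes, a small guard keeps open-vm-tools off physical hosts. A minimal sketch, relying on systemd-detect-virt (available on Proxmox VE since it ships systemd):

```shell
# Install open-vm-tools only when the node is actually a VMware guest.
# systemd-detect-virt prints the hypervisor type ("vmware", "kvm", ...)
# or "none" on bare metal.
virt=$(systemd-detect-virt 2>/dev/null || echo none)
if [ "$virt" = "vmware" ]; then
    apt install -y open-vm-tools
    systemctl enable --now open-vm-tools
else
    echo "Not a VMware guest (detected: $virt); skipping open-vm-tools"
fi
```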
Proxmox VE Installation and Repository Setup
Proxmox VE is based on Debian, so I began by ensuring the system and kernel were fully updated. The first step was to check if the enterprise repository was enabled:
```
cat /etc/apt/sources.list.d/pve-enterprise.list
```
Problem: This file pointed to the enterprise repo, causing a 401 Unauthorized error because I didn’t have a subscription.
Fix: I disabled the enterprise repo:
```
sed -i 's/^deb/#deb/' /etc/apt/sources.list.d/pve-enterprise.list
```
Then I added the free no-subscription repo:
```
echo "deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription" > /etc/apt/sources.list.d/pve-no-subscription.list
```
I also ran into the same 401 error with the Ceph repo:
```
sed -i 's/^deb/#deb/' /etc/apt/sources.list.d/ceph.list
```
Once the repos were sorted, I updated the system and upgraded everything:
```
apt update && apt full-upgrade -y
```
Finally, I rebooted to make sure the new kernel was loaded:
```
reboot
```
After the reboot, I checked the kernel version:
```
uname -r
```
and confirmed the latest Proxmox kernel was running (e.g., 6.5.x-pve). This resolved the update errors and left the system in a clean state, ready for the next steps.
Network Configuration
- vmbr0: Linked to ens192; handles management traffic (GUI, SSH, Corosync).
- vmbr1: Linked to bond0, a bond of ens224 and ens256, used for NFS traffic.
Bonding Setup:
```
auto bond0
iface bond0 inet manual
    bond-slaves ens224 ens256
    bond-mode active-backup
    bond-miimon 100

auto vmbr1
iface vmbr1 inet static
    address 192.168.10.17/24
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
```
Note: Same approach on proxmox02 and proxmox03, just change the IP.
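Before moving on, it's worth confirming the bond actually came up and which slave is active. These are standard Linux bonding-driver checks, not Proxmox-specific:

```shell
# Bonding driver state: mode, link health, and the currently active slave
grep -E 'Bonding Mode|MII Status|Currently Active Slave' /proc/net/bonding/bond0

# One-line overview of interfaces, the bond, and the storage bridge
ip -br link show
ip -br addr show vmbr1
```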
Network Bonding Modes Explained
We chose active-backup (mode 1) bonding because:
- Simplicity: Works without special switch configuration
- Reliability: Clear failover behavior
- Nested Environment Compatibility: Works reliably in VMware
Other modes we considered:
- balance-rr (0): Packet round-robin – risk of out-of-order packets
- 802.3ad (4): LACP – requires vSwitch configuration
- balance-xor (2): Not necessarily better for our use case
Troubleshooting Tip: When bonding interfaces in a nested environment, ensure VMware port groups have “Promiscuous Mode” set to “Accept”.
VMware Network Configuration
For our nested Proxmox nodes to work properly, we configured the VMware vSwitches as follows:
Management Network (vmbr0):
- Standard vSwitch
- MTU 1500 (standard)
- Promiscuous mode: Reject (default)
Storage Network (vmbr1):
- Distributed vSwitch recommended
- MTU 9000 (matching Proxmox config)
- Security settings:
- Promiscuous mode: Accept
- MAC address changes: Accept
- Forged transmits: Accept
Critical Note: The storage network requires relaxed security settings to allow proper bonding operation in the nested environment.
VMware vDS Configuration for Nested Setup
Since this is a nested environment, configure the VMware vDS:
- Go to Networking > vDS > Port Group > Edit Settings.
- Security:
- Promiscuous Mode: Accept
- MAC Address Changes: Accept
- Forged Transmits: Accept
- Teaming and Failover:
- Load Balancing: Route based on originating virtual port ID
- Network Failover Detection: Link status only
- Notify Switches: No
- Failback: Yes
Why? These settings enable bonded interfaces to function in a nested environment by allowing MAC address changes and ensuring that traffic flows through virtual NICs.
Best Practices
- Use Active-Backup bonding mode for simplicity and compatibility in nested setups.
- Separate management and storage traffic with distinct subnets.
- Verify connectivity with ping -I vmbr1 <NFS-IP>.
- Consideration: In nested setups, bonding is limited if both vNICs map to the same physical NIC in VMware. Spread vNICs across different uplinks if possible.
Cluster Creation
I created a three-node cluster for high availability (HA) and migration purposes.
CLI Steps
On proxmox01:
```
pvecm create proxcluster
```
On proxmox02 and proxmox03:
```
pvecm add 192.168.1.16
```
Verify cluster:
```
pvecm status
```
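Beyond pvecm status, a couple of extra checks I find useful (both are standard pvecm subcommands):

```shell
# List cluster members with their node IDs and votes
pvecm nodes

# Quick quorum check: a healthy cluster reports "Quorate: Yes"
pvecm status | grep -i quorate
```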
GUI Steps
- Go to Datacenter > Cluster.
- On proxmox01, click Create Cluster and name it proxcluster.
- On proxmox02 and proxmox03, click Join Cluster and enter proxmox01’s IP and credentials.
Initial Challenges
Problem: Nodes fail to join due to authentication errors.
Root Cause: We hadn’t copied SSH keys between nodes first.
Solution:
```
# From each joining node
ssh-copy-id root@proxmox01
```
Best Practices
Ensure all nodes have unique hostnames and that they resolve via /etc/hosts:
```
nano /etc/hosts
```
Add:
```
192.168.1.16 proxmox01
192.168.1.17 proxmox02
192.168.1.18 proxmox03
```
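To check all three entries at once, a short loop over the node names works; getent consults /etc/hosts the same way the cluster stack does:

```shell
# Verify each node name resolves locally (via /etc/hosts or DNS)
for node in proxmox01 proxmox02 proxmox03; do
    if getent hosts "$node" >/dev/null; then
        echo "$node resolves to $(getent hosts "$node" | awk '{print $1}')"
    else
        echo "WARN: $node does not resolve"
    fi
done
```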
Verify time synchronization:
```
timedatectl status
```
Configuring Shared NFS Storage
Why NFS?
We chose NFS for our shared storage because:
- Simplicity: Easy to set up and manage
- Compatibility: Works well with Proxmox’s features
- Performance: Good enough for our testing needs
- Snapshot Support: When backed by ZFS on the NAS
Shared storage enables VM migration and HA.
CLI Steps
Install the NFS client on each node:
```
apt install nfs-common -y
```
Verify the NFS export:
```
showmount -e 192.168.10.198
```
Error Encountered: The original NFS path /volume2/Proxmox NFS had a space, causing mount failures.
Fix: Renamed it to /volume2/Proxmox_NFS on the NFS server.
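Before adding the export in Proxmox, I like to mount it by hand once to rule out permission or path problems. A sketch using the NAS IP and export path from this setup:

```shell
# Temporary mount point for a one-off test
mkdir -p /mnt/nfstest

# Mount the export (NFS 4.1 if the NAS supports it; drop -o vers=4.1 otherwise)
mount -t nfs -o vers=4.1 192.168.10.198:/volume2/Proxmox_NFS /mnt/nfstest

# Confirm we can see it and write to it
df -h /mnt/nfstest
touch /mnt/nfstest/.write-test && rm /mnt/nfstest/.write-test

# Clean up
umount /mnt/nfstest
```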
GUI Steps
- Go to Datacenter > Storage > Add > NFS.
- Enter:
- ID: VMS-Storage
- Server: 192.168.10.198
- Export: /volume2/Proxmox_NFS
- Content: Disk image, ISO image, VZDump backup file
- Nodes: All
- Advanced: Default for Preallocation and NFS Version
- Click Add.
NFS Storage Options
Best Practices
- Avoid spaces in NFS paths (e.g., use Proxmox_NFS instead of Proxmox NFS).
- Test exports with showmount -e before adding them.
- Use NFS 4.1 for better performance if your NAS supports it.
- Monitor the NFS server to avoid bottlenecks.
Verify the new storage on each cluster node from the command line:
```
pvesm status | grep VMS-Storage
pvesm list VMS-Storage
mount | grep VMS-Storage
```
Creating a Test VM for Storage and Migration Tests
To ensure everything was working properly, I created a test virtual machine (VM) immediately after configuring the shared NFS storage. The primary goal was to confirm that the NFS-backed storage was functional and could hold VM disks. Beyond that, I also wanted to test how well live migration would work between the Proxmox nodes.
This test VM lets me simulate a real workload. I used it to:
- Verify that the storage was mounted and accessible by all cluster nodes.
- Test live migrations between nodes (proxmox01, proxmox02, proxmox03) to confirm the shared storage and network configuration were solid.
- Prepare to configure and test HA, which I do in the next steps.
After creating the VM, I ran migrations to and from each node. The tests measured the migration times and speeds, providing a baseline for later performance tuning. It was also an opportunity to observe how quickly the VM state was transferred across nodes and to ensure that no data was lost or corrupted during the process.
This is the VM summary board in Proxmox.
Testing Live Migration
After configuring the cluster and setting up shared NFS storage, we tested live migration to confirm everything worked as expected. We used VM 100 as our test VM.
Process
In the Proxmox GUI:
- Right-click the test VM (VM 100).
- Click Migrate.
- Select the target node (e.g., proxmox03).
- Confirm and start the migration.
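The same migration can be triggered from the CLI with qm, run on the node currently hosting the VM:

```shell
# Live-migrate VM 100 to proxmox03 (run on the source node)
qm migrate 100 proxmox03 --online

# Afterwards, confirm the VM is still running
qm status 100
```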
Migration Log
Here’s a sample log snippet showing what we saw:
```
2025-06-01 02:52:23 starting online/live migration on unix:/run/qemu-server/100.migrate
2025-06-01 02:52:23 set migration capabilities
2025-06-01 02:52:23 migration downtime limit: 100 ms
2025-06-01 02:52:23 migration cachesize: 512.0 MiB
2025-06-01 02:52:23 start migrate command to unix:/run/qemu-server/100.migrate
2025-06-01 02:52:23 migration started with RAM size 3.9 GiB, VM-state 138.2 MiB
...
2025-06-01 02:52:41 migration finished, total transferred 3.9 GiB
```
Highlights:
- Start time: 02:52:23
- VM memory: 3.9 GiB
- Peak transfer rates: Around 100–170 MiB/s
- Downtime limit: 100 ms
Rolling Back to Test Reverse Migration
To make sure migration worked both ways, we repeated the same test:
- Migrated the VM from proxmox01 to proxmox02.
- Migrated again from proxmox02 to proxmox03.
- Finally, migrated it back to proxmox01.
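Since qm migrate runs on the node currently hosting the VM, the round trip above can be scripted from one machine by hopping along with the VM over SSH. A sketch, assuming root SSH between nodes (which a Proxmox cluster sets up by default):

```shell
VMID=100

# Each hop runs qm migrate on the node that currently hosts the VM
ssh root@proxmox01 qm migrate "$VMID" proxmox02 --online
ssh root@proxmox02 qm migrate "$VMID" proxmox03 --online
ssh root@proxmox03 qm migrate "$VMID" proxmox01 --online
```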
Result: All migrations completed successfully without downtime, confirming that the cluster’s shared storage and network setup were working perfectly for live migrations.
This thorough testing ensured our HA setup was reliable and ready for real-world use.
High Availability (HA) Configuration
High Availability (HA) is essential for ensuring that virtual machines remain online, even if one of the Proxmox nodes fails. It automatically restarts VMs on healthy nodes, ensuring services continue to run without manual intervention. Before setting up HA, I used the test VM I created earlier to verify that live migration and failover between nodes would work properly.
CLI Steps
- Create an HA group:
```
ha-manager groupadd default --nodes proxmox01,proxmox02,proxmox03
```
- Add the test VM to the HA group (HA resources use the vm:<id> naming):
```
ha-manager add vm:100 --group default --state started
```
- Check the current HA status:
```
ha-manager status
```
Expected output:
```
quorum OK
master proxmox01 (active, ...)
lrm proxmox01 (active, ...)
lrm proxmox02 (idle, ...)
lrm proxmox03 (idle, ...)
service vm:100 (proxmox01, starting)
```
HA Configuration in GUI Steps
- In the Proxmox web interface, go to Datacenter > HA > Groups and create a new group named default, adding all three nodes.
- Then go to Datacenter > HA > Resources, click Add, and add the test VM to the default group. Set its state to started.
Testing Failover Scenarios
To test how HA behaves during failures, I simulated a node failure:
CLI Steps
- Shut down the primary node (proxmox01):
```
poweroff
```
- Check the logs on another node to monitor the failover process:
```
journalctl -u corosync
journalctl -u pve-ha-lrm
```
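While the failover is in progress, watching the HA manager from a surviving node shows the state transitions (fence, then restart) as they happen:

```shell
# Refresh the HA view every 2 seconds on a surviving node
watch -n 2 ha-manager status
```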
Failover Timeline
- Total failover time: About 3 minutes
- Cluster detection: Around 5 seconds
- VM restart: Roughly 3 seconds
One thing I noticed during these tests is that if you’re coming from a VMware environment, don’t expect Proxmox HA to behave the same way as vSphere or vCenter HA. It’s not as fast or as seamless; it takes a bit more time for Proxmox to detect a node failure and trigger the migration. This means you may experience more downtime for your VMs than you’re accustomed to. At least in my testing, I didn’t find a way to make it faster, although perhaps there’s a different configuration I missed, as someone new to Proxmox. Another thing to note is that if you’re connected to a node that goes down (like through the web GUI), you’ll also lose your session to the cluster. No virtual or management IP address remains active when a node is down, so you must reconnect manually to another node. It’s something to keep in mind when planning your setup.
Improving HA Performance
Although the initial failover was successful, I wanted to reduce the 3-minute delay. By adjusting Corosync’s timeout settings, I was able to speed up detection and reaction times.
CLI Steps
- Edit the Corosync configuration:
```
nano /etc/pve/corosync.conf
```
- Inside the totem section, I added:
```
## HA performance tuning
token: 1000
consensus: 8000
```
Here’s the final relevant part:
```
totem {
  cluster_name: ProxmoxCluster
  config_version: 3
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
  token: 1000
  consensus: 8000
}
```
- Apply the new settings:
```
systemctl restart corosync
```
Note: Edit these settings on a single node only, as the configuration is synced across the cluster; remember to increment config_version so the change propagates. Lowering these timeouts helps the cluster react more quickly to failures, but be cautious: if they’re too low, brief network hiccups can trigger false failovers.
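After the restart, you can confirm the values corosync is actually running with, rather than re-reading the file; corosync-cmapctl queries the live configuration database:

```shell
# Show the token and consensus timeouts in the running configuration
corosync-cmapctl | grep -E 'totem\.(token|consensus)'
```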
Summary of HA Best Practices
- Always test HA with a real VM to confirm everything works as expected.
- Monitor logs (
corosyncandpve-ha-lrm) to see how fast failover happens. - Balance performance and stability when tuning Corosync — start with moderate values and test carefully.
- Nested Environment Note: HA testing in nested environments helps identify quirks that may not be apparent on physical servers.
This approach provided me with a significantly faster failover time and increased confidence in the cluster’s stability during node outages.
Conclusion
Setting up this Proxmox VE cluster was a thorough and eye-opening experience. Along the way, I encountered some common pitfalls, including issues with repository configurations and networking. One of the bigger challenges was dealing with spaces in names – both for NFS exports and VM names – which Proxmox doesn’t handle well, causing some frustrating errors. Networking also required careful setup, especially in a nested environment, and HA required some extra work to run reliably.
In my opinion, Proxmox has made significant progress over the last few years, but it still lags behind VMware’s vSphere and vCenter in terms of large, enterprise environments. For small and medium-sized setups, Proxmox is a solid alternative; however, if you have critical workloads and require rock-solid high availability (HA), you may want to consider other options for now. Another important point is that organizations should recognize that community support alone is insufficient for production environments. It’s best to invest in professional support directly from Proxmox or a trusted partner.
Another area that Proxmox should improve is cluster management during failures. Currently, if the node you’re connected to goes offline, you lose access to the GUI and must reconnect manually to a different node. A management or virtual IP that stays up even when a node goes down would make the cluster easier to manage and more professional. This seems like something Proxmox could implement fairly easily, and it would be a great step towards making it a more enterprise-ready platform.
Overall, Proxmox has considerable potential and could become a strong competitor in the next few years, particularly with continued investment and community support. I’m looking forward to seeing how it evolves. Now, I’ll start working on my VMware migrations to Proxmox using Veeam, and in future posts, I’ll share step-by-step details of that process.
Share this article if you think it is worth sharing. If you have any questions or comments, comment here, or contact me on Twitter (yes, for me it is not X, but still Twitter).