Vsphere total: Recover a failed vCenter HA

The procedure to recover a failed vCenter HA applies when the Active, Passive and Witness nodes cannot communicate with each other, leaving the vCenter HA cluster non-functional.
Since the HA cluster cannot tolerate more than a single failure, service availability is impacted and you need to restore vCenter functionality to keep your infrastructure healthy.

vCenter HA shutdown sequence

If for any reason you need to reboot or shut down the vCenter HA cluster, you must follow a specific sequence to keep the current roles:

Passive node
Witness node
Active node

You can restart nodes in any order.

Recover a failed vCenter HA

One cause of HA cluster failure is node isolation: the nodes cannot communicate with each other, which affects vCenter availability.

Check for connectivity issues

To troubleshoot connectivity issues, access the Active node through the Direct Console, log in as root and enable the Bash shell. Run the ifconfig command to check network availability:

# ifconfig -a

In this example, Eth0 doesn't have an assigned IP address.
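
The same check can be scripted. The following is a minimal sketch, assuming the iproute2 ip tool is available on the appliance (it is used here instead of ifconfig because its one-line output is easier to parse). The function name iface_has_ipv4 is hypothetical, and it reads the command output from standard input so the parsing can be tried without touching a live node.

```shell
#!/bin/sh
# Sketch: report whether an interface has an IPv4 address.
# Reads the output of `ip -o -4 addr show` (one line per address) on stdin.
# iface_has_ipv4 is a hypothetical helper name, not a vCenter tool.
iface_has_ipv4() {
    # $1: interface name, for example eth0
    # In `ip -o` output, field 2 is the interface and field 3 is the family.
    awk -v ifc="$1" '$2 == ifc && $3 == "inet" { found = 1 } END { exit !found }'
}

# Example usage on the node (commented out so the sketch has no side effects):
# if ip -o -4 addr show | iface_has_ipv4 eth0; then
#     echo "eth0 has an IPv4 address"
# else
#     echo "eth0 has no IPv4 address"
# fi
```

The function returns success only when the named interface carries at least one IPv4 address, which matches what you look for by eye in the ifconfig output.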
Run the following command to check the operational status of the vCenter NICs:

# networkctl

In this example, the output shows that Eth0 is not operational.
Run the following command to get additional details about Eth0:
# networkctl status eth0

Since the Eth0 NIC is not functional, try restarting the network service:
# systemctl restart systemd-networkd

Check the Eth0 NIC State once again:
# networkctl status eth0

The Eth0 State is now routable. Reboot the node to verify whether the configuration persists.
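
To read the State value without scanning the full output, a small filter can help. This is only a sketch: it assumes the networkctl status output contains a line of the form "State: routable (configured)", and link_state is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: extract the operational state from `networkctl status <iface>` output.
# Assumes a line such as "State: routable (configured)" is present.
# link_state is a hypothetical helper name.
link_state() {
    # Print the first word after "State:" and stop at the first match.
    awk '$1 == "State:" { print $2; exit }'
}

# Example usage on the node (commented out so the sketch has no side effects):
# state=$(networkctl status eth0 | link_state)
# [ "$state" = "routable" ] || echo "eth0 state is '$state', expected routable"
```

Running the same filter before and after the reboot makes it easy to compare the two states.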
After the node has rebooted, check the status of the installed NICs:
# ifconfig -a

In this example, Eth0 is still misconfigured after the reboot, which keeps the node isolated.
If node connectivity is restored successfully, the isolated vCenter HA nodes rejoin the cluster automatically and the Active node starts serving client requests again. If the connectivity issue cannot be solved, you need to recover vCenter availability.
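
Before removing the cluster, it can be useful to confirm from the Active node which peers are actually unreachable. The following is a minimal sketch under stated assumptions: the peer addresses are placeholders for your own Passive and Witness HA addresses, and both PROBE_CMD and unreachable_peers are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Sketch: list the HA peers that do not answer a probe.
# PROBE_CMD is a hypothetical knob; by default it sends one ping
# and waits up to 2 seconds for a reply.
PROBE_CMD="${PROBE_CMD:-ping -c 1 -W 2}"

unreachable_peers() {
    # Arguments: peer addresses (placeholders, replace with your HA IPs)
    for peer in "$@"; do
        # $PROBE_CMD is left unquoted on purpose so its flags are split.
        if ! $PROBE_CMD "$peer" >/dev/null 2>&1; then
            echo "$peer"
        fi
    done
}

# Example usage (commented out; the addresses are documentation placeholders):
# unreachable_peers 192.0.2.11 192.0.2.12
```

An empty result means every listed peer answered, so the cluster should recover by itself. Any address printed points to a node that is still isolated.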

Remove the HA cluster configuration

If connectivity cannot be restored, the solution is to remove the HA cluster configuration to bring the Active node up and running again.
The first step is to power off and delete both the Passive and Witness nodes.

Log in as root to the Active node via the Direct Console and run the following command to remove the HA cluster configuration:
# destroy-vcha

If you get a warning message that stops the process, run the command again by appending the -f parameter:
# destroy-vcha -f

When the procedure has completed, reboot the node:
# reboot
After rebooting the Active node, check the network status:
# ifconfig -a

Eth0 has the correct IP address and the vCenter Server is back online.

Once vCenter availability has been restored, the vCenter HA cluster can be rebuilt.

Friday, May 22, 2020
