vSphere High Availability is one of the core features of the platform, and it is why many folks get into virtualization in the first place. However, if it is configured improperly you may end up with unexpected results. Also, if like me you adopt an environment that has been around since the 4.0 days, you need to be very mindful of the host isolation settings. Let's take a look at the HA options you have.
The host failure response is pretty straightforward. If vSphere HA determines that a host has failed, it will take one of two configurable actions. Most commonly you would select “Restart VMs” here. If for some reason you have this set to disabled, then no action will be taken and the VMs will simply go offline. How a host failure is determined is important, and I will discuss this in the next option.
The host isolation response is a bit trickier, as we need to fully understand what host isolation is and how vSphere HA determines that a host is isolated rather than failed. An isolated host still has access to its storage and its VM networking but has become isolated from vCenter and the other hosts in its cluster. In this scenario the VMs running on the host are fine. They are happily running, and taking any HA action would cause an outage when one may not be required. Oftentimes simply restarting the management agents on a host or replacing a network cable on the management NIC will resolve the issue. So, you ask, how does vSphere HA distinguish an isolated host from a failed host? Well, there are two methods it will use to determine the isolated/failed state.
- Management Network
- The first and most obvious method is that the vSphere HA agent has lost network connectivity to the other hosts in the cluster. If the host cannot reach the other hosts over the management network, it attempts to ping its “isolation address”. If it also cannot ping the isolation address, the host declares itself isolated and executes its isolation response. If it can still ping the isolation address, it does not consider itself isolated. A host that has actually failed cannot report anything, so that call is made by the rest of the cluster when the host stops responding, and the host failure response is executed.
- From the VMware KB:
- A host determines that it is isolated when it is unable to communicate with the agents running on the other hosts and it is unable to ping its isolation addresses.
- Datastore Heartbeats
- With only the management network as a monitoring option, you could experience HA failovers just by losing a network cable on the management address or losing the NIC or switch port. In my opinion this would be a bad HA response, because the VMs are likely still running fine. Like a good sysadmin, you have multiple connections for your data and storage networks, right?
- Enter datastore heartbeating. After you enable datastore heartbeating, your hosts write tiny bits of information to the datastores you designate. At this point a host is considered “isolated” if the management network is down but the other hosts can still see that it is updating its information on the datastore. For the host to be considered “failed”, it needs to both lose the management connection above AND stop updating its files on the datastore. In my opinion this is the best way to tell a failed host from an isolated one.
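The two checks above can be written down as a small decision table. This is only a sketch of the logic as described in this post and in the KB quote, not VMware's actual implementation; the names `HostState`, `host_self_check` and `cluster_view` are made up for illustration.

```python
from enum import Enum


class HostState(Enum):
    RUNNING = "running"
    ISOLATED = "isolated"
    FAILED = "failed"


def host_self_check(reaches_other_agents: bool, pings_isolation_address: bool) -> HostState:
    """The host's own view, following the KB wording quoted above.

    Hypothetical helper: a host calls itself isolated only when it can
    reach neither the other HA agents nor its isolation address.
    """
    if not reaches_other_agents and not pings_isolation_address:
        return HostState.ISOLATED
    return HostState.RUNNING


def cluster_view(management_heartbeat: bool, datastore_heartbeat: bool) -> HostState:
    """How the other hosts see a host once datastore heartbeating is enabled."""
    if management_heartbeat:
        return HostState.RUNNING
    if datastore_heartbeat:
        # Management network is down, but the host still updates its datastore files.
        return HostState.ISOLATED
    # No management heartbeat and no datastore updates.
    return HostState.FAILED
```

The point of the second function is the middle branch: without datastore heartbeats, a pulled management cable and a dead host would look the same to the rest of the cluster.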
IMPORTANT!! In the ESXi 4.0 days the default response to host isolation was to power off the VMs and restart them on a new host. Upgrading from a 4.0 host does not change this setting; it retains its old 4.0 default. If you now prefer to use datastore heartbeats alongside network monitoring and leave VMs powered on in an isolated state, then you must change this setting manually. If you installed ESXi at 5.0 or above, the default is to leave the VMs running.
Datastore with PDL
This is another pretty straightforward condition. If a host determines that it has permanently lost access to one of its datastores, it will execute the PDL response you choose. Permanent device loss is very similar to the next option, “all paths down”, except for one distinction: the host has determined, either from the error code it received or from repeated reconnection attempts, that the device loss is permanent rather than temporary.
Datastore with APD
As discussed above, this is the All Paths Down state. If the host has determined that it has lost all connectivity to one of its storage devices, but the device is not permanently disconnected, then the APD response is executed. The host will continue trying to re-establish the connections to the device. This often leads to a host that becomes unresponsive in vCenter, and possibly even at the console, due to the overhead created by all the reconnect requests. There is often no coming back from this situation unless you can fix it quickly on the storage or network side.
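To keep the two storage conditions straight, here is a minimal sketch of the distinction. It assumes the host already knows how many paths to the device are up and whether the loss has been judged permanent; `StorageCondition` and `classify_device` are hypothetical names, not part of any VMware API.

```python
from enum import Enum


class StorageCondition(Enum):
    ACCESSIBLE = "accessible"
    PDL = "permanent device loss"
    APD = "all paths down"


def classify_device(paths_up: int, loss_is_permanent: bool) -> StorageCondition:
    """Map what the host knows about a device to the condition HA responds to."""
    if paths_up > 0:
        return StorageCondition.ACCESSIBLE
    if loss_is_permanent:
        # The host has concluded the device is not coming back.
        return StorageCondition.PDL
    # Every path is down, but the host keeps retrying the connection.
    return StorageCondition.APD
```

Both PDL and APD start from the same symptom, zero working paths. The only difference is whether the host has given up on the device.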
VM Monitoring is the last option we have for vSphere HA. It is probably also the least used option. It monitors individual VMs based on heartbeats received from the VMware Tools installed inside the guest OS. In my opinion this is pretty risky. There are so many things that could go wrong inside the OS that would kill the VMware Tools heartbeats while the application the VM is intended to run is still fine. If you want to use this option, I would suggest you have active-active applications that can tolerate nodes going down for reboots more often, at least until you can sort out what is causing the heartbeat failures and stabilize the environment.
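The heartbeat check itself boils down to a timeout. The sketch below is an illustration under assumptions, not how vSphere HA is coded: `heartbeat_missed` is a made-up helper, and the 30 second interval in the usage lines is only an example value.

```python
def heartbeat_missed(last_heartbeat: float, now: float, failure_interval: float) -> bool:
    """Return True when no VMware Tools heartbeat arrived within the interval.

    All arguments are in seconds. Note that nothing here looks at the
    application inside the guest, which is exactly the risk described above.
    """
    return (now - last_heartbeat) > failure_interval


# Example: last heartbeat seen at t=100s, checked with a 30s interval.
still_ok = heartbeat_missed(100.0, 120.0, 30.0)      # 20s of silence
would_reset = heartbeat_missed(100.0, 140.0, 30.0)   # 40s of silence
```

A guest whose Tools service has crashed would trip this check even though the application is serving requests normally.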