A funny thing happened whilst configuring a new ESX 3.5 cluster. Part of my process was to enable HA on my hosts.
After checking the “Enable VMware HA” box and making a few changes to the default settings, my hosts began to configure HA.
All seemed to be going well until I was presented with this;
I have had this error on other systems before, and I remembered that it could be related to DNS issues. First things first, though: I tried disabling and re-enabling HA a few times in case there was a “glitch”, but this made no difference.
Here are a few of the other initial checks I made:
- I went through all of my DNS settings to make sure everything was set in lowercase.
- I checked connectivity between VC, Host1 and Host2 by pinging FQDNs and using vmkping; everything could reach everything else.
- I rebooted VC.
- I rebooted both hosts.
- I disconnected and reconnected both hosts.
- I removed the cluster, created a new one and added the hosts back in.
Nothing fixed my issue. 🙁
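The DNS part of those checks can be scripted. Here is a minimal Python sketch, assuming it runs on a workstation that uses the same DNS servers as VC and the hosts. The function name `check_hostnames` and the example FQDNs are hypothetical and not part of any VMware tool.

```python
import socket


def check_hostnames(fqdns, resolve=socket.gethostbyname):
    """Return a list of (fqdn, problem) tuples; an empty list means nothing was flagged."""
    problems = []
    for name in fqdns:
        # The checklist above wants every name in lowercase, so flag anything else.
        if name != name.lower():
            problems.append((name, "contains uppercase characters"))
        try:
            # Forward lookup only; reverse lookups are not checked in this sketch.
            resolve(name)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            problems.append((name, "does not resolve: %s" % exc))
    return problems


# Example use (hypothetical names, substitute your own VC and host FQDNs):
#   for name, problem in check_hostnames(["vc01.example.local", "esx01.example.local"]):
#       print("%s: %s" % (name, problem))
```

The resolver is passed in as a parameter so the check can be exercised without touching real DNS. In my case a script like this would have come back clean, which is exactly why I had to keep digging.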
After revisiting my settings several times, I noticed that the VMware HA box on the cluster’s Summary tab said this:
- Current Failover Capacity: 0 Hosts
- Configured Failover Capacity: 1 Host
Clearly this wasn’t correct, so I decided to look more closely at the HA agents installed on the hosts themselves. I started with the HA agent log (aam_config_util_addnode.log), which is found at the following location on the ESX host:
Whilst looking through the log I noticed that I wasn’t getting a ping response from my Default Gateway;
I knew my gateway was working, as my workstation was configured to use it and was functioning fine. However, this gateway is an interface on a firewall, and that interface had been configured to discard ping requests. Eureka moment! Was this the reason why the HA agent would not configure? Because it couldn’t get a ping response from the default gateway, it assumed the gateway didn’t exist and treated that as a misconfiguration error.
So let’s test this theory. I re-set the gateway for HA to our secondary default gateway, which DID allow ping requests. Lo and behold, after changing the Service Console gateway settings and re-enabling HA on the cluster, it all sprang into life. Both hosts were now HA enabled!
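With hindsight, the check that would have saved me the time is simply “does the Service Console default gateway answer ping?”. Below is a minimal Python sketch of that pre-check, assuming a Linux-style `ping` that accepts `-c`. The function names and the address `192.0.2.1` are illustrative only and are not anything VMware ships.

```python
import subprocess


def gateway_answers_ping(gateway, run=subprocess.call):
    """Return True if the gateway answers a single ICMP echo request."""
    # "-c 1" sends one echo request; ping exits with 0 only if a reply came back.
    return run(["ping", "-c", "1", gateway]) == 0


def ha_gateway_warning(gateway, run=subprocess.call):
    """Return a warning string if the gateway ignores ping, otherwise None."""
    if gateway_answers_ping(gateway, run):
        return None
    return ("%s did not answer ping; in my case this was enough to stop "
            "the HA agent from configuring" % gateway)


# Example use (192.0.2.1 is a documentation address, substitute your gateway):
#   warning = ha_gateway_warning("192.0.2.1")
#   if warning:
#       print(warning)
```

A silent gateway does not mean the gateway is down, as my firewall proved, so treat the warning as a prompt to check the ping policy on that interface rather than as a fault in itself.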
Simon is a Senior Systems Specialist at a major finance house. You can follow him on Twitter.