An error occurred during configuration of the HA Agent on the host

A funny thing happened whilst I was configuring a new ESX 3.5 cluster. Part of my process was to enable HA on my hosts.


After checking the “Enable VMware HA” box and making a few changes to the default settings, my hosts began to configure HA.

All seemed to be going well until I was presented with this error: “An error occurred during configuration of the HA Agent on the host”.

Now, I have had this error before on other systems, and I remembered that it could be down to DNS issues. First things first, though: I tried disabling and re-enabling HA a few times in case there was a “glitch”, but this made no difference.

Here are a few of the other initial checks I made:

  • I went through all of my DNS settings to make sure everything was set in lowercase.
  • I checked connectivity between VC, Host1 and Host2 by pinging FQDNs and using vmkping; everything could connect to everything else.
  • I rebooted VC.
  • I rebooted both hosts.
  • I disconnected and reconnected both hosts.
  • I removed the cluster, created a new one and added the hosts back in.

Nothing fixed my issue. 🙁

After revisiting my settings several times, I noticed on the Summary tab of the Cluster in the VMware HA box it said this:

  • Current Failover Capacity: 0 Hosts
  • Configured Failover Capacity: 1 Host

Clearly this wasn’t correct, so I decided to look more closely at the HA agents installed on the hosts themselves. I had a look at the HA agent log (aam_config_util_addnode.log), which is found at the following location on the ESX host:

cat /var/log/vmware/aam/aam_config_util_addnode.log
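The log can be long, so it may help to filter it for the lines of interest. Here is a minimal Python sketch that keeps only the lines mentioning a keyword such as "ping". The function name `lines_mentioning` is made up for illustration, nothing is assumed about the exact wording of the log entries, and it is meant to be run on a workstation against a copy of the log rather than on the service console itself.

```python
def lines_mentioning(text, keyword):
    """Return the lines of `text` that contain `keyword`, ignoring case."""
    needle = keyword.lower()
    return [line for line in text.splitlines() if needle in line.lower()]


# Usage against a local copy of the log (the file name here is just the one
# from the post, copied off the host):
#
#   with open("aam_config_util_addnode.log", errors="replace") as handle:
#       for line in lines_mentioning(handle.read(), "ping"):
#           print(line)
```

Matching on a plain substring is deliberately crude: it will also pick up unrelated lines that happen to contain the keyword, but it is enough to narrow down where to start reading.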

Whilst looking through the log I noticed that I wasn’t getting a ping response from my Default Gateway.

I knew my Gateway was working, as my workstation was configured to use it without any problems. However, this Gateway is an interface on a firewall, and that interface had been configured to discard ping requests. Eureka moment! Was this the reason why the HA agent would not configure? Because it couldn’t get a ping response from the Default Gateway, it assumed the Gateway didn’t exist and reported a misconfiguration error.

So let’s test this theory. I re-set the Gateway for HA to our secondary Default Gateway, which DID allow ping requests. Lo and behold, after changing the Service Console Gateway settings and re-enabling HA on the cluster, it all sprang into life: both hosts were now HA enabled!
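The catch here was that a gateway can route traffic perfectly well and still ignore ping, so "my workstation can get out" proves nothing about ICMP. Here is a minimal Python sketch of checking that one point before enabling HA. The function name `gateway_answers_ping` is made up for illustration, and it assumes a Linux-style `ping` that accepts `-c` (count) and `-W` (timeout in seconds); this is not how the HA agent itself performs its check.

```python
import subprocess


def gateway_answers_ping(address, run=subprocess.run):
    """Return True if a single ICMP echo request to `address` gets a reply.

    `run` defaults to subprocess.run and can be swapped for a stub, so the
    logic can be exercised without touching the network.
    Assumption: a Linux-style ping where -c is the packet count and -W is
    the reply timeout in seconds.
    """
    result = run(
        ["ping", "-c", "1", "-W", "2", address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # ping exits with status 0 only when at least one reply was received.
    return result.returncode == 0


# Usage (the address is a placeholder, substitute your own gateway):
#
#   if not gateway_answers_ping("10.0.0.1"):
#       print("Gateway does not answer ping; HA configuration may fail")
```

Run it from a machine on the same network as the Service Console. A firewall rule could in principle treat different source addresses differently, so a reply to your workstation is a hint rather than a guarantee for the hosts.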


Here is what the log shows now that the HA agent can see the Default Gateway.

Simon is a Senior Systems Specialist at a major finance house. You can follow him on Twitter.

11 thoughts on “An error occurred during configuration of the HA Agent on the host”

  1. It makes sense to me. The gateway is used as an isolation response address. It’s the address that’s being used to check if the host is isolated or not. One other solution that you could have used:
    Open up advanced options and add an extra isolation address.
    das.isolationaddress[x] = 10.0.0.1
    das.usedefaultisolationaddress = false

    The second option tells HA not to use the default gateway as an isolation response address. This way you can still have your standard network config and use HA.

  2. I might actually do that, Duncan. I think it might save confusion in the future when someone else looks at it, says “That’s not the correct Gateway” and changes it…

    Simon

  3. I would most definitely do it for exactly that reason. Standardizing your environment will help with troubleshooting and preventing downtime in the end in my opinion.

    Nice article by the way!

  4. Thanks for this solution! I’ve been looking all over the web for a solution to this issue (I couldn’t believe that HA was so difficult to set up!!).

    A quick question, what do you guys suggest to put as an isolation address? DNS? vCenter?

  5. Hi Clayton, glad someone found it useful 🙂 I used a second gateway that we have, but I’m guessing vCenter would be a good idea. Duncan might have a better idea.

    Simon

  6. Thanks, I was facing the exact same problem and got it fixed thanks to this post.
    I can imagine a lot of gateways don’t reply to pings, so why doesn’t VMware have this better documented?
    Anyway, glad to have it up and running

  7. Simon, you saved the day. I had read this earlier, but never experienced the problem until this morning, after I had to reinstall my VirtualCenter on a new server and reconfigure the HA agents on my hosts.

    The advanced settings Duncan provided helped me work around the problem until the ICMP reply issue with our gateway was resolved.

    Once again, I can haz HA! 🙂

  8. I have 2 default gateways on my network: one is an MPLS router, the other is a Checkpoint firewall. On the internal network (i.e., clients that don’t need to see the rest of the MPLS) I use the Checkpoint firewall… Unless you create a rule on the firewall, pings to it are disabled. Sure enough, putting a rule in allowing the ESX hosts to ping the firewall worked a treat. If I wanted to change the default gateway on the ESX hosts to the MPLS router instead, how would I go about doing this?

    Thanks for the guide though – saved a lot of troubleshooting & rebuilding ESX hosts!

    Sam
