For a single remote ESX 3.0.2 host, it was deemed time to upgrade to ESX 3.5 Update 5 today. The plan was to run the headless upgrade using the zip. Boy, did that work out a little differently than planned.
Why not go to vSphere Update 1, I can hear many people think. Well, this is an isolated production host that will not gain that much by going to vSphere right now. The most important bit was to upgrade fast and easily and keep uptime high. The host had been running flawlessly for over 2 years, so it was high time to get it back on track a bit. That will at least make it possible to install the latest OSes in a supported way. That was the main goal of this update, with improved security a good second.
So this morning I started the upgrade, pretty much by the book: I unzipped the upgrade from 3.0.2 to 3.5, put the host in maintenance mode and started the first upgrade. You can’t upgrade straight to 3.5 Update 5 using the headless mode, as 3.5 Update 3 is the latest version you can upgrade to directly. Anyway, I’m getting ahead of myself. So after unzipping the files, I ran the command:
esxupdate -r file:/var/updates/3.5.0-64607 -n update
and for a little while things went fine. Then, after installing the kernel-source-2.4.21-47.0.1.EL.64607.i386.rpm package, the upgrade process stalled for about 20 minutes, and that wasn’t quite normal. In hindsight, I now know I should have rebooted the host before starting the upgrade, so that there would be no funny leftovers from running fine for that long…
Anyway, I cancelled the upgrade with Ctrl+C and tried rebooting the host using various command line tricks. None of them worked, so I resorted to the “use a hammer” method and yanked the APC power using a remote reset option. Twenty minutes later, still no access to the host. Hurray, time to go to the data center and fix it locally.
As expected, the host didn’t boot; it crashed during boot. Well, I guess that is why you plan an update during the daytime: so you don’t have to travel to the data center in the middle of the night. So I inserted the ESX 3.5 Update 5 CD and ran the upgrade that way. That seemed to work fine.
After the update the host rebooted one more time and the host banner logon screen looked fine. However…
When I tried to connect to the host remotely, I could not access the console via SSH, nor any of the guests. Only by connecting my laptop directly to the server was I able to get in. What is wrong now? I noticed the NIC lights did not blink at all when the host was connected to the internet side of things. A quick examination told me that both ends worked fine on their own, just not when connected to each other. Hmmm…. scratching head here….
Then it finally dawned on me that the uplink was only 100 Mbit and not a 1 Gbit link, while the NICs were all 1 Gbit. So maybe going into the VI Client using the laptop and toggling the host NIC from 1 Gbit to 100 Mbit might fix it? Yes, that did fix the problem. My God, why did that change? It was not like I had it set to 1 Gbit before, as it worked fine before the upgrade.
So my main VMs quickly sprang to life again and life seemed rosy again. Yep, you read that right, seemed rosy 😀
Some VMs refused to run and threw errors like “The attempted operation cannot be performed in the current state (Powered Off)”, and the VI Client toolbar buttons displayed “No Privilege” alongside the normal labels. Say that again? I log in as root and I have insufficient rights to start the VM? Hummm… So much fun, so I restored a backup and tried to start that one, resulting in the exact same problem. Finally I looked at the vmware.log and there it was, staring me right in the face:
Dec 28 16:30:45.176: vmx| Msg_Post: Error
Dec 28 16:30:45.176: vmx| [msg.configrules.validate.failed.reject] Invalid value "/home/wil/CDs/Win2k3_Web_VLK.iso" for configuration key "ide0:0.fileName ". Value was rejected by rule "No System Files".
Dec 28 16:30:45.176: vmx| [msg.main.rulesfile.failedtopass] Virtual machine not configured according to rules specified in /etc/vmware/configrules.
Dec 28 16:30:45.176: vmx|
Dec 28 16:30:45.176: vmx| ----------------------------------------
Dec 28 16:30:45.214: vmx| Flushing VMX VMDB connections
Dec 28 16:30:45.227: vmx| IPC_exit: disconnecting all threads
Dec 28 16:30:45.227: vmx| VMX exit (-18).
Dec 28 16:30:45.227: vmx| VMX has left the building: -18.
OK, that makes sense, so I commented out the line that defined the ISO image in the VM’s configuration and tried booting the problem VM. Problem solved, and the host is back up to date again.
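If several VMs are hit by this, doing the check by hand gets old quickly. Below is a rough Python sketch of the same fix: find the configuration key that configrules rejected in the log text, then comment out the matching line in the VM's config text. The function names rejected_keys and comment_out are made up for this example, the regular expression is based only on the one log message shown above, and I am assuming a line starting with # is treated as a comment in the .vmx file. It only works on strings, so reading and writing the actual files (after taking a backup) is left to you.

```python
import re

# Pattern derived from the single rejection message in the log excerpt above;
# other ESX builds may word this message differently (assumption).
REJECT_RE = re.compile(
    r'\[msg\.configrules\.validate\.failed\.reject\]\s+'
    r'Invalid value "(?P<value>[^"]*)" for configuration key "(?P<key>[^"]*)"'
)


def rejected_keys(log_text):
    """Return (key, value) pairs that configrules rejected, in log order."""
    # The key is stripped because the log prints it with a trailing space.
    return [(m.group("key").strip(), m.group("value"))
            for m in REJECT_RE.finditer(log_text)]


def comment_out(vmx_text, keys):
    """Return vmx_text with every 'key = value' line for the given keys commented out."""
    wanted = {k.lower() for k in keys}
    out = []
    for line in vmx_text.splitlines():
        name = line.split("=", 1)[0].strip().lower()
        if "=" in line and name in wanted:
            out.append("# " + line)  # assumes '#' starts a comment in a .vmx file
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

Feed rejected_keys the contents of vmware.log, pass the key names it returns to comment_out together with the contents of the .vmx file, and review the result before writing it back.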