I thought it might be worth sharing the outcome of troubleshooting some ESXi 5 servers running off SD cards, in case anyone else encounters similar issues.
When vSphere 4 proved capable of booting ESXi from a 4GB SD card or USB stick, it became a tempting way to save money on storage in new deployments. The HP ProLiant G6s even came with an onboard SD card slot to make it easy to leave those disk slots empty.
However, ESXi 5 moved the goalposts: USB or SD storage is no longer considered suitable for persistent storage, so hosts without local storage now create a 32MB ramdisk and mount the scratch partition on it.
After upgrading to ESXi 5 we saw several instances of host instability as the ramdisk ran out of space. Symptoms included:
- Changes to VM settings failing
- vMotion to or from the server failing (so we couldn’t put the server into Maintenance Mode for a safe reboot)
- Local logs not updating
- Host showing as disconnected in vCenter
The VMs themselves continued to run on the host; we just could not manage the host properly.
As a short-term fix to bring the server under control again (provided the shell is enabled), we logged on to the shell, navigated to the /var/log folder, deleted any archived log files (ending .gz) and created a new, empty wtmp file (the file recording shell access). That recovered enough ramdisk space for the server to become manageable again.
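For illustration, the manual clean-up described above can be expressed as a short Python sketch. The function name free_log_space is hypothetical, and the sketch simply takes the log folder as a parameter; whether it suits the Python build shipped on a given ESXi host is an assumption to check, so treat it as a description of the steps rather than a supported tool.

```python
import glob
import os


def free_log_space(log_dir):
    """Delete archived *.gz logs in log_dir and reset wtmp to an empty file.

    Returns the number of bytes reclaimed.
    """
    reclaimed = 0
    # Archived (rotated) logs end in .gz; live logs are left alone.
    for path in glob.glob(os.path.join(log_dir, "*.gz")):
        reclaimed += os.path.getsize(path)
        os.remove(path)
    # Opening in "w" mode truncates wtmp to zero bytes, or creates it if missing.
    wtmp = os.path.join(log_dir, "wtmp")
    if os.path.exists(wtmp):
        reclaimed += os.path.getsize(wtmp)
    open(wtmp, "w").close()
    return reclaimed
```

On a host this would be called as free_log_space("/var/log"). It is worth trying against a copy of the folder first, since the deletion is not reversible.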
In our case the issue was caused by monitoring the server with PRTG, which gathers information by logging on to the shell over SSH every few minutes. Each logon writes to wtmp, so the file grows rapidly (see the PRTG link below for more detail). That said, we also managed to fill the ramdisk through normal logging after disabling PRTG monitoring.
The long-term fix for this is to repoint the scratch partition to persistent storage, as per KB article 1033696 (link below). We deployed a single 50GB LUN called “Logs” just for the purpose of storing log data for our 5 hosts, then changed the Advanced Settings parameter ScratchConfig.ConfiguredScratchLocation to /vmfs/volumes/Logs/.locker-servername and rebooted the server. You can browse the Logs datastore and check the timestamps on the contents of the locker folder to verify it’s working. Just make sure each server gets a unique folder.
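Since every host needs its own folder, it can help to generate the paths rather than type them. The following is a minimal Python sketch under that assumption; the function name scratch_locations and the host names esx01 and esx02 are made up for illustration, and the values still have to be entered under Advanced Settings on each host.

```python
def scratch_locations(datastore, hostnames):
    """Map each host to its own .locker folder on the shared datastore."""
    locations = {}
    for host in hostnames:
        path = "/vmfs/volumes/%s/.locker-%s" % (datastore, host)
        # Two hosts sharing one scratch folder is exactly what we want to avoid.
        if path in locations.values():
            raise ValueError("duplicate scratch folder: " + path)
        locations[host] = path
    return locations


# Hypothetical host names, for illustration only.
for host, path in sorted(scratch_locations("Logs", ["esx01", "esx02"]).items()):
    print("%s: ScratchConfig.ConfiguredScratchLocation = %s" % (host, path))
```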
We tested booting the ESXi 5 host without access to the datastore (by disabling the iSCSI switch ports) and proved that the host still comes up OK: it reverts to the ramdisk as described in the KB. If the scratch setting is lost, you need to configure the parameter again and bounce the server.
A related change we also made was to set up syslog forwarding so that logs are kept for a longer timeframe (full details via link below):
- We installed the VMware Syslog Collector application onto our vCenter server (available on the vCenter ISO setup menu)
- For each ESXi 5 server we changed the Advanced Settings > Syslog > Global parameters to forward logging to the vCenter server:
  - Syslog.global.defaultRotate = number of logs to keep (0-100)
  - Syslog.global.defaultSize = size of each log before rotate (0-10240KiB)
  - Syslog.global.logHost = IP address of vCenter or syslog server
The Syslog server creates a subfolder under ..\VMware Syslog Collector\Data\ for the Management IP of each host sending Syslog data.
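The three syslog parameters have fixed ranges, so a quick sanity check before applying them to several hosts can save a round trip. Below is a minimal Python sketch of such a check. The function names and the example address 192.0.2.10 are hypothetical, and the retention figure is only a rough upper bound per log file (rotations multiplied by size), not a documented formula.

```python
def syslog_settings(log_host, rotate, size_kib):
    """Validate the values against the ranges quoted above and return them
    keyed by the Advanced Settings parameter names."""
    if not log_host:
        raise ValueError("logHost must not be empty")
    if not 0 <= rotate <= 100:
        raise ValueError("defaultRotate must be between 0 and 100")
    if not 0 <= size_kib <= 10240:
        raise ValueError("defaultSize must be between 0 and 10240 KiB")
    return {
        "Syslog.global.defaultRotate": rotate,
        "Syslog.global.defaultSize": size_kib,
        "Syslog.global.logHost": log_host,
    }


def approx_retained_kib(rotate, size_kib):
    """Rough upper bound on data kept per log file: rotations x size."""
    return rotate * size_kib


# Example values only; 192.0.2.10 is a documentation address, not a real server.
settings = syslog_settings("192.0.2.10", 20, 10240)
for name in sorted(settings):
    print("%s = %s" % (name, settings[name]))
```

With those example values, approx_retained_kib(20, 10240) comes to 204800 KiB, roughly 200 MiB per log file, which gives a feel for how much space the chosen numbers imply.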
Ramdisk is full: http://communities.vmware.com/message/2026032
Investigating disk space: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1003564
Configuring Syslog on ESXi 5.0: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2003322
Configure persistent scratch location: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1033696
ESXi 5.0 monitoring with PRTG: http://www.paessler.com/knowledgebase/en/topic/32963-esxi-5-vsphere5-with-prtg