When RamDisks Get Full

I thought it might be worth sharing the outcome of some troubleshooting on ESXi 5 servers running off SD cards, in case anyone else encounters similar issues.

When vSphere 4 proved capable of booting ESXi from a 4GB SD card or USB stick, it became a tempting option for saving money on storage in new deployments.  The HP ProLiant G6s even came with an onboard SD card slot, making it easy to leave those disk slots empty.

However, ESXi 5 moved the goalposts: USB and SD devices are no longer suitable for persistent storage, so hosts without local disks now create a 32MB RamDisk to hold the scratch partition.

After upgrading to ESXi 5 we saw several instances of host instability as the RamDisk ran out of space, with symptoms including :-

  • Changes to VM settings failing
  • VMotion to or from the server failing (so we couldn’t put the server into Maintenance Mode for a safe reboot)
  • Local logs not updating
  • Host showing as disconnected within vCenter

The VMs themselves continued to run; we just couldn’t manage the host properly.

As a short-term fix to bring the server under control again – provided the Shell is enabled – we logged onto the shell, navigated to the /var/log folder, deleted any archived log files (ending in .gz) and created a new, empty wtmp file (the file that records shell access).  That recovered enough RamDisk space for the server to become manageable again.

In our case the issue was triggered by monitoring the server with PRTG, which gathers information by logging onto the shell over SSH every few minutes, each time writing to wtmp and rapidly growing the file (see the link below for more detail).  That said, we also managed to fill the RamDisk through normal logging after disabling PRTG monitoring.

The long-term fix is to repoint the scratch partition to persistent storage, as per KB article 1033696 (link below).   We deployed a single 50GB LUN called “Logs” just for storing log data for our 5 hosts, changed the Advanced Settings parameter ScratchConfig.ConfiguredScratchLocation to /vmfs/volumes/Logs/.locker-servername and rebooted each server.   You can browse the Logs datastore and check the timestamps on the contents of the locker folder to verify it’s working.  Just make sure each server gets a unique folder.
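
The same change can be scripted when there are several hosts to update.  Below is a minimal PowerCLI sketch – the vCenter name and the Logs datastore are placeholders from our setup, and it assumes a PowerCLI version with the Get-AdvancedSetting/Set-AdvancedSetting cmdlets.  The per-host .locker folders still need creating on the datastore first (the datastore browser is fine for that), and the new location only takes effect after a reboot.

    Connect-VIServer vcenter01   # placeholder vCenter server

    foreach ($vmhost in Get-VMHost) {
        # point each host's scratch at its own folder on the shared Logs datastore
        $scratch = "/vmfs/volumes/Logs/.locker-$($vmhost.Name)"
        Get-AdvancedSetting -Entity $vmhost -Name "ScratchConfig.ConfiguredScratchLocation" |
            Set-AdvancedSetting -Value $scratch -Confirm:$false
    }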

We tested booting an ESXi 5 host without access to the datastore (by disabling the iSCSI switch ports) and proved that the host still comes up OK – it reverts to the RamDisk as described in the KB.  If the setting is lost you need to configure the parameter again and bounce the server.

A related change we made was to set up syslogging so that logs are retained over a longer timeframe (full details via the link below) :-

  • We installed the VMware Syslog Collector application onto our vCenter server  (available on the vCenter ISO setup menu)
  • For each ESXi 5 host, change the Advanced Settings > Syslog > Global parameters to forward logging to the vCenter server (a scripted example follows below) :-
    • syslog.global.defaultRotate  =  number of logs to keep (0-100)
    • syslog.global.defaultSize = size of each log before rotate (0-10240KiB)
    • syslog.global.logHost = IP address of vCenter or syslog server

The Syslog server creates a subfolder under ..\VMware Syslog Collector\Data\ for the Management IP of each host sending Syslog data.
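
If you prefer to script those parameter changes, here is a rough PowerCLI equivalent – the collector address and rotation values are placeholders/examples, and again it assumes Get-AdvancedSetting/Set-AdvancedSetting are available:

    $logHost = "udp://192.168.1.10:514"   # placeholder - your vCenter/syslog collector

    foreach ($vmhost in Get-VMHost) {
        # example values: keep 20 logs of up to 2048KiB each, forward everything to the collector
        Get-AdvancedSetting -Entity $vmhost -Name "Syslog.global.defaultRotate" |
            Set-AdvancedSetting -Value 20 -Confirm:$false
        Get-AdvancedSetting -Entity $vmhost -Name "Syslog.global.defaultSize" |
            Set-AdvancedSetting -Value 2048 -Confirm:$false
        Get-AdvancedSetting -Entity $vmhost -Name "Syslog.global.logHost" |
            Set-AdvancedSetting -Value $logHost -Confirm:$false

        # ESXi 5 also needs the outbound syslog firewall rule enabling
        Get-VMHostFirewallException -VMHost $vmhost -Name "syslog" |
            Set-VMHostFirewallException -Enabled $true
    }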

 

Ramdisk is full:  http://communities.vmware.com/message/2026032

Investigating disk space :  http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1003564

Configuring Syslog on ESXi 5.0:  http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2003322

Configure persistent scratch location:  http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1033696

ESXi 5.0 monitoring with PRTG:  http://www.paessler.com/knowledgebase/en/topic/32963-esxi-5-vsphere5-with-prtg

 


Virtualisation 101 – Physical to Virtual (P2V) Migrations

Following on from last month’s article on deploying brand-new virtual machines, this month Simon moves on to how to migrate existing physical machines – or virtual machines running on earlier or competing virtualisation platforms – to vSphere.


Virtualisation 101 – VMotion

What Is It?

VMotion is arguably VMware’s “killer app” – the feature that gave VMware’s hypervisor a unique selling point over its competition.  It enables an ESX host to transfer a running virtual machine to a different ESX host without incurring downtime.

When a VMware administrator initiates a VMotion migration, the memory state of the chosen virtual machine is copied over a dedicated network link from the source host to the target host;  once the copy completes, the target host registers the guest machine, attaches its virtual NICs to its own vSwitch(es) and takes control of the guest.

The handover happens so smoothly that network connections are maintained, rendering the process invisible to users, who at worst see the server pause for a second or two.

This feature effectively separates the physical hardware from the operating system, resulting in major benefits to business :-

  • A running virtual machine is no longer dependent on a single piece of hardware, reducing the risk of service outages should a hardware failure occur.
  • There is no need to perform planned hardware maintenance outside of working hours, reducing costs and improving responsiveness.
  • Workloads can be juggled across servers to best utilise the resources available: if an ESX host gets busy, guests can be moved off to less busy hosts until balance is restored, improving efficiency.

What Do I Need to Deploy It?

Two ESXi/ESX hosts with compatible CPUs:    The target host must support the same processor features as the source host, otherwise the guest could issue an instruction the new host cannot understand and crash.   For example, you cannot migrate from an Intel server to an AMD server.  There are VMotion Compatibility Guides that group compatible server types together (see links below).

New in vSphere, Enhanced VMotion Compatibility (EVC) can be enabled on a cluster to improve compatibility by interrogating the hosts and calculating a CPU “mask” – a list of features supported by all hosts in the cluster.   As a last resort, and unsupported by VMware, a custom CPU mask can be defined manually (see the KB article linked below for more details).

VMware Licensing:    Both hosts must be licensed with either Essentials+, Advanced, Enterprise or Enterprise+.

vCenter Server:   Source and target hosts must be managed by the same vCenter server (or linked vCenters), as migrations are initiated from the vCenter server, either from the vSphere Client or with the Move-VM cmdlet in PowerShell.
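
As an example, a scripted VMotion with Move-VM looks something like this (the server names are placeholders):

    Connect-VIServer vcenter01   # placeholder vCenter server

    # live-migrate a running guest to another host in the same cluster
    Get-VM "WEBSERVER01" | Move-VM -Destination (Get-VMHost "esx02.mydomain.local")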

VMotion Network:    A vmkernel interface (with an IP address separate from the Service Console or Management interface) must exist on both hosts for the express purpose of VMotion comms.

Because of the time-critical nature of the migration it must be a fast link (bandwidth >622Mbps, latency <5ms round trip) so Gigabit is required.  For resilience the vSwitch should have two or more NICs from different physical switches. 

Also note that VMotion data is NOT encrypted — and therefore insecure — so it is recommended that a dedicated VLAN and IP range be allocated for the VMotion interface.

Finally, if the VMotion network traverses a firewall then TCP port 8000 needs to be opened.
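
Creating the vmkernel interface can also be scripted.  A PowerCLI sketch, with the host, vSwitch name and IP addressing as placeholders:

    $vmhost  = Get-VMHost "esx01.mydomain.local"   # placeholder host
    $vswitch = Get-VirtualSwitch -VMHost $vmhost -Name "vSwitch1"

    # add a vmkernel port on its own port group and enable it for VMotion
    New-VMHostNetworkAdapter -VMHost $vmhost -VirtualSwitch $vswitch -PortGroup "VMotion" `
        -IP 10.0.50.11 -SubnetMask 255.255.255.0 -VMotionEnabled $true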

Shared Storage:   The underlying files that make up the guest virtual machine must be accessible by both source and target host.  

Is VMotion Safe?

Generally VMotion is very safe, with any errors reported in the Task Pane and the migration aborted.  There are a few circumstances where VMotion isn’t possible :-

  • If the VM has a resource attached that is not available on the target – for example if a mapped CD ISO is stored on the source host’s local datastore.  (Storing ISOs on a shared LUN avoids this issue.)
  • If the VM has a physical SCSI controller attached, for example on virtual Microsoft Cluster nodes, or has a VMDirectPath device attached (which gives the guest direct access to a PCI device on the host).
  • Where the target host has insufficient resources to honour the guest’s requirements AND strict admission controls are in place for the cluster.

There are two priorities of VMotion available – High and Low.  This is about protecting the performance of the guest, not of the migration.  A High Priority migration reserves sufficient CPU cycles on the target host to satisfy the guest’s requirements, and aborts if it cannot;  a Low Priority migration will go ahead regardless of target host CPU utilisation.

As VMotion transfers the memory state of the guest to the target host, a guest with 64GB of RAM will take significantly longer to migrate than a guest with 1GB of RAM – as a rough guide, copying 64GB over a sustained Gigabit link takes upwards of eight minutes before any dirtied pages are re-sent.

It is possible to run 4 concurrent VMotions on vSphere 4.1 hosts with Gigabit networking – if you’re lucky enough to have deployed 10Gb networking you can run up to 8 VMotions at the same time.

How Does VMotion Tie In With Other Features?

Putting a host into Maintenance Mode initiates the VMotioning of all running guests off that host.
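
In PowerCLI this is a one-liner (the host name is a placeholder);  in a DRS cluster set to Fully Automated the running guests are VMotioned off before the task completes:

    Get-VMHost "esx01.mydomain.local" | Set-VMHost -State Maintenance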

DRS (Distributed Resource Scheduler) provides automated balancing of resources across all hosts in a cluster.   When enabled, vCenter analyses host utilisation every 5 minutes, and if a host is deemed significantly busier than the rest it will initiate VMotions of one or more guests off that host to lighten the load.

HA (High Availability) is a technology that monitors host availability, responding to failures to ensure a failed host’s virtual machines are brought online quickly on a working host.   It doesn’t use VMotion to achieve this.

So What Is Storage VMotion?

Storage VMotion is a separate feature for Enterprise or Enterprise+ hosts that provides the ability to move a running guest’s data files from one datastore to another.  This feature is fantastic for SAN migrations or maintenance work.   Migrations can also convert VMDK files between Thin and Thick formats.
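
A Storage VMotion can be driven from PowerCLI too – a sketch with placeholder names, using the -DiskStorageFormat parameter to convert the disks to Thin as they move:

    # move a running guest's files to another datastore, thin-provisioning the disks
    Get-VM "SQLSERVER01" |
        Move-VM -Datastore (Get-Datastore "NewSAN-LUN01") -DiskStorageFormat Thin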

One word of caution:  Storage VMotion of guests with RDM (Raw Disk Mapping) disks attached will by default convert the RDMs into VMDK files!   If RDMs are deployed, use the Advanced mode in the migration wizard.

Storage VMotion won’t work for guests with snapshots in place, or if any disks are non-persistent. 

One final tip:  If taking advantage of Storage VMotion it is worth checking whether your SAN is VAAI capable, in which case the hosts can offload some disk operations – such as these copies – to the SAN itself, improving performance.

Where Do I Go From Here?

Introduction to VMotion:  http://www.vmware.com/products/vmotion/features.html

Configuring VMotion Networking:  http://www.youtube.com/watch?v=VaGtMtYA6H0

VMotion Compatibility Guide for Intel processors:  http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1991

VMotion Compatibility Guide for AMD processors:  http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1992

Modifying the CPU mask:  http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1993

Virtualisation 101 – VMware Update Manager (VUM)

What Is It?

Update Manager is VMware’s patching product, used for updating ESX/ESXi hosts, virtual appliances and guest machines.  It is a companion product to vCenter and is installed via the vCenter Installer.  In smaller deployments VUM would be installed on the vCenter server, but in larger environments it can run on a dedicated server.

The application can run scheduled downloads of patches from VMware and Shavlik (for Microsoft updates) and store them in a local repository.   Patches can also be imported from ZIP files, or via an intermediary machine running the Update Manager Download Service (UMDS).

Management of VUM is done via a vSphere Client plugin, which adds an extra icon in the Solutions section, and an Update Manager tab to each vSphere object.   Patching is done in 3 steps :-

  1. Baselines are created to set the types of patches to apply, and are then attached to objects within the hierarchy: datacentre, cluster, resource pool, folder, vApp, or an individual host or guest.  Multiple baselines can also be aggregated into baseline groups.
  2. A scan of the attached object compares the ESX/ESXi hosts, Windows or RedHat Linux guests against the baseline profile to report which are compliant and which have patches that need to be applied.
  3. Finally you remediate the ESX/ESXi host or Windows guests to apply the missing patches.  VUM automatically places hosts into Maintenance Mode (providing DRS is set to Automatic) before patching begins to avoid any outages, and can snapshot virtual machines before patching, retaining the snapshots for a predetermined period afterwards as a handy rollback option.
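
The same three steps can be driven from PowerCLI if the Update Manager cmdlets are installed – a rough sketch, with the cluster, baseline and host names as placeholders:

    # 1. attach a patch baseline to a cluster
    $cluster  = Get-Cluster "Production"
    $baseline = Get-Baseline | Where-Object { $_.Name -like "Critical Host Patches*" }
    Attach-Baseline -Baseline $baseline -Entity $cluster

    # 2. scan the cluster for compliance against its attached baselines
    Scan-Inventory -Entity $cluster

    # 3. remediate a single host (VUM handles Maintenance Mode if DRS is Automatic)
    Remediate-Inventory -Entity (Get-VMHost "esx01.mydomain.local") -Baseline $baseline -Confirm:$false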

VUM installs a Guest Agent on Windows or RedHat virtual machines at the first scan or remediation to facilitate patching.   Please note that — as of v4.1 — VUM cannot scan non-RedHat distributions of Linux, nor can it remediate any Linux guests with OS patches, only applying updates for VM Hardware and VMTools.  

Windows guests from XP or above can have OS patches installed — even up to full Service Packs — while online or offline.  VUM can also upgrade their VM Hardware level and VMTools.

What Do I Need to Deploy It?

The good news is that you don’t need a great deal to deploy the latest version of VUM :-

  • A 64-bit virtual or physical Windows server (XP, 2003 or 2008) with at least 2GB of RAM if dedicated to VUM – if running alongside vCenter you’ll want at least 4GB of RAM.
  • A database for holding the application metadata – this could be the bundled local SQL Server 2005 Express database for smaller implementations, or a local or remote MS SQL Server 2005/2008 or Oracle 10g/11g database.   The database size can be estimated with the Update Manager Sizing Estimator (see link below).
  • Disk space for the patch repository.  The amount of space required will vary depending on what is to be patched.  Again the sizing estimator can gauge disk space requirements, but as an indication the installer will warn if less than 20GB is free on the chosen volume.
  • Network connectivity and firewall access for the application to communicate properly with the vCenter server, database server, ESX/ESXi hosts and the Internet.  For full details of the ports used see KB article 1004543 (link below).

Should I Use VUM Instead Of Microsoft WSUS?

In theory VUM’s patching technology is sound – Shavlik has been around since 1993 – and does the job well.  However, many companies choose to stick with WSUS as the infrastructure is already in place and works, or because Windows admins feel safer staying “in-house”.  

OS patching aside, VUM is still useful for upgrading VM Hardware and VMTools on Windows guests.

How Much Does It Cost?

VUM is covered by any level of vSphere licensing, so the only implementation costs apart from time and effort are potentially OS or database licenses. 

Where Do I Go From Here?

VUM Documentation Page:  http://www.vmware.com/support/pubs/vum_pubs.html

The official Install & Admin Guide:  http://www.vmware.com/pdf/vsp_vum_41_admin_guide.pdf

VUM Sizing Estimator:  www.vmware.com/support/…/doc/vsp_vum_40_sizing_estimator.xls

Network Ports Required:   http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1004543

Video of VUM in action:   http://www.youtube.com/watch?v=PF3mo3Z3mI4

Virtualisation 102 – SRM

This article builds on the previous 101 overview of VMware’s Site Recovery Manager and covers the deployment process at a high level.  For detailed instructions on how to roll out SRM please refer to the documentation linked to in the previous article.

Preparation Phase

  • DR projects are notorious for soaking up engineer resource, requiring multiple tests and lots of beard-stroking to achieve a workable solution, so it’s vital to have a business sponsor to drive the project to completion.  Any half-hearted attempt at building a DR platform will fail.
  • Define the scope of what services – and therefore servers — need protecting, and be wary of scope creep.  Be certain the server list is correct.  (Don’t be surprised when testing proves that a long-forgotten legacy server is vital to the operation of your intranet!) 
  • Get quantifiable goals that the DR platform must meet, for example Recovery Time Objectives (RTOs). 
  • Set expectations of what SRM can do, especially the fact it can only protect virtual servers.  This could be a good motivator to P2V more servers, or require physical servers to have DR protection added as a pre-requisite, such as database log shipping. 
  • Ensure that VMTools is installed and running on all virtual servers to be protected, as SRM uses it for graceful shutdowns during failover.
  • Configure networking at the Recovery site; the options are :-
    • Stretched VLANs – the easiest for server admins as current IP addresses can be used.
    • NAT – retain existing IP addresses for servers and translate addresses via a firewall out to the rest of the network.
    • IP Customisation – SRM has the ability to alter IPs during failover.
  • Consider how Internet access will be cut over – do your web/mail servers need their public DNS addresses changing?  How will users access services from home or from the DR site?
  • Unless protecting the entire virtual estate the LUNs need to be split into two camps — “Protected” and “Unprotected”  – with all in-scope virtual servers moved onto Protected LUNs, the rest hosted on Unprotected.  This is because SRM fails over on a per-LUN basis.  Storage VMotion is great for performing storage realignment without downtime.
  • Deploy ESX servers at the Recovery site, with enough total capacity to run all of the protected servers at peak demand (e.g. over month-end).   Ask the storage admins to zone these servers to the Recovery SAN, along with at least one LUN for the cluster.
  • Deploy Domain Controller(s) & vCenter server at the Recovery site to manage the local cluster.  For medium-large sized deployments use a dedicated SQL server for the vCenter and SRM databases.
  • Arrange with the storage admins to have the Protected LUNs replicated over to the Recovery Site and presented to the DR ESX cluster.  (Don’t add these LUNs to the cluster yet, this is done by SRM during failover.)  Also ensure that Protected LUNs have 20% snapshot space available within their disk groups. 
  • While you’re annoying the storage admins, ask them for admin logins for both SANs, plus their management IPs.

Build Phase

  • Deploy the SRM database:  A database needs to be created manually at each site before the SRM installation is performed – step by step instructions for MS SQL Server, Oracle & IBM DB2 are provided in the Administration Guide.
  • Deploy the SRM server:  At each site build a new Windows virtual server, configure a static IP address and join it to AD.  Create an ODBC connector to the SRM database.  Download the latest version of SRM and follow the installation wizard.  Although the SRM servers are paired up during configuration, during the install you point each one at its local vCenter server.
  • Download and install the vendor-specific Storage Replication Adapter application on the SRM servers.
  • Install the vSphere Plugin for SRM on workstations used to manage the virtual infrastructures, including the vCenter servers themselves.  This adds an SRM icon to the Solutions & Applications section of the Home screen.
  • Connect the sites:   In the SRM plugin, supply details of the other site’s vCenter server.  Repeat at the other site.  Once the sites are connected you will be prompted for login details of the paired vCenter server whenever you enter the SRM plugin.
  • Connect to the Storage:   Supply details of the SAN’s management IPs and login credentials, to allow SRM to gather storage information and issue failover commands.
  • Inventory Mappings:   The next step is about matching up resources (clusters, folders, resource pools & network port groups) between Protected and Recovery sites, so that SRM knows how to assign resources at the recovery site. 
  • Create Protection Groups:   Protection groups are just that, groups of virtual servers to be protected within a Recovery Plan.  On the Protected site’s SRM server, create a group for each LUN.  This will create a placeholder object for each protected VM at the Recovery Site.
  • Create Recovery Plans:   Once created, the protection groups – and the status of the virtual machines within them – are visible in the Recovery Site’s SRM configuration.  Create a Recovery Plan and select one or more of these groups to include within the plan.  You can customise the Plan to suit your requirements, for example setting the shutdown and startup order of the servers – e.g. bring the database servers down last and up first – defining application timeouts and mapping networking for Test failovers.

Testing Phase

  • The first test is simply to verify that failover will actually work.  This can be done easily and without disrupting production services using the Test Failover option within SRM.  What this does is attach the Recovery Site LUNs to the ESX cluster and power up the VMs on a vSwitch that has no physical NICs attached, so they are completely isolated from the real world.   DR server health can be checked via the console, and when done the Test Failover will return everything to its original state.
  • Multi-Tiered Network Test Failover:  When configuring the Recovery Plan you can map each server’s vNIC to any port group on your Recovery Site cluster, so it’s possible to create your own isolated vSwitch as a replica of your production network, using port groups and virtual appliance firewalls such as Smoothwall, to simulate your various segments.
  • Once you are confident that the failover works you need to schedule a Full failover.  This involves a lengthy outage of production services and will therefore be high profile and carry some degree of risk, so test plans should be clearly documented and approved by management via your change control process (trans: cover your butt!).
  • An important point relating to Full failover is that there is no failback option.  To return to the original working configuration you will need to repeat the failover process in the opposite direction.  SAN replication must be set up in the opposite direction – which could take a long time to complete – and fresh Protection Groups & Recovery Plans created before failover can be performed back to the Production environment, incurring a second outage.  This greatly increases the time and effort required to perform the whole test, so build in plenty of contingency time for testing and troubleshooting – there’s nothing worse than a ticking clock when performing a high-pressure piece of work.   Unfortunately, without performing a full test in each direction you cannot be certain that it will work should it be required.
  • Report back to the sponsor on how the failover went, with proof that applications all worked correctly, and get written signoff to confirm that the solution meets the scope and goals of the project.

Other Considerations

  • SRM only provides the technical means to fail over the critical virtual machines – to create a credible Business Continuity Plan it needs to be wrapped up in a process to determine such things as who has the authority to order a failover;  circumstances in which failover is required;  who can perform the failover, contact details, documentation library, etc.
  • An enterprise IT infrastructure will probably involve physical servers that provide business-critical services – Oracle databases, mainframes, firewalls to name but a few – that will impact on BAU planning.  SRM setup is likely to be one strand of many.
  • Bringing virtual machines online at the DR site is one thing, getting multi-tiered applications to work is quite another!  There may be lots of tinkering required to get things working – DNS or hosts file changes, ODBC connections, certificate errors and the like – particularly if required to run with different IP addresses.  Ensure these post-failover tasks get added to the failover documentation.
  • There should be thought given to how to run the Recovery site as Live over a prolonged period, involving backup & restore, monitoring & management.   It will greatly help in a disaster to replicate your IT ops data out to a share at the DR site.
  • It is important that the DR platform is not simply ignored after signoff.  It is integral to the IT infrastructure and should be integrated into the Change process, so that changes to protected services, new services deemed in-scope for DR protection, or even SAN or SRM upgrades are managed in a manner that retains the integrity of the platform.  There also needs to be regular testing scheduled, even if only once a year, to validate the platform and the quality of documentation, and to remind engineers of how the environment works.

Virtualisation 101 – SRM

This is the first in a short series of high level articles intended to explain how some of VMware’s products and features work, to help anyone that needs to get up to speed quickly on a particular topic.

What Is It?

Site Recovery Manager is a VMware product that enables protection of a virtual infrastructure at a site level.  Once configured, SRM can orchestrate recovery from a failure of the primary site, bringing virtual servers online quickly, and in a predetermined order, at the recovery site.

What Do I Need to Deploy It?

  • A compatible SAN – The SRM application uses a vendor-specific Storage Replication Adapter (SRA) to talk to the SAN.  Check the SRM Storage Compatibility Matrix to ensure your SAN is on the list (link below).
  • Two SANs  –  Replication of data is done by the SANs, not by SRM, so the Recovery Site requires a SAN of the same type as the Protected Site SAN, with sufficient spare capacity to host a mirror copy of the Protected Site’s data.  You can decide at a per-LUN level whether or not to replicate, but remember you will need the extra storage.
  • SAN Replication  –  Terminology will vary between vendors, but basically a Master/Slave setup is required between the SANs, with all changes on the Protected SAN copied over — as close to realtime as possible – to the Recovery SAN.  This means you’ll need a fast link between sites and to pay for SAN licensing to enable mirroring between SANs.
  • Good relations with your SAN engineers  –  the SRA requires an admin login to the SAN in order to interrogate it and initiate failovers.
  • Recovery Site resources  –  there need to be ESX or ESXi servers at the Recovery Site with enough resources to run your protected servers.  The Recovery Site also requires its own vCenter and SRM server, linked to those at the Protected Site, along with any supporting servers such as Active Directory and DNS.

How Much Does It Cost?

The core SRM product is licensed in units covering 25 managed servers, with a list price of US $11,250 per 25 VMs.

Other things to consider when budgeting for SRM:

  • Remember the prerequisites such as SAN replication licensing;
  • Extra storage;
  • Adequate WAN links for SAN replication;
  • Provisioning ESX resources at the Recovery Site;
  • The requirement for two vCenter servers and two SRM servers, plus their Windows licenses;
  • Larger-scale deployments should also consider hosting the SRM database externally.

Where Do I Go From Here?

SRM Storage Compatibility Matrix:  http://www.vmware.com/pdf/srm_storage_partners.pdf

VMware SRM Docs page:  http://www.vmware.com/support/pubs/srm_pubs.html

Administering SRM:  http://www.lulu.com/product/file-download/administering-vmware-site-recovery-manager-40/6522403 – Mike Laverick’s eBook is essential reading before deploying SRM; it costs less than £7, all proceeds go to Unicef, and you can get a hard copy too.

VMware Communities page for SRM:  http://communities.vmware.com/community/vmtn/mgmt/srm – seriously folks, I cannot overstate how useful this resource is in your first steps to getting to grips with the product.  It should also be your first stop when troubleshooting.

The next step in this series is a 102 post on how to set up a working SRM environment.