Mar 31

On MPIO and In-Guest iSCSI

I recently had a “stimulating discussion”* with Tom about MPIO initiated from a virtualized guest OS to a new SAN I am installing, and the pros and cons thereof. This is one of those situations where the constraints of the technology make sense in a physical situation, but look a bit odd virtualized, so I wanted to ensure that I had the best possible solution for my needs, with the least extra effort, and importantly, some logic to back up the decision.

What is MPIO?

MPIO stands for MultiPath Input Output. It could be tagged as “For when one cable simply isn’t enough!”. This is the storage version of what at the network level we call “Link Aggregation”. The simple idea is to take multiple connections (be they Fibre Channel, SAS or iSCSI) between the SAN and whatever is consuming it, and use them all. This can give you resiliency (failover, failback) or increased bandwidth through Round Robin use of paths, or other options depending on what is available in the implementation.
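To make the two behaviours concrete, here is a toy sketch of a failover policy and a Round Robin policy over a set of paths. It is not how any particular MPIO driver is implemented; the `PathSelector` class and the path names are made up purely for illustration.

```python
from itertools import cycle


class PathSelector:
    """Toy model of two common MPIO policies. Class and path names are hypothetical."""

    def __init__(self, paths):
        self.paths = list(paths)          # priority order for failover
        self.healthy = set(self.paths)    # paths currently believed to be up
        self._rr = cycle(self.paths)      # endless rotation for Round Robin

    def mark_failed(self, path):
        self.healthy.discard(path)

    def mark_restored(self, path):
        if path in self.paths:
            self.healthy.add(path)

    def failover(self):
        """Failover policy: always use the first healthy path in priority order."""
        for path in self.paths:
            if path in self.healthy:
                return path
        raise RuntimeError("no healthy paths")

    def round_robin(self):
        """Round Robin policy: rotate over the paths, skipping failed ones."""
        if not self.healthy:
            raise RuntimeError("no healthy paths")
        while True:
            path = next(self._rr)
            if path in self.healthy:
                return path


selector = PathSelector(["path-A", "path-B", "path-C", "path-D"])
print([selector.round_robin() for _ in range(5)])
selector.mark_failed("path-A")
print(selector.failover())
```

With failover, a restored path is picked up again simply because it comes first in priority order, which is the failback behaviour. Real implementations add health checks and weighting that this sketch leaves out.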
My use case is iSCSI over standard Ethernet connections, so that is what I’m going to focus on.
The first thing to work out is what a “Path” is, in order to work out how to get multiple paths. At the IP level a path is defined by two IP addresses, a source and a destination. The network underlying that is irrelevant: it could be 100Mb/s Fast Ethernet (shudder) or 10 or 100Gb/s intercontinental Fibre. It could consist of 1 physical cable, multiple cables connected by switches, multiple cables between multiple switches etc etc. It really doesn’t matter. What matters is that the storage presents multiple IPs, in order to get multiple paths. Obviously, the same applies to the consumer. In order to get the full benefit of Multipathing, it must have multiple paths, so multiple IPs.
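As a small illustration of paths being source/destination pairs, the sketch below lists every candidate path between a consumer and a SAN. The addresses are invented, and it assumes every source IP can reach every destination IP; a real deployment may restrict which pairs are actually used.

```python
from itertools import product

# Hypothetical addresses on a storage subnet; not taken from any real setup.
initiator_ips = ["10.0.50.11", "10.0.50.12"]
target_ips = ["10.0.50.201", "10.0.50.202", "10.0.50.203", "10.0.50.204"]


def enumerate_paths(sources, destinations):
    """Return every (source IP, destination IP) pair, i.e. every candidate path."""
    return list(product(sources, destinations))


paths = enumerate_paths(initiator_ips, target_ips)
print(len(paths))  # 2 source IPs x 4 destination IPs = 8 candidate paths
for src, dst in paths:
    print(f"{src} -> {dst}")
```

Give the consumer only one IP and the count drops to the number of IPs on the SAN side, which is why both ends need multiple addresses.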
Physical MPIO
So, you have a SAN with 4 IPs and 4 NICs dedicated to block traffic. You have one or more Physical Servers, which have 4 free NICs for iSCSI traffic. You now have two choices.
You could aggregate the links from the SAN to the Switch, and from the Servers to the Switch, using LAGs. This would be dependent on your Switches supporting that (probably true) and on your server OS supporting that (true from Windows Server 2012 upwards, dependent on NIC drivers before that, but probably true).
Or you could leave the connections simple and use MPIO to distribute traffic over the links.
Do you want the Servers/SAN to handle the paths, or do you want the switch to do it?
There is a reasonable chance that MPIO will give better performance than a LAG; after all, if the SAN software can control the path end to end, it has more options and more information to order the flows with than a switch would. That said, I’ve done no testing, and would recommend that anyone who needs to make this decision does that testing.
Physical Host
Host Level MPIO
Host Level MPIO works in exactly the same way as the Physical case outlined above, with the exception that once LUNs are mapped, they are formatted with VMFS and so passed up to the guests. This implies that there will be many guests using each LUN, and so only a few connections.
Guest Level MPIO
In this situation, the Physical Host has no idea about the SAN whatsoever. iSCSI traffic is just more VM traffic going out over vSwitches and into the network. It is arguable whether that is a sensible plan in the first place, but let’s assume that for whatever reason, you want to do in-guest I/O.
In our situation we would need to give each VM multiple vNICs to gain the advantage of MPIO: at least two vNICs, but preferably four. We have a moderately small farm of 200 VMs, so at four apiece that’s 800 vNICs, which all need IPs on the storage network!
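The arithmetic is simple enough to sketch. The helper below is hypothetical and just multiplies the 200 VMs from our farm by the number of vNICs per VM, counting one storage-network IP per vNIC.

```python
def storage_vnics(vm_count, vnics_per_vm):
    """Total vNICs (and therefore storage-network IPs) needed for in-guest iSCSI."""
    return vm_count * vnics_per_vm


# Compare one, two and four vNICs per VM across a 200-VM farm.
for per_vm in (1, 2, 4):
    print(f"{per_vm} vNIC(s) per VM -> {storage_vnics(200, per_vm)} IPs")
```

Even the minimum of two vNICs per VM means 400 addresses to allocate and track.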
Let’s look again at what MPIO gives us: failover and more bandwidth.
More Bandwidth: In both the LAG and the MPIO situations, a single “conversation” can only use the bandwidth of one path, not the aggregate bandwidth, so it will be limited to, in our case, 1Gb/s despite the four NICs. If we are using multiple LUNs per VM (we are) then we have the possibility to utilise all the bandwidth, but that is true of both LAG and MPIO.
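A toy model of that limit is sketched below. It assumes each conversation is pinned to exactly one link and that conversations sharing a link split it equally. Both are simplifications, and the link assignments are invented rather than produced by any real hashing or MPIO policy.

```python
from collections import Counter

LINK_GBPS = 1.0  # per-link speed quoted in the post


def per_flow_ceiling(flow_links, link_gbps=LINK_GBPS):
    """Given the link each flow is pinned to, return each flow's best-case Gb/s.

    Flows pinned to the same link share it equally in this toy model.
    """
    load = Counter(flow_links)
    return [link_gbps / load[link] for link in flow_links]


def aggregate(flow_links, link_gbps=LINK_GBPS):
    """Best-case total throughput: one link's worth for every link in use."""
    return link_gbps * len(set(flow_links))


one_lun = [0]              # a single conversation lands on one link
four_luns = [0, 1, 2, 3]   # four conversations spread perfectly over four links
unlucky = [0, 0, 1, 1]     # four conversations that only land on two links

for name, flows in [("one LUN", one_lun), ("four LUNs", four_luns), ("unlucky", unlucky)]:
    print(name, per_flow_ceiling(flows), aggregate(flows))
```

However many links exist, the single conversation never exceeds one link's speed; only multiple conversations can approach the aggregate, and only if they happen to land on different links.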
Resiliency: In both MPIO and LAG we get resiliency from having multiple connections. Again there is no clear winner.
The real test though is to consider the VM as a whole. In the physical server situation it is assumed that non-SAN traffic is being aggregated in some way on “other” NICs. With our VM this is potentially true also, but in all likelihood that resiliency is being granted at the vSwitch level, by a LAG. If we already have a LAG in place, will the MPIO give us any advantages?
As stated above, there is a chance that MPIO will give better utilisation, but this is less likely with “in guest”, as there are many more “conversations” than in the Physical or Host situation.
We use in-guest iSCSI for a few reasons that mainly relate to recovery time objectives and the ability to snapshot subsets of our data. My conversation with Tom centered around the need to add, in our case, 8 vNICs to each VM in our farm, which seemed excessive (bear in mind this is as well as any NLB, DAG or Clustering vNICs). No single VM is using 1Gb/s; in fact the Physical Hosts only have two 1Gb/s NICs connected to the SAN switch anyway.
We could put two NICs in each VM, and then track which VMs see which Paths, but that seems like an awful lot of administrative effort. It *would* break at some point.
We could put a single NIC in each VM, and leave the resiliency to the Host. Again we’d have to track which VMs saw which Paths.
Or we could just utilise the LAG we have in place, and not confuse matters with MPIO. That is the route we have ultimately chosen.
* It wasn’t quite an argument, more a challenge to Tom to defend a position while I attacked it. Obviously, I lost. I always do 😉

1 comment

1 ping

  1. Darren DeHaven

    It seems the main thing is host routing (port ID versus IP hash). When using IP hash, be sure that the IP addresses on a VM with 2 NICs are 1 even and 1 odd. The easy way is to assign them in sequence. If you have only random IPs available, and more than 2 NICs, then use the modulus calculation on the last octet. If a LAG wasn’t an option, then the other way would be to create 2 port groups, and assign a vNIC to each. Each port group would have a single unique uplink. Remember you can have multiple port groups in the same network.

  1. Technology Short Take #40 - blog.scottlowe.org - The weblog of an IT pro specializing in virtualization, networking, cloud, servers, & Macs

    […] discussion on MPIO and in-guest iSCSI is a great reminder that designing solutions in a virtualized data center (or, dare I say […]

