How to map a raw LUN when ESXi doesn’t want you to…

VMware ESXi is a very popular hypervisor and the underpinning of the vSphere virtualisation suite, which caters for everyone from enthusiasts up to large-scale datacenter operations for enterprises and service providers. VMware offers a variety of software to achieve virtualisation, with Workstation, Server, ESX and ESXi being the main flavours, each with a different focus: desktop use, running on top of an existing server OS, with a Linux service console, and without one. ESXi is the variant with the smallest footprint and is ideal if you don’t need to implement a bunch of monitoring or other scripting on the host, are going to do all that management elsewhere, or just don’t need to do any. As with the trusty old VMware Server, ESX and ESXi are available with a free license that gives you lots of room to play with but removes some of the more advanced features (like being able to move VMs between hosts whilst they are running).

Playing with your LUNs

I’ve been a long-time VMware Server user, since before ESX existed. In reality, I’ve used virtualisation since before the current market for it existed – my undergraduate software project at UKC in the early ’90s was to build, as part of a three-guy team, a virtual machine monitor (VMM) – what is now called a hypervisor. When VMware got started they even emailed me a couple of times to talk about it. Most recently I’ve been running the 64-bit variant of VMware Server 2 on my previous local NAS/VM host box, which runs Ubuntu. That machine had roughly 6TB of space to play with and VMware could use as much or as little of it as I wanted.

I recently upgraded hardware and decided to give ESXi 4.1 a go, friends having reported positive things and the concept appealing to me (since it mirrors exactly what I did almost 20 years ago). One key quirk is a limitation on file system size. In general, VMware provides virtual storage space by mapping the space occupied by a file on the host machine, via software, to the logical storage provided by a virtual host bus adapter (HBA; see the notes on terminology at the end of this post). On ESX and ESXi the common approach is for VMware to format storage volumes using its own file system format, VMFS, which is optimised for large blocks. For most file systems the maximum size of a single file system is a function of the size of its blocks – the smallest allocatable unit of space – since blocks are addressed by number and numbers on computers have a finite range. File systems also impose limits on the block size, and for VMFS the maximum is 8MB, which translates into a 2TB file system limit. ESX and ESXi also support presenting a raw device, via its LUN, to virtual machines – but only under certain circumstances, as discussed below.
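As an aside on where 2TB ceilings classically come from in storage stacks: a 32-bit block address over 512-byte sectors gives exactly that figure. I’m assuming, rather than asserting, that this is the same constraint VMFS and its drivers run into, but the arithmetic is easy to check:

[cc lang=”bash”]
# A 32-bit block address over 512-byte sectors tops out at 2 TiB
echo "$(( (2**32 * 512) / 1024**4 )) TiB"   # prints: 2 TiB
[/cc]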

Moreover, if you present a volume larger than 2TB to ESXi (and, I presume, ESX) it gets very confused: in my case, it refused to create a file system larger than ~900MB. This was less than optimal – I had 12TB of space and ESXi was being dumb with it. But there are workarounds.

Workaround 1: Concatenation of VMFS volumes

The first, and simplest, workaround, since ESXi has a 2TB limit, is to only present volumes to it that are 2TB or less. Depending on the underlying infrastructure there are a variety of ways to do this – with partitioning, or with logical volumes on a hardware RAID controller or SAN. VMware can then install VMFS onto these volumes and you can present them to the VM.

Most operating systems offer tools to provide software RAID. This is usually avoided for the higher RAID levels because of the computational requirements, but if your underlying storage infrastructure is already providing the fault tolerance there will be only a minimal impact if you use RAID-0, or simple block-level striping. If your hardware already stripes you will probably want to concatenate rather than stripe, but the concept is similar. On FreeBSD you can use GEOM (gconcat or gstripe) or the older ccd driver; on Linux the md (multiple devices) software-RAID driver or, alternatively, the Logical Volume Manager (LVM) will oblige; and with Windows you can use dynamic disks and extend NTFS volumes across them.
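To make the Linux md route concrete, a minimal sketch might look like the following (linear mode concatenates rather than stripes; the device names are examples only, and this isn’t the approach I ended up using):

[cc lang=”bash”]
# Install mdadm if you don't have it already
sudo apt-get install mdadm

# Concatenate four devices into a single linear array
# (swap --level=linear for --level=0 if you want striping instead)
sudo mdadm --create /dev/md0 --level=linear --raid-devices=4 \
    /dev/sdb /dev/sdc /dev/sdd /dev/sde
[/cc]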

In my case, using Ubuntu Server, I opted for LVM, and setting up a simple concatenated volume is really very easy. I created four 1.8TB volumes on the RAID controller, created file stores on them in ESXi, created virtual disks (VMDKs) on those file stores and mapped them to SCSI controller targets in the virtual machine configuration. Linux then sees these four volumes as ordinary SCSI devices: sdb, sdc, sdd and sde. Using the whole of these virtual devices to create one large volume group, and then carving it up into two logical volumes for [cci]/home[/cci] and [cci]/data[/cci], is quite easy to do.

[cc lang=”bash”]
# Install LVM if you don’t have it already
sudo apt-get install lvm2

# Older Ubuntu needs this too:
sudo modprobe dm-mod

# Label the virtual devices as managed by LVM
sudo pvcreate -v /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Create a volume group with those devices
sudo vgcreate -v vmraw /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Create a /home volume
sudo lvcreate -n home -L 64G vmraw

# Use the rest of the space for /data
sudo lvcreate -n data -l 100%FREE vmraw

# Initialise the file systems
sudo mkfs.ext4 -j -L home /dev/vmraw/home
sudo mkfs.xfs -L data /dev/vmraw/data
[/cc]

And then add these lines to [cci]fstab[/cci]:

[cc title=”Append to /etc/fstab”]
LABEL=home /home ext4 defaults,acl 0 2
LABEL=data /data xfs defaults 0 2
[/cc]
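If the mount points don’t exist yet, creating them and mounting everything listed in fstab is quick – a trivial sketch ([cci]/home[/cci] will already exist on a stock install, so the mkdir is harmless there):

[cc lang=”bash”]
# Create the mount points if needed, then mount everything listed in fstab
sudo mkdir -p /home /data
sudo mount -a
[/cc]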

With the mount points in place and the volumes mounted, here’s some [cci]df[/cci] output after I loaded the filesystems with some data:

[cc title=”df -h”]
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 48G 3.7G 42G 9% /
none 2.0G 188K 2.0G 1% /dev
none 2.0G 0 2.0G 0% /dev/shm
none 2.0G 1.8M 2.0G 1% /var/run
none 2.0G 8.0K 2.0G 1% /var/lock
none 48G 3.7G 42G 9% /var/lib/ureadahead/debugfs
/dev/mapper/vmraw-home
63G 14G 47G 23% /home
/dev/mapper/vmraw-data
7.3T 3.1T 4.2T 42% /data
[/cc]

Job done.

Workaround 2: Concatenation of directly mapped LUNs

Wait a moment, doesn’t that all seem horribly inefficient?

File system → LVM → Linux HBA driver → Virtual HBA hardware → VMDK file → VMFS → ESXi HBA driver → Physical HBA hardware → Disks

That’s a lot of layers that all have to do significant amounts of work in an I/O stack – in particular in having to map block numbers from one view of the storage to another. Worse, most of it runs on your CPU, using up cycles that could be used doing useful work.

We can remove some of this by mapping the volumes from the RAID controller directly into the virtual HBA presented to the guest. ESX and ESXi provide a virtual hard disk type called a “Raw Device Mapping” (RDM) which does exactly this. However, the feature is only made available for remotely-attached storage (a SAN, or an HBA with its storage attached to external connectors).

I’ve not seen any definitive reason why they imposed this limitation; there’s certainly no technical reason for it. My suspicion is that it’s to prevent a good number of their customers from shooting themselves in the proverbial foot. By mapping storage in a virtual machine to a physical unit of storage you form a very strong anchor between the machine and that storage. A key part of vSphere, in the upper echelons of the licensing map, is the ability to migrate your virtual installations between hosts at will. VMDK files nicely abstract storage into a quantifiable object that VMware has complete control over; volumes on physical hardware are beyond its control. So, to reduce support calls, I believe they disabled the feature for local storage.

However, it’s only disabled insofar as configuring it with the graphical tools is concerned. Using the remote command line tools or the tech-support CLI you can use the right magic to create the raw device mapping you need. Two commands are key to this: [cci]esxcfg-scsidevs[/cci], to list the paths that VMware uses to reference all of the volumes presented to it by the hardware, and [cci]vmkfstools[/cci], to create the mapping. On my system I renamed the volumes under the host’s Configuration → Storage Adapters page in the vSphere client to make them easier to spot, and on that page you can see the LUN values for each volume. [cci]esxcfg-scsidevs[/cci] told me this:

[cc title=”/sbin/esxcfg-scsidevs -c”]
Device UID Device Type Console Device Size Multipath Plugin Display Name
mpx.vmhba33:C0:T0:L0 CD-ROM /vmfs/devices/cdrom/mpx.vmhba33:C0:T0:L0 0MB NMP Local USB CD-ROM (mpx.vmhba33:C0:T0:L0)
mpx.vmhba33:C0:T0:L1 Direct-Access /vmfs/devices/disks/mpx.vmhba33:C0:T0:L1 0MB NMP Local USB Direct-Access (mpx.vmhba33:C0:T0:L1)
naa.6842b2b0229e1800145d5234061dc8ee Direct-Access /vmfs/devices/disks/naa.6842b2b0229e1800145d5234061dc8ee 139392MB NMP Bell Raid1 Boot
naa.6842b2b0229e1800145d523406237f53 Direct-Access /vmfs/devices/disks/naa.6842b2b0229e1800145d523406237f53 304128MB NMP Bell Raid0 Scratch
naa.6842b2b0229e1800146855550c728258 Direct-Access /vmfs/devices/disks/naa.6842b2b0229e1800146855550c728258 1906674MB NMP Bell Raid6 Data0
naa.6842b2b0229e18001468573b295e172f Direct-Access /vmfs/devices/disks/naa.6842b2b0229e18001468573b295e172f 1906674MB NMP Bell Raid6 Data1
naa.6842b2b0229e1800146857b730cd92a5 Direct-Access /vmfs/devices/disks/naa.6842b2b0229e1800146857b730cd92a5 1906674MB NMP Bell Raid6 Data2
naa.6842b2b0229e1800146857e0333bc891 Direct-Access /vmfs/devices/disks/naa.6842b2b0229e1800146857e0333bc891 1906674MB NMP Bell Raid6 Data3
naa.6842b2b0229e1800146858543a1fc2f1 Direct-Access /vmfs/devices/disks/naa.6842b2b0229e1800146858543a1fc2f1 1906686MB NMP Bell Raid6 Data4
naa.6842b2b0229e18001468587e3c9c498e Direct-Access /vmfs/devices/disks/naa.6842b2b0229e18001468587e3c9c498e 1909818MB NMP Bell Raid6 Data5
t10.DP______BACKPLANE000000 Enclosure Svc Dev /vmfs/devices/genscsi/t10.DP______BACKPLANE000000 0MB NMP Dell PERC H700
[/cc]

The four devices I am interested in are “Bell Raid6 Data2” to “Bell Raid6 Data5”. Take note of the paths listed under “Console Device”.

Next you need to find where the virtual machine configuration lives. Each volume that has been formatted with VMFS is mounted at a path that uses the same ID as its underlying volume – you can see them by typing [cci]mount[/cci]. In the list above you can see the volume I labelled “Bell Raid1 Boot”. I also gave the VMFS file system the same label, and ESXi conveniently installs a symlink from that name to where it is mounted, so I can [cci]cd “/vmfs/volumes/Bell Raid1 Boot/”[/cci] and end up in the root of that file system. My VM’s configuration lives in a directory off there, and I change into that.
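In practice that looks something like this ([cci]myvm[/cci] is a placeholder – substitute the name of your virtual machine’s directory):

[cc lang=”bash”]
# Datastores appear under /vmfs/volumes, with a symlink named after the VMFS label
ls -l /vmfs/volumes/

# Change into the datastore, then into the VM's own directory
cd "/vmfs/volumes/Bell Raid1 Boot/myvm"
[/cc]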

The next bit of magic is using [cci]vmkfstools[/cci]. In previous releases of VMware you had to create a raw mapping by cloning another virtual disk and importing it into the LUN. ESXi 4.1 lets you skip that fuss and do just what you want, thus:

[cc lang=”bash”]
/sbin/vmkfstools -r /vmfs/devices/disks/naa.6842b2b0229e1800146857b730cd92a5 -a lsilogic disk_1_2.vmdk
/sbin/vmkfstools -r /vmfs/devices/disks/naa.6842b2b0229e1800146857e0333bc891 -a lsilogic disk_1_3.vmdk
/sbin/vmkfstools -r /vmfs/devices/disks/naa.6842b2b0229e1800146858543a1fc2f1 -a lsilogic disk_1_4.vmdk
/sbin/vmkfstools -r /vmfs/devices/disks/naa.6842b2b0229e18001468587e3c9c498e -a lsilogic disk_1_5.vmdk
[/cc]

This creates four virtual disks that map directly to the logical units presented to us by the HBA.
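If you want to sanity-check what a mapping file points at before attaching it, [cci]vmkfstools[/cci] can query it. This is from memory, so do check the usage output of [cci]vmkfstools[/cci] on your build if the option differs:

[cc lang=”bash”]
# List the attributes of the raw device mapping, including the backing device
/sbin/vmkfstools -q disk_1_2.vmdk
[/cc]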

These VMDKs can be added to your virtual machine using the vSphere client in the normal way – just tell it they are existing virtual disks and it works out the rest. It will list them as “Mapped Raw LUN” in the VM configuration. Just make sure you attach them to a SCSI controller of the same type as used above (lsilogic – see the note on virtual SCSI controllers at the end).

After that I use LVM again, in exactly the same way, to concatenate the volumes together inside the guest, and as a result a good amount of processing and block-number mapping has been eliminated.

Workaround 3: Can I directly map a big LUN?

The next logical step in this train of thought is… why use LVM at all? If we can map LUNs on the host to LUNs in the VM, why not create one big volume on the hardware and pass this right through?

I do not have an answer other than “I believe it should work, but I’ve not tested it and anytime I search online I see lots of advice against it.”

Right now my suspicion is that VMware imposed the 2TB limitation for a reason: either their physical or their virtual hardware drivers don’t work beyond 2TB. If that is so, then any mapped LUN larger than that will also fail, because it will not be possible to pass through commands with a big enough block number. Or perhaps, even if the drivers could cope, some of the ancillary features, like snapshots, can’t.

I may test it at some point, but I suspect the performance gains would be marginal, and the current setup already performs very well.

Issues with using mapped LUNs

The obvious issue is that if the use of mapped LUNs on local storage is not supported, then anyone doing it should not cry if or when it breaks. Worth noting in particular: none of the vSphere configuration and summary screens shows the mapped LUNs as being in use, which makes it easy to make a mistake and lose data.

Also worth thinking about is that VMware is not going to be able to do anything sensible with snapshots – experiment with it, but don’t rely on it.

Notes on terminology

For this discussion, a host bus adapter (HBA) is the hardware that provides access to storage devices, be that the SATA controller on the motherboard, a RAID controller, a Fibre Channel interface, or their virtual equivalents provided in software by a hypervisor.

There is a lot of terminology when it comes to storage, complicated by the multiple layers between the disk and whatever ultimately uses it. A volume most often refers to an administratively defined portion of some storage medium. Usually called a logical volume, its function is to present what looks like a linear, contiguous amount of storage space and to hide from the system however it is actually constructed (a single whole disk, partitions on a disk, a RAID system, a concatenation of various storage areas from various places, and so on).

A Logical Unit Number (LUN) is, strictly, the address of a storage volume on a given HBA. Its use has, however, been generalised to also refer to the volume itself, wherever it may reside, and to its presentation to the system. Some people use LUN and volume interchangeably, though they refer to different, but related, underlying concepts.

RAID is a mature set of storage technologies that provide varying levels of performance and fault tolerance across multiple physical storage devices. A RAID controller is a device (dedicated hardware or a general-purpose software driver) that presents logical volumes to a system and hides the complexity of providing that performance or fault tolerance. Hardware controllers often contain local cache memory that can significantly mitigate the performance penalties some of the fault-tolerant schemes introduce.

A storage area network (SAN) is, broadly speaking, an arrangement in which the RAID controller sits close to the physical storage and the connection between it and the users of that storage is extended over a dedicated network, usually Fibre Channel or iSCSI. SANs often provide advanced features such as sharing a volume across multiple client devices simultaneously, or greater degrees of fault tolerance.

The LSI Logic virtual HBA can be virtualised into 64-bit guests; BusLogic cannot. I have not tested the paravirtualised controller.