After an HA event or a network/storage outage on VMware ESX servers (3.5 and 4.1 alike), you may end up with a VM that is down and cannot be powered on, even after you try to migrate it or to unregister and re-register it in Virtual Center.
On closer inspection, you might notice that the vswp file is still in the VM folder (a sign that the VM might still be active somewhere), yet you cannot delete the file because it is “locked”. In fact, one of the ESX hosts in the cluster owns the lock, even though the VM is not running.
So, with several hosts in the cluster, how do you work out which one is holding the lock? Let’s find out.
First of all, we have to know which ESX host is preventing the power-on.
Log in to any ESX host and run:
tail -f /var/log/vmkernel &
Now go to the locked VM's folder on the datastore and try to run:
cat vmname.vmdk
You should get some errors referring to the lock, but, more importantly, some vmkernel logs, such as:
Apr 5 09:45:26 Hostname vmkernel: 17:00:38:46.977 cpu1:1033)Lock [type 10c00001 offset 13058048 v 20, hb offset 3499520
Apr 5 09:45:26 Hostname vmkernel: gen 532, mode 1, owner 45feb537-9c52009b-e812-00137266e200 mtime 1174669462]
Apr 5 09:45:26 Hostname vmkernel: 17:00:38:46.977 cpu1:1033)Addr <4, 136, 2>, gen 19, links 1, type reg, flags 0x0, uid 0, gid 0, mode 600
Apr 5 09:45:26 Hostname vmkernel: 17:00:38:46.977 cpu1:1033)len 297795584, nb 142 tbz 0, zla 1, bs 2097152
Apr 5 09:45:26 Hostname vmkernel: 17:00:38:46.977 cpu1:1033)FS3: 132:
Now, the owner field identifies the host locking the file. Its last segment (00137266e200 in this example) is nothing but the MAC address of a network card on that ESX host!
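To compare that value against ifconfig output, it helps to have it in the usual colon-separated form. Here is a minimal sketch of the conversion; owner_to_mac is a made-up helper name, not a VMware tool, and it assumes the owner value has the same dash-separated layout as in the log above.

```shell
# Hypothetical helper (not part of ESX): convert the last segment of the
# lock "owner" value into a colon-separated MAC address.
owner_to_mac() {
    # ${1##*-} keeps only what follows the last dash (12 hex digits);
    # sed then adds a colon after every pair and drops the trailing one.
    printf '%s\n' "${1##*-}" | sed 's/../&:/g; s/:$//'
}

owner_to_mac 45feb537-9c52009b-e812-00137266e200
# prints 00:13:72:66:e2:00
```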
Now, to the boring part: you have to log in to every ESX host in the cluster and check whether any network card matches this MAC:
/sbin/ifconfig -a |grep -i 00:13:72:66:e2:00
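If the cluster has many hosts, the same check can be looped over SSH instead of logging in to each one by hand. What follows is only a sketch: find_lock_owner is a hypothetical helper name, and it assumes that root SSH access to the service console is enabled on every host, which may not be the case in your environment.

```shell
# Hypothetical helper: print the name of every host that has a network card
# matching the given MAC address.
# Usage: find_lock_owner MAC HOST...
# Assumption: root SSH logins to the ESX service console are allowed.
find_lock_owner() {
    mac=$1
    shift
    for host in "$@"; do
        # Run the same ifconfig check remotely; grep -i ignores the case
        # of the hex digits, -q suppresses the matching line itself.
        if ssh "root@$host" /sbin/ifconfig -a | grep -qi "$mac"; then
            echo "$host"
        fi
    done
}
```

For example, find_lock_owner 00:13:72:66:e2:00 esx01 esx02 esx03 would print the name of the matching host, where esx01 to esx03 are placeholder host names to be replaced with your own.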
Once identified, the host should be placed in maintenance mode from Virtual Center (DRS should do all the work of migrating the virtual machines) and then rebooted. This will release any lock and allow the VM to finally be powered on.