Expanding a disk on a virtual machine is a relatively easy and common task in a VMware environment. So it shocked me when I tried to expand the disk on a production VM and the operation not only completed with an error but also powered the VM off.
Luckily I was on the console and saw that the console window went black. I tried to establish an RDP connection to it because my next step was to expand the disk within Windows, and I found that I could not connect.
My next step was to go back to vCenter Server and find the VM, only to confirm what I had seen seconds earlier: The VM was indeed powered off.
I quickly powered up the SQL server and notified the DBA who had requested the extra disk space. Since it was lunchtime, no end users noticed the server go down, but our network monitoring did, and there were plenty of emails alerting us that the server was down. Again, it was lunchtime, so there was no fuss.
Here are the errors shown in vCenter Server for the VM I was having an issue with (the screenshots do not show its real name):
Details: Reconfiguring Virtual Machine on destination host
Status: Disconnected from virtual machine. Remote connection failure Failed to establish transport connection.
Error stack: Failed to establish transport connection.
Disconnected from virtual machine.
Description: Task: Reconfigure virtual machine
According to VMware support, although the issue I experienced is not specifically called out in the related KB article (https://kb.vmware.com/s/article/90283), it is tied to a new ioFilter named vmwarelwd that is leveraged by VMware Cloud Disaster Recovery (VCDR) and other data protection solutions.
We had attempted to use VCDR some time ago but had to pull the plug because it could not satisfy some of our requirements. Even though we had cleaned up the cloud configuration, removed the VMs from the protection policies, and deleted the on-prem appliances, VCDR left some configurations on our VMs, which triggered the issue we were experiencing.
While the KB article is clear on how to remove the filter from the VMs, take note that doing so will reset CBT (Changed Block Tracking), causing your next backup to be a full backup instead of an incremental or differential one, depending on what type of backups you have configured in your environment.
Support also pointed out the log events that revealed the cause of the issue:
- 2023-06-26T19:18:32Z[+0.000] In(05) vmx lely4gdd-1853790-auto-13qe7-h5:70212649-25-01-a7-0d16 SymBacktrace 0000035433a3df90 rip=000000000031300f
- 2023-06-26T19:18:32Z[+0.000] Cr(01) vmx lely4gdd-1853790-auto-13qe7-h5:70212649-25-01-a7-0d16 PANIC: Unexpected signal: 11.
- 2023-06-26T19:18:32Z[+1.092] Wa(03) vmx lely4gdd-1853790-auto-13qe7-h5:70212649-25-01-a7-0d16 A core file is available in "/vmfs/volumes/614cf43d-05f0bc3c-ee20-f8f44ea47fa3/PRODSQL01/vmx-zdump.001"
- 2023-06-26T19:18:34.452Z Wa(03) mks - Panic in progress... ungrabbing
- 2023-06-26T19:18:34.452Z In(05) mks - MKS: Release starting (Panic)
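If you want to check other VMs for the same kind of crash, the log lines above are regular enough to scan for. The following is a minimal Python sketch, not a VMware tool: the function name `find_panic_events` and the regular expressions are my own assumptions, based only on the format of the lines quoted in this post. It looks for the `PANIC:` entry and the core file notice in text copied from a VM's log.

```python
import re

# Patterns are assumptions derived from the log lines quoted in this post,
# not from any official VMware log format specification.
TS_RE = re.compile(r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z)")
PANIC_RE = re.compile(r"PANIC: (?P<reason>.+)$")
CORE_RE = re.compile(r'A core file is available in "(?P<path>[^"]+)"')


def find_panic_events(lines):
    """Return a list of dicts describing panic and core-file log entries."""
    events = []
    for line in lines:
        ts_match = TS_RE.search(line)
        timestamp = ts_match.group("ts") if ts_match else None

        panic = PANIC_RE.search(line)
        if panic:
            events.append({"timestamp": timestamp,
                           "kind": "panic",
                           "detail": panic.group("reason").strip()})
            continue

        core = CORE_RE.search(line)
        if core:
            events.append({"timestamp": timestamp,
                           "kind": "core",
                           "detail": core.group("path")})
    return events


# Sample input: three of the lines quoted above.
SAMPLE_LINES = [
    "2023-06-26T19:18:32Z[+0.000] Cr(01) vmx lely4gdd-1853790-auto-13qe7-h5:"
    "70212649-25-01-a7-0d16 PANIC: Unexpected signal: 11.",
    "2023-06-26T19:18:32Z[+1.092] Wa(03) vmx lely4gdd-1853790-auto-13qe7-h5:"
    '70212649-25-01-a7-0d16 A core file is available in '
    '"/vmfs/volumes/614cf43d-05f0bc3c-ee20-f8f44ea47fa3/PRODSQL01/vmx-zdump.001"',
    "2023-06-26T19:18:34.452Z Wa(03) mks - Panic in progress... ungrabbing",
]

events = find_panic_events(SAMPLE_LINES)
for event in events:
    print(event["timestamp"], event["kind"], event["detail"])
```

The match on `PANIC:` is deliberately case sensitive, so the follow-up mks "Panic in progress" messages are not reported as separate crashes. To scan a real log, you could pass an open file object instead of `SAMPLE_LINES`, since the function only iterates over lines.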