ESXi hostd Crash (5.1GA) Due to Leftover SNMP Traps

One day I tried to open console on a VM and I got the MKS error:

mks error ESXi hostd Crash (5.1GA) Due to Leftover SNMP Traps

I logged into the host and I tried to run an esxcli command and I saw the following:

~ # esxcli network
Connect to localhost failed: Connection failure

It looks like hostd wasn’t happy. I quickly restarted services:

~ # services.sh restart

and that brought back hostd and the MKS errors went away. I didn’t have time to find out what was going on, so I grabbed a log bundle:

~ # vm-support

I downloaded the log bundle from the ESXi host and put it on my laptop. Here is the file:

elatov@kmac:~/es$ls *.tgz
esx-prod-vmware01-2013-08-12--14.06.tgz

I extracted the file:

elatov@kmac:~/es$ tar xvzf esx-prod-vmware01-2013-08-12--14.06.tgz

I then put the log files together since they get split apart to save space when the log bundle is generated:

elatov@kmac:~/es$cd esx-prod-vmware01-2013-08-12--14.06/
elatov@kmac:~/es/esx-prod-vmware01-2013-08-12--14.06$./reconstruct.sh

First I wanted to see what time we ran the service restart, here is when that happened:

elatov@kmac:~/es/esx-prod-vmware01-2013-08-12--14.06/var/log$grep 'services.sh restart' shell.log  | tail -1
2013-08-12T13:55:30Z shell[3389987]: [root]: services.sh restart

Around that time I saw the following in the var/log/hostd.log file:

2013-08-12T13:48:25.650Z [6FE81B90 info 'ha-eventmgr'] Event 8573 : The file table of the ramdisk 'root' is full.  As a result, the file /var/run/vmware/tickets/vmtck-5206bb19-94d6-12 could not be created by the application 'hostd-worker'.

There was actually an event fired for the “ramdisk” is full:

2013-08-12T13:48:25.645Z [6EEC2B90 error 'Cimsvc'] Connect: Failed to get ticket, not connected to cimom
2013-08-12T13:48:25.646Z [6FE81B90 info 'VmkVprobSource'] VmkVprobSource::Post event: (vim.event.EventEx) {
-->    dynamicType = <unset>,
-->    key = 825110831,
-->    chainId = 808464928,
-->    createdTime = "1970-01-01T00:00:00Z",
-->    userName = "",
-->    datacenter = (vim.event.DatacenterEventArgument) null,
-->    computeResource = (vim.event.ComputeResourceEventArgument) null,
-->    host = (vim.event.HostEventArgument) {
-->       dynamicType = <unset>,
-->       name = "prod-vmware01",
-->       host = 'vim.HostSystem:ha-host',
-->    },
-->    vm = (vim.event.VmEventArgument) null,
-->    ds = (vim.event.DatastoreEventArgument) null,
-->    net = (vim.event.NetworkEventArgument) null,
-->    dvs = (vim.event.DvsEventArgument) null,
-->    fullFormattedMessage = <unset>,
-->    changeTag = <unset>,
-->    eventTypeId = "esx.problem.visorfs.ramdisk.inodetable.full",
-->    severity = <unset>,
-->    message = <unset>,
-->    arguments = (vmodl.KeyAnyValue) [
-->       (vmodl.KeyAnyValue) {
-->          dynamicType = <unset>,
-->          key = "1",
-->          value = "root",
-->       },
-->       (vmodl.KeyAnyValue) {
-->          dynamicType = <unset>,
-->          key = "2",
-->          value = "/var/run/vmware/tickets/vmtck-5206bb19-94d6-12",
-->       },
-->       (vmodl.KeyAnyValue) {
-->          dynamicType = <unset>,
-->          key = "3",
-->          value = "hostd-worker",
-->       }
-->    ],
-->    objectId = "ha-eventmgr",
-->    objectType = "vim.HostSystem",
-->    objectName = <unset>,
-->    fault = (vmodl.MethodFault) null,
--> }

I found this VMware KB, notice some of the symptoms list the following:

Opening a virtual machine console from the vSphere Client fails with the error:

Unable to contact the MKS

That is what we saw. Apparently this is caused by enabling SNMPD. From the same KB:

This issue occurs when SNMPD is enabled and the /var/spool/snmp folder is filled with Simple Network Management Protocol (SNMP) trap files.

Looking over the esxi host, I did see SNMPD enabled:

elatov@kmac:~/es/esx-prod-vmware01-2013-08-12--14.06$cat etc/vmware/snmp.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<config>
 <snmpSettings>
   <enable>true</enable>
   <port>161</port>
   <syscontact></syscontact>
   <syslocation></syslocation>
   <EnvEventSource>indications</EnvEventSource>
   <communities>public</communities>
   <targets>prod-vmware01@161 public</targets>
   <engineid>0000006300</engineid>
   <loglevel>info</loglevel>
   <authProtocol></authProtocol>
   <privProtocol></privProtocol>
 </snmpSettings>
</config>

From KB we can double confirm if that is issue by running the following:

ls /var/spool/snmp | wc -l

Note: If the output indicates that the value is 2000 or more, then this may be causing the full inodes.

Running that on the host, we saw the following:

~ # ls /var/spool/snmp | wc -l
5385

That confirmed that we are running into that issue. This is fixed in 5.1u1, again from the KB:

This issue is resolved in ESXi 5.1 Update 1

Checking over the current version that we were on, I saw the following:

elatov@kmac:~/es/esx-prod-vmware01-2013-08-12--14.06$cat commands/vmware_-vl.txt
VMware ESXi 5.1.0 build-799733
VMware ESXi 5.1.0 GA

For now we disabled SNMP logging and also cleaned up the left over SNMP files:

# cd /var/spool/snmp
# for i in $(ls | grep trp); do rm -f $i;done

After we update to 5.1U1, I am sure the issue will be fixed.