I like robust management infrastructures. They make me happy. But sometimes, tiny behaviors can send you on a wild goose chase.
Being fairly inexperienced with both MCollective and RabbitMQ, though, I ran into an interesting issue with ours off and on over the last couple of weeks. One night, our MCollective installation, which had been working fine for weeks or months, started to exhibit the following behavior from our control node:
- Issuing an mco ping would return a list of all the nodes in our environment.
- Issuing another mco ping would cause no nodes at all to turn up.
- Restarting the MCollective agent on any one node would cause that node to show up in the next mco ping, but not any subsequent one.
- Any activity besides mco ping would fail.
This would continue for a little while, then magically resolve itself until it would randomly present itself again a few days down the road.
Turning up the MCollective logging level on both the client and server, I could see that the agent was putting messages into the reply queue, but the client wasn’t receiving them, with no good indication why.
Digging deeper, I ran netstat -an to look into the connection state. I saw high Recv-Q and Send-Q counters associated with the connections, so epmd (Erlang’s TCP multiplexer, not Erick and Parrish Making Dollars) wasn’t even pulling the data out of the socket. I took a look at some traffic dumps of MCollective running with a single agent, with the aes_security plugin disabled to make the payload easy to inspect, but that didn’t reveal much either because Wireshark doesn’t have a dissector for STOMP MQ.
So, I set up RabbitMQ on a temporary system to see what would happen. To my chagrin, that system’s MQ worked just fine. I poked around the logs on our production Puppet/MCollective/RabbitMQ system and found nothing of any value besides a bunch of notices that nodes had connected.
Since we recently upgraded the whole VMware environment that houses Puppet, MCollective and most of our other tools, I started to look into everything else. I upgraded the virtual hardware, VMware Tools, and accompanying drivers trying to figure out if it was related to our recent ESXi upgrade from 4.1 to 5.1. With the problem still occurring, I dumped the paravirtualized vmxnet3 driver entirely in favor of the standard emulated Intel e1000 driver. No dice. netstat continued to show high Recv-Q and Send-Q and the RabbitMQ management interface showed no messages traversing the system.
Getting more frustrated, I completely trashed the RabbitMQ configuration and set it up again from scratch, which, it turns out, didn’t help at all. mco ping, one response. mco ping again, no response. Restart the MCollective agent and mco ping again, one response. In a last-ditch effort, I updated MCollective to 2.3.1 (development) and RabbitMQ 3.0.3 (stable, released literally that day) and tried again. No luck.
Doing a bunch of digging, and asking others for their thoughts, the consensus was that RabbitMQ was deliberately dropping connections for some reason. Finally, I stumbled upon this stupid thing:
It turns out I didn’t have enough disk free on the host. Because of disk hot-grow quirks in Linux, we have Linux VMs with very small root partitions (5 GB) and separate partitions for data volumes (/var/lib/mysql, etc.), and having less than 1 GB free on the system is a really common occurrence. It turns out that the default RabbitMQ configuration doesn’t like this very much, and will throttle producers with the exact behavior that I was seeing earlier.
Dear RabbitMQ devs: a log message would be lovely when you start throttling messaging because of resource usage, thanks.