MCollective, RabbitMQ, and the Case of the Missing Pings

I like robust management infrastructures. They make me happy. But sometimes, tiny behaviors can send you on a wild goose chase.

Being fairly inexperienced with both MCollective and RabbitMQ, I ran into an interesting issue with ours off and on over the last couple of weeks. One night, our MCollective installation, which had been working fine for weeks or months, started to exhibit the following behavior from our control node (see the shell sketch after the list):

  1. Issuing an mco ping would return a list of all the nodes in our environment.
  2. Issuing another mco ping would cause no nodes at all to turn up.
  3. Restarting the MCollective agent on any one node would cause that node to show up in the next mco ping, but not any subsequent one.
  4. Any activity besides mco ping would fail.
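Or, as a rough shell session (the node name here is made up, and the MCollective service name varies a bit by platform):

    mco ping                                  # every node answers
    mco ping                                  # silence
    ssh web01 'service mcollective restart'   # web01 is a hypothetical node
    mco ping                                  # only web01 answers
    mco ping                                  # silence again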

This would continue for a little while, then magically resolve itself, only to present itself again at random a few days down the road.

Turning up the MCollective logging level on both the client and server, I could see that the agent was putting messages into the reply queue, but the client wasn’t receiving them, with no good indication why.
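If you've never had to do this, it's roughly the following; the paths and keys are the stock ones, so adjust for your packaging:

    # In /etc/mcollective/server.cfg on each node (and client.cfg on the control node):
    #   loglevel = debug
    #   logfile  = /var/log/mcollective.log
    # Then bounce the agent so it rereads its config, and watch the log:
    service mcollective restart
    tail -f /var/log/mcollective.log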

Digging deeper, I ran netstat -an to look into the connection state. I saw high Recv-Q and Send-Q counters on the connections, so the Erlang VM behind RabbitMQ (not epmd, which is just the Erlang Port Mapper Daemon, and definitely not Erick and Parrish Making Dollars) wasn't even pulling the data out of the socket. I took a look at some traffic dumps of MCollective running with a single agent, with the aes_security plugin disabled to make the payload easy to inspect, but that didn't reveal much either because Wireshark doesn't have a dissector for STOMP.
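If you want to run the same check, it looks something like this, assuming MCollective is speaking STOMP to RabbitMQ on the default port 61613 (swap in whatever port your connector actually uses):

    netstat -an | grep 61613
    # Recv-Q is data the kernel has accepted but the application (here, the Erlang VM)
    # hasn't read yet; Send-Q is data queued locally that the peer hasn't acknowledged.
    # Large values that never drain on an otherwise idle broker are the tell.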

So, I set up RabbitMQ on a temporary system to see what would happen. To my chagrin, that system’s MQ worked just fine. I poked around the logs on our production Puppet/MCollective/RabbitMQ system and found nothing of any value besides a bunch of notices that nodes had connected.

Since we recently upgraded the whole VMware environment that houses Puppet, MCollective and most of our other tools, I started to look into everything else. I upgraded the virtual hardware, VMware Tools, and accompanying drivers trying to figure out if it was related to our recent ESXi upgrade from 4.1 to 5.1. With the problem still occurring, I dumped the paravirtualized vmxnet3 driver entirely in favor of the standard emulated Intel e1000 driver. No dice. netstat continued to show high Recv-Q and Send-Q and the RabbitMQ management interface showed no messages traversing the system.
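For what it's worth, you don't strictly need the management interface to see that nothing is moving; rabbitmqctl will show queue depths directly. This assumes MCollective's queues live on the default "/" vhost; pass -p <vhost> if yours are on a dedicated one:

    # If replies were actually flowing, these counters would be changing.
    rabbitmqctl list_queues name messages messages_ready messages_unacknowledged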

Getting more frustrated, I completely trashed the RabbitMQ configuration and set it up again from scratch, which, it turns out, didn't help at all. mco ping, one response. mco ping again, no response. Restart the MCollective agent and mco ping again, one response. In a last-ditch effort, I updated MCollective to 2.3.1 (development) and RabbitMQ to 3.0.3 (stable, released literally that day) and tried again. No luck.

I did a bunch of digging and asked others for their thoughts, and the consensus was that RabbitMQ was deliberately dropping connections for some reason. Finally, I stumbled upon this stupid thing:

Disk-Based Flow Control

It turns out I didn’t have enough disk free on the host. Because of disk hot-grow quirks in Linux, we have Linux VMs with very small root partitions (5 GB) and separate partitions for data volumes (/var/lib/mysql, etc.), and having less than 1 GB free on the system is a really common occurrence. It turns out that the default RabbitMQ configuration doesn’t like this very much, and will throttle producers with the exact behavior that I was seeing earlier.

Dear RabbitMQ devs: a log message would be lovely when you start throttling messaging because of resource usage, thanks.

5 Comments

  1. Umm, we do write a log message when the disk alarm goes off:

    =INFO REPORT==== 19-Mar-2013::14:40:19 ===
    Disk free space insufficient. Free bytes:377729826816 Limit:2000000000000

    =WARNING REPORT==== 19-Mar-2013::14:40:19 ===
    disk resource limit alarm set on node rabbit@smacmullen

    Maybe this should try to sound scarier? Any other ideas for how we can make it clearer the alarm has gone off?

    • The problem isn’t that it’s unclear the alarm has gone off. The problem is that the particular messaging chosen here doesn’t make it clear that this is actually a bad thing that will bite you immediately and not at some arbitrary point down the road when you actually run out of disk space.

      1. “Disk free space insufficient” is a thing that impacts actual operation of a production service once rate limiting kicks in, but it’s being logged as INFO instead of WARNING or ERROR. This is kind of weird.
      2. “Alarm” is an overloaded term. Especially for people coming from something like a vSphere background, that word means “here's a thing that you should be aware of.” That's great — I'm totally aware of how much free disk my system has, because I'm monitoring it with Nagios or something like everyone else. Thanks for watching my back though, RabbitMQ! What RabbitMQ should be saying instead is something far less subtle, like “hey jerk, even though there's really no technical reason why we can't, we're deliberately not going to deliver your messages until you free up some disk space.”

      • We’ve changed the code so future rabbit versions will make it clear in the alarm message what the consequences are.

        As for “even though there's really no technical reason why we can't, we're deliberately not going to deliver your messages until you free up some disk space”… Note that RabbitMQ will happily continue to deliver messages to consumers even when alarms are active. However, it will stop reading from any message-producing clients, since there is a good chance that messages marked as ‘persistent’ can no longer be written to disk and that, furthermore, paging messages to disk in order to free up memory is likely going to fail.
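        You can also see which connections are affected from the command line; publishers that have hit the alarm show up with a blocked or blocking state (exact naming may vary a little by version), while consumer-only connections stay running:

          # Connections stuck in 'blocked'/'blocking' are publishers being throttled by the alarm.
          rabbitmqctl list_connections name state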

  2. Ok, this is a VERY unhelpful error condition. For me, the symptom appeared as “My celery task is hanging. Why is it hanging? It's not even hitting the timeout – why on earth is the timeout not being respected? Is the backend wrong? No, the backend is set correctly. Is celery raising any errors? No, no errors there… What on earth could it be?”

    Anyway, a fix is to create /etc/rabbitmq/rabbitmq.config:
    [
    {rabbit, [{disk_free_limit, 10}]}
    ].

    And then to ensure it's picked up, restart the broker (note: on the Debian and RPM packages /etc/rabbitmq/rabbitmq.config is already the default config location; otherwise the variable to set is RABBITMQ_CONFIG_FILE, it takes the path without the .config extension, and it belongs in /etc/rabbitmq/rabbitmq-env.conf rather than your shell):
    service rabbitmq-server restart

    That makes rabbit raise the disk alarm only when there are fewer than 10 bytes free (and if that triggers, you've got worse problems).

    The information on how to actually set that sort of config option is very well hidden, I must say. And this blog post is practically the ONLY mention of the error message in the log that causes the failure. So thank you for the post.
