Recently, I’ve been testing a few (seven) Pacemaker clusters. It’s a lot more complicated than the olden days of Heartbeat, when clustering was basically as simple as pushing out a configuration file and a resource description file to the cluster engine and letting it go to work. So far, though, it’s really been working pretty well. Most of the problems you encounter with it can be resolved with a simple crm resource cleanup <resource_group_name> once you’re sure that the underlying issues (disk, network, whatever) have been resolved.
This weekend, though, I ran into an interesting problem: even though none of the clustered services require LDAP, the cluster nodes themselves are configured to use it for authentication in PAM/nsswitch. The LDAP server went down, and because of how it went down (the service hung hard rather than crashing outright, leading to lots of timeouts) Pacemaker started to run into problems. Specifically, every cluster-managed service on every server attached to this LDAP server went down, and every one of them required manual intervention to bring back online after LDAP functionality was restored.
Apparently, all of the status checks for the managed cluster services timed out, presumably while trying to look up group membership information for the running processes associated with the cluster-managed services. Pacemaker, not liking it when status checks time out, safely assumed that something must be wrong with the process. It terminated the service, and handed it over to the failover node in the cluster, which promptly did the exact same thing with it.
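To make that chain of events concrete, here is a minimal Python sketch of a status check wrapped in a timeout. It is not Pacemaker's actual implementation; the names run_monitor, decide, and check_with_hung_lookup are hypothetical, and the policy in decide is my simplified assumption of "anything that isn't a clean success counts as a failure."

```python
import threading

def run_monitor(check, timeout):
    """Run check() in a worker thread and report 'ok', 'failed', or 'timeout'."""
    result = {}

    def worker():
        try:
            result["ok"] = bool(check())
        except Exception:
            result["ok"] = False

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)
    if t.is_alive():
        # The check never returned: from the outside this is
        # indistinguishable from a broken service.
        return "timeout"
    return "ok" if result["ok"] else "failed"

def decide(status):
    # Assumed policy for illustration: only a clean "ok" keeps the service in place.
    return "keep running" if status == "ok" else "stop and fail over"

release = threading.Event()

def healthy_check():
    return True

def check_with_hung_lookup():
    # Stand-in for a status check that blocks on a directory lookup
    # which never answers. The service itself is perfectly healthy.
    release.wait()
    return True

if __name__ == "__main__":
    for check in (healthy_check, check_with_hung_lookup):
        status = run_monitor(check, timeout=0.5)
        print(check.__name__, "->", status, "->", decide(status))
    release.set()  # let the blocked worker thread finish
```

The point of the sketch is that the healthy service behind check_with_hung_lookup gets exactly the same treatment as a dead one, and the failover node, depending on the same hung lookup, would reach the same verdict.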
Lesson learned about stability testing: it’s important to test that a system still functions when the services it depends on misbehave, not just when they are outright unreachable. In this instance, everything worked fine when the LDAP server was flat-out down: the LDAP connections just timed out, the system picked up the local configuration from passwd/group, and everything was peachy. However, I failed to consider what happens when the connection itself works fine, but the service on the other end just isn’t doing anything.
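One way to add the "hung, not down" case to a test plan is to stand up a fake service that accepts TCP connections but never answers. The following is a rough Python sketch under my own assumptions, not a real LDAP client or server: probe, start_hung_listener, and closed_port are hypothetical helpers, and the "ping" payload is arbitrary.

```python
import socket

def probe(host, port, timeout=1.0):
    """Classify a TCP service as 'down', 'hung', or 'responding'."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "down"  # refused or unreachable: the easy case to test
    with conn:
        conn.settimeout(timeout)
        try:
            conn.sendall(b"ping\n")
            data = conn.recv(1)
        except socket.timeout:
            return "hung"  # connected fine, but nobody is answering
        except OSError:
            return "down"
        return "responding" if data else "down"

def start_hung_listener():
    """Listen on a free local port but never accept() or reply."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(5)
    return srv

def closed_port():
    """Return a local port number that was just released, so nothing listens on it."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))
    port = s.getsockname()[1]
    s.close()
    return port

if __name__ == "__main__":
    srv = start_hung_listener()
    print("hung listener:", probe("127.0.0.1", srv.getsockname()[1], timeout=0.5))
    srv.close()
    print("closed port:  ", probe("127.0.0.1", closed_port(), timeout=0.5))
```

Pointing a test system at something like the hung listener, instead of just unplugging the real server, exercises the timeout paths that a clean outage never touches.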
The other lesson learned: when you’re dealing with centralized services, make sure your high availability solution is configured absolutely correctly, and is resilient enough to deal with multiple kinds of application failures. Many service issues can have far broader implications than you imagine.