Resilient infrastructures are only useful if they actually stay resilient

Ask yourself a question: for every piece of resiliency you supposedly have in your network, are you really positive that it’s not running in a degraded state? Really, really sure?

Sometimes, it’s basic: are you being alerted when any disk array attached to any server suffers a disk failure?
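
For Linux software RAID, that check can be as simple as watching /proc/mdstat for a degraded member. Here's a minimal sketch of the idea; hardware arrays and external enclosures need the vendor's own CLI instead, and the Nagios-style exit codes are just one convention:

```python
#!/usr/bin/env python3
"""Minimal sketch: alert if any Linux md software RAID array is degraded.
Hardware arrays need their vendor's CLI; this only shows the
"check it, don't assume it" idea."""
import re
import sys

def degraded_md_arrays(mdstat_path="/proc/mdstat"):
    """Return names of md arrays whose member status contains '_'
    (a missing or failed member), e.g. '[U_]' instead of '[UU]'."""
    degraded = []
    current = None
    with open(mdstat_path) as f:
        for line in f:
            m = re.match(r"^(md\d+)\s*:", line)
            if m:
                current = m.group(1)
            # Status line looks like: "1048512 blocks [2/2] [UU]"
            status = re.search(r"\[\d+/\d+\]\s+\[([U_]+)\]", line)
            if status and current and "_" in status.group(1):
                degraded.append(current)
    return degraded

if __name__ == "__main__":
    bad = degraded_md_arrays()
    if bad:
        print("CRITICAL: degraded arrays: " + ", ".join(bad))
        sys.exit(2)  # Nagios-style exit code
    print("OK: all md arrays healthy")
    sys.exit(0)
```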

Very often, it’s not: for your SAN-attached systems, are you positive that the multipathing is green? If you’re connected to two storage processors or controllers, can the server see two paths to each of them? Are you getting alerted if you can’t?
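
Here's a rough sketch of what that check might look like for Linux device-mapper multipath, parsing `multipath -ll` and comparing each device against an expected path count. The count of 4 and the "alias (wwid) dm-N" header format it parses are assumptions you'd adjust for your own fabric and multipath.conf naming:

```python
#!/usr/bin/env python3
"""Minimal sketch: count healthy paths per multipath device on Linux and
warn when any device has fewer than expected. EXPECTED_PATHS is an
assumption (e.g. 2 HBAs x 2 controllers = 4); the parsing assumes
user_friendly_names-style headers like "mpathb (3600...) dm-2 ...".
Requires root, like multipath itself."""
import re
import subprocess
import sys

EXPECTED_PATHS = 4  # assumption: dual fabric, dual controller

def path_counts():
    """Parse `multipath -ll` output into {device: healthy_path_count}."""
    out = subprocess.run(["multipath", "-ll"], capture_output=True,
                         text=True, check=True).stdout
    counts, current = {}, None
    for line in out.splitlines():
        # A map header starts at column 0: "mpathb (3600...) dm-2 VENDOR,MODEL"
        m = re.match(r"^(\S+)\s+\(", line)
        if m:
            current = m.group(1)
            counts[current] = 0
        elif current and "active ready" in line:
            counts[current] += 1
    return counts

if __name__ == "__main__":
    short = {d: n for d, n in path_counts().items() if n < EXPECTED_PATHS}
    if short:
        for dev, n in short.items():
            print(f"WARNING: {dev} has {n}/{EXPECTED_PATHS} healthy paths")
        sys.exit(1)
    print("OK: all multipath devices have the expected path count")
    sys.exit(0)
```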

Are your port channels running over the number of links that they’re supposed to? How about the ISLs on your FC fabrics?
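
The switch-side and FC-fabric checks usually go through SNMP or the vendor CLI, but the idea is the same everywhere. Here's a sketch against a Linux bond, where the member state is easy to read from /proc/net/bonding; the bond name and expected link count are assumptions:

```python
#!/usr/bin/env python3
"""Minimal sketch: verify a Linux bonding interface has the expected number
of healthy member links. Switch port channels and FC ISLs need the same
kind of check via SNMP or the switch CLI; the bond file is just the
simplest place to show the idea. BOND and EXPECTED_SLAVES are assumptions."""
import sys

BOND = "bond0"          # assumption
EXPECTED_SLAVES = 2     # assumption

def healthy_slaves(bond=BOND):
    """Count member interfaces reporting 'MII Status: up' in
    /proc/net/bonding/<bond>."""
    healthy, in_slave = 0, False
    with open(f"/proc/net/bonding/{bond}") as f:
        for line in f:
            if line.startswith("Slave Interface:"):
                in_slave = True
            elif in_slave and line.startswith("MII Status:"):
                if "up" in line:
                    healthy += 1
                in_slave = False
    return healthy

if __name__ == "__main__":
    n = healthy_slaves()
    if n < EXPECTED_SLAVES:
        print(f"CRITICAL: {BOND} has {n}/{EXPECTED_SLAVES} links up")
        sys.exit(2)
    print(f"OK: {BOND} has all {n} links up")
    sys.exit(0)
```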

If you have failover clusters where services run on preferred nodes, are you sure they’re actually located where they’re supposed to be? Are you monitoring that services are all running on their preferred nodes?
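
A sketch of that kind of check, with the cluster-manager query left as a placeholder since the real lookup depends on whether you're running Pacemaker, Windows failover clustering, or something else:

```python
#!/usr/bin/env python3
"""Minimal sketch: flag clustered services that aren't on their preferred
node. The PREFERRED map and current_placement() are placeholders; in
practice you'd feed them from your cluster manager (crm_mon, pcs status,
Get-ClusterGroup, etc.)."""
import sys

# Assumption: your own service -> preferred node map, kept alongside the
# cluster configuration.
PREFERRED = {
    "sql-prod": "node-a",
    "fileserver": "node-b",
}

def current_placement():
    """Placeholder: return {service: node} as the cluster manager reports it.
    Replace this with a real query against your clustering stack."""
    return {
        "sql-prod": "node-a",
        "fileserver": "node-a",  # e.g. failed over and never failed back
    }

if __name__ == "__main__":
    running = current_placement()
    misplaced = [
        f"{svc} is on {running.get(svc, 'unknown')}, preferred {node}"
        for svc, node in PREFERRED.items()
        if running.get(svc) != node
    ]
    if misplaced:
        print("WARNING: " + "; ".join(misplaced))
        sys.exit(1)
    print("OK: all services are on their preferred nodes")
    sys.exit(0)
```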

If you have asymmetric fall-back connections, like a gigabit switch uplink used to back up a 10-gigabit switch uplink, are you notified when it’s using the backup connection, or do you rely on your users to tell you that things seem to be running slowly?
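
One way to catch that from a host is to compare the operational link speed against what you expect; for switch uplinks you'd poll ifOperStatus and ifHighSpeed over SNMP instead. A sketch, with the interface name and expected speed as assumptions:

```python
#!/usr/bin/env python3
"""Minimal sketch: alert when a link is running below its intended speed,
e.g. traffic riding a 1 Gb backup instead of the 10 Gb primary. This reads
a Linux host's sysfs; the switch-side equivalent is SNMP polling.
IFACE and MIN_MBPS are assumptions."""
import sys

IFACE = "eth0"       # assumption: the uplink-facing interface
MIN_MBPS = 10000     # assumption: we expect the 10 Gb path

def link_speed_mbps(iface=IFACE):
    """Operational link speed in Mb/s, as reported by the kernel."""
    with open(f"/sys/class/net/{iface}/speed") as f:
        return int(f.read().strip())

if __name__ == "__main__":
    speed = link_speed_mbps()
    if speed < MIN_MBPS:
        print(f"WARNING: {IFACE} is at {speed} Mb/s, expected {MIN_MBPS} Mb/s")
        sys.exit(1)
    print(f"OK: {IFACE} at {speed} Mb/s")
    sys.exit(0)
```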

There’s a difference between things running and things running smoothly: making sure that your “redundant” equipment and services are actually redundant is the key to keeping issues from turning into problems.

2 Comments

  1. The key to any kind of redundancy is to use it! Whether it becomes part of your deployment/upgrade procedure, failover is periodically forced, or something else, waiting for the “big event” is asking for trouble.

    • So true! I’ve seen so many people who are afraid to pull a fiber cable out of a live production system. What’s the problem? It’s redundant, right?

      You need to be 100% confident that your HA features work. Otherwise, what’s the point of having them?

