Ask yourself a question: for every piece of resiliency you supposedly have in your network, are you really positive that it’s not running in a degraded state? Really, really sure?
Sometimes, it’s basic: are you being alerted when any disk array attached to any server suffers a disk failure?
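On Linux software RAID, for example, a degraded array is visible in `/proc/mdstat`. A minimal sketch of a check that parses that format (the sample text below stands in for the real file, and the device names are made up):

```python
import re

def degraded_md_arrays(mdstat_text):
    """Return the names of md arrays whose member-status string
    (e.g. [UU_]) shows a missing device."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        m = re.match(r"^(md\d+)\s*:", line)
        if m:
            current = m.group(1)
        # a status like [3/2] [UU_] means one of three members is gone
        if current and re.search(r"\[U*_+U*\]", line):
            degraded.append(current)
    return degraded

sample = """\
md0 : active raid1 sda1[0] sdb1[1]
      976630336 blocks super 1.2 [2/2] [UU]
md1 : active raid5 sdc1[0] sdd1[1]
      1953260672 blocks super 1.2 [3/2] [UU_]
"""
print(degraded_md_arrays(sample))  # the degraded array, md1, surfaces here
```

Run something like this from your monitoring system on a schedule and alert on a non-empty result; hardware RAID controllers need the vendor's CLI instead, but the idea is identical.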
Very often, it’s not: for your SAN-attached systems, are you positive that the multipathing is green? If you’re connected to two storage processors or controllers, can the server see two paths to each of them? Are you getting alerted if you can’t?
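A sketch of what that multipath check might look like against `multipath -ll`-style output. The exact format varies between versions, so treat the parsing, and the expected count of four paths (two HBAs times two storage processors), as assumptions to adapt:

```python
def path_counts(multipath_ll_text):
    """Count 'active ready' paths per multipath device from
    `multipath -ll`-style output. Output format varies by
    release; this is a sketch, not a parser for every version."""
    counts = {}
    current = None
    for line in multipath_ll_text.splitlines():
        # device header lines mention the dm- node and start in column 0
        if "dm-" in line and not line.startswith((" ", "|", "`")):
            current = line.split()[0]
            counts[current] = 0
        elif current and "active ready" in line:
            counts[current] += 1
    return counts

sample = """\
mpatha (360000970000192601234533030303432) dm-2 EMC,SYMMETRIX
|-+- policy='round-robin 0' prio=1 status=active
| `- 1:0:0:1 sdb 8:16 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 2:0:0:1 sdc 8:32 active ready running
"""
EXPECTED = 4  # hypothetical design: two HBAs x two storage processors
for dev, n in path_counts(sample).items():
    if n < EXPECTED:
        print(f"{dev}: only {n} of {EXPECTED} paths up")
```

The point is less the parsing than the comparison: the check only means something if it knows how many paths the design calls for.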
Are your port channels running over the number of links that they’re supposed to? How about the ISLs on your FC fabrics?
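On the server side of this, Linux exposes bond member state in `/proc/net/bonding/<bond>`; a rough check might just count members reporting up and compare against the designed width (the sample text stands in for the real file):

```python
def bond_member_count(bonding_text):
    """Count member interfaces reporting 'MII Status: up' in a
    /proc/net/bonding/<bond> file. Only the per-member sections
    are counted; the bond's own MII Status line is skipped."""
    up = 0
    in_member = False
    for line in bonding_text.splitlines():
        if line.startswith("Slave Interface:"):
            in_member = True
        elif in_member and line.startswith("MII Status:"):
            up += "up" in line
            in_member = False
    return up

sample = """\
MII Status: up
Slave Interface: eth0
MII Status: up
Slave Interface: eth1
MII Status: down
"""
print(bond_member_count(sample))  # one of the two members is up
```

The switch-side port channels and FC ISLs want the same treatment via SNMP or the vendor CLI: alert when the live member count drops below the configured one, not just when the whole channel dies.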
If you have failover clusters where services run on preferred nodes, are you sure they’re actually located where they’re supposed to be? Are you monitoring that services are all running on their preferred nodes?
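The placement check itself is trivial once you can ask the cluster where things are running (via `pcs status` on Pacemaker, or your stack's equivalent); the service and node names below are hypothetical:

```python
def misplaced_services(running_on, preferred_node):
    """Return services not on their preferred node, as
    {service: (actual_node, preferred_node)}. The placement
    data would come from your cluster tooling; names here
    are made up for illustration."""
    return {svc: (node, preferred_node[svc])
            for svc, node in running_on.items()
            if preferred_node.get(svc) and node != preferred_node[svc]}

print(misplaced_services(
    {"sqlA": "node2", "fileshare": "node1"},
    {"sqlA": "node1", "fileshare": "node1"}))
```

A service that failed over last month and was never failed back is exactly the kind of quietly degraded state this post is about: everything is up, but your next failure lands on a node already carrying someone else's load.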
If you have asymmetric fall-back connections, like a gigabit switch uplink used to back up a 10-gigabit switch uplink, are you notified when it’s using the backup connection, or do you rely on your users to tell you that things seem to be running slowly?
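This one is easy to catch if you compare live link speed against the designed speed rather than just checking "link up". On Linux the live figure can be read from `/sys/class/net/<interface>/speed`; the interface names and readings below are hypothetical:

```python
def uplink_alerts(observed, expected):
    """Flag interfaces running below their designed speed (Mb/s),
    e.g. a 10-gigabit uplink that has failed over to its gigabit
    backup. 'observed' would be populated from the live system;
    here it is a made-up reading."""
    return [f"{iface}: {observed[iface]} Mb/s, expected {expected[iface]}"
            for iface in expected
            if observed.get(iface, 0) < expected[iface]]

print(uplink_alerts({"uplink0": 1000, "uplink1": 10000},
                    {"uplink0": 10000, "uplink1": 10000}))
```

The same comparison works at the switch layer via SNMP: a port channel that is "up" at a tenth of its designed bandwidth should page someone before the users do.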
There’s a difference between things running and things running smoothly: making sure that your “redundant” equipment and services are actually redundant is the key to keeping small issues from turning into real problems.