
Update: IBM DS4000/5000 replication on big LUNs works again with hotfix firmware

A couple of weeks ago, I posted about my issues with replication of >2TB LUNs on IBM SANs not working correctly using Enhanced Remote Mirroring. Well, IBM got me to install some hotfix firmware (version 07.60.40.00), and the problem appears to be resolved, though I’m still having issues with Flash Copies of one of the affected mirror LUNs showing up to Windows as an empty, uninitialized disk. I’m getting married in a week and am too busy polishing documentation before I take 2 weeks off to open yet another case with IBM. C’est la vie.

They’re probably going to kill me for calling this “hotfix firmware,” since I was assured this firmware was GA but not uploaded to the website because of some release engineering red tape. (Whatever, guys, I can’t download it without calling you, so it’s a hotfix as far as I’m concerned.)

Anyway, if you’re having this issue or are planning on replicating large LUNs with IBM Enhanced Remote Mirroring, contact your IBM support engineers and request that they send you firmware >=07.60.40.00.

Recovering a deleted logical drive on an IBM Midrange Storage SAN

First, some keyword spam so this turns up to people who need it: this should apply to all IBM Midrange Series Storage SANs including the DS3200, DS3300, DS3400, DS3950, DS4000, DS4100, DS4200, DS4300, DS4400, DS4500, DS4700, DS4800, DS5020, DS5100, and DS5300. (Whew.)

SANs are important, mission-critical pieces of storage hardware, and as we all know, it’s important to manage change in the environment. However, sometimes mistakes happen — sooner or later, someone is going to delete the wrong LUN. IBM doesn’t really make clear how to recover this without technical support involved, and I can understand why — it’s an important thing to get right.

However, especially late at night when IBM’s Remote Support Center runs on a skeleton crew and can take a few hours to turn around a ticket, we can’t always rely on a timely response from IBM support in order to recover the disk. Since this is a largely undocumented procedure, I’m going to put it out there in the hopes that it helps someone else.

When doing advanced work, I tend to work from the command line using the SMcli utility. However, you can also run scripts in the graphical Storage Manager application. The functionality is oddly hidden in the root window of the DS Storage Manager 10 client, on the screen where you choose your SAN to manage. To access it, right-click your SAN and click “Execute Script.” The script editor window will open. (It would make a lot more sense to put this functionality into the Advanced menu of one of the managed SANs.)

The command-line reference guide for IBM Midrange Storage (LSI) SANs makes mention of the recover logicalDrive command:

recover logicalDrive (drive=(enclosureID,drawerID,slotID) |
Drives=(enclosureID1,drawerID1,slotID1 ... enclosureIDn,drawerIDn,slotIDn) |
array=ArrayName)
[newArray=arrayName]
userLabel="logicalDriveName"
capacity=logicalDriveCapacity
offset=offsetValue
raidLevel=(0 | 1 | 3 | 5 | 6)
segmentSize=segmentSizeValue
[owner=(a | b)
cacheReadPrefetch=(TRUE | FALSE)]

However, it doesn’t tell you where to get the LUN sizes, segment sizes, offsets and other numbers that you need for a successful recovery. Luckily, there are a couple of places where you can find them.

If you’ve collected support data recently, you can look inside the support bundle .zip and locate a file called recoveryProfile.csv. If you don’t have a support bundle handy, you might still be in luck — the DS Storage Manager application keeps a copy in its program directory, and you can usually find it at C:\Program Files\IBM_DS\client\data\recovery, ending in _Recovery_Profile.csv and named for the SAN you’re managing. Look at all the lines beginning with Volume, and locate the one that contains the LU name that you’re looking for. It should look like this:

Volume,600A0B80006E09620000BC914BF14835,My_LU,600A0B800047F5F20000BC914BF146C2,512,805306368000,393216000,65536,1,1

As far as I can tell, the fields are:

  • Object type (volume, volume group, etc.)
  • Volume NAA ID
  • Volume name
  • Owning array NAA ID
  • Block size (typically 512; this might be 4096 on SSD or high-capacity disks with 4k blocks, but I have none of these to test with)
  • LUN size in bytes
  • Starting offset; on this LUN the unit appears to be (bytes / 2048) but I can’t figure out why
  • Segment size in bytes
  • Two integers/booleans I haven’t identified
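If you'd rather not count commas by eye at 3 AM, the field list above can be turned into a small parser. This is only a sketch in Python: the `VolumeRecord` class and `parse_volume_line` function are names I made up for illustration, the field meanings are the same guesses as in the list, and it assumes volume names never contain commas.

```python
from dataclasses import dataclass


@dataclass
class VolumeRecord:
    """One 'Volume' line, using the guessed field layout described above."""
    volume_id: str            # volume NAA ID
    name: str                 # volume (LU) name
    array_id: str             # owning array NAA ID
    block_size: int           # typically 512
    capacity_bytes: int       # LUN size in bytes
    offset: int               # starting offset, unit not fully understood
    segment_size_bytes: int   # segment size in bytes
    unknown: tuple            # the two trailing fields, not yet identified


def parse_volume_line(line: str) -> VolumeRecord:
    # Assumption: plain comma-separated values with no quoting or embedded commas.
    fields = line.strip().split(",")
    if fields[0] != "Volume" or len(fields) < 10:
        raise ValueError("not a Volume line in the expected layout")
    return VolumeRecord(
        volume_id=fields[1],
        name=fields[2],
        array_id=fields[3],
        block_size=int(fields[4]),
        capacity_bytes=int(fields[5]),
        offset=int(fields[6]),
        segment_size_bytes=int(fields[7]),
        unknown=tuple(fields[8:10]),
    )


sample = ("Volume,600A0B80006E09620000BC914BF14835,My_LU,"
          "600A0B800047F5F20000BC914BF146C2,512,805306368000,393216000,65536,1,1")
record = parse_volume_line(sample)
print(record.name, record.capacity_bytes, record.offset, record.segment_size_bytes)
```

Keeping the two unidentified fields as raw strings is deliberate: I don't know what they mean, so the sketch doesn't pretend to interpret them.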

You can take this information and feed it right back into that recover logicalDrive command from the guide:

SMcli -n My_SAN -p My_Password -c 'recover logicalDrive array=My_Array userLabel="My_LU" capacity=805306368000 offset=393216000 raidLevel=5 segmentSize=64;'

Note that the segment size needs to be converted from bytes into kilobytes: the 65536 in the recovery profile becomes segmentSize=64 in the command.
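To avoid fat-fingering that conversion, the script string can be generated rather than typed. The following Python is a minimal sketch, not an official tool: `build_recover_command` is a hypothetical helper, and the array name and RAID level have to be supplied by you, since as far as I can tell the Volume line carries only the owning array's NAA ID and neither of those values.

```python
def build_recover_command(array, label, capacity, offset, raid_level,
                          segment_size_bytes):
    """Build the script string passed to SMcli -c (or pasted into the script editor)."""
    # The recovery profile stores the segment size in bytes; the command wants KB.
    if segment_size_bytes % 1024 != 0:
        raise ValueError("segment size is not a whole number of kilobytes")
    segment_kb = segment_size_bytes // 1024
    return (
        f'recover logicalDrive array={array} userLabel="{label}" '
        f"capacity={capacity} offset={offset} "
        f"raidLevel={raid_level} segmentSize={segment_kb};"
    )


# Values taken from the example Volume line; array name and RAID level are supplied by hand.
command = build_recover_command(
    array="My_Array",
    label="My_LU",
    capacity=805306368000,
    offset=393216000,
    raid_level=5,
    segment_size_bytes=65536,
)
print(command)
```

Read the generated string over before running it. The sketch only formats the numbers; it can't tell you whether they're the right ones.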

One thing I haven’t figured out is how to preserve the old NAA ID on the LUNs, if this is at all possible. This generally isn’t important, but notably can cause problems with signaturing in VMware.

Expect a follow-up post on restoring an entire physical array.

Replication of LUNs >2TB on IBM DS4000/DS5000 SANs flat-out doesn’t work

…but it says it does. It even reports that the mirroring completed successfully and that the volume status is “Synchronized” when the remote end in fact contains nothing but garbage data.

This is the result of what was described to me as a regression in a bad firmware release, but it’s unclear to me from my discussions with IBM exactly how far back this issue goes. I’m grateful that we didn’t find this in the middle of a production DR failover, but it’s completely ridiculous that an enterprise storage vendor allows such a serious data loss issue into a real release.

This is supposed to be fixed in a firmware update already GA’d but not on the website yet, but I’m awfully hesitant to actually use these large LUNs until IBM hashes out their support for them a little further. I’m not looking to be burned with the exact same thing with a different premium feature.

Monitoring Windows MPIO through Nagios

Sometimes, we need to do SAN maintenance — firmware upgrades, disruptive fabric changes, and the like. When these situations come up, it’s useful to know if anything is in a condition where it will break if it loses its connection to SAN storage, especially if you’re a lowly storage administrator without admin access to any of the Windows systems connected up to the SAN.

I poked around, and could not find one single utility or tool for monitoring the Windows MPIO framework, so I whipped up a quick script using VBScript and WMI. The script is called like so:

cscript.exe //NoLogo scripts\CheckMpioPaths.vbs /paths 4

(4 paths are used because the server is multipathed on two fabrics, and each of the active/passive controllers is also on each fabric — the server should see 2 controllers on 2 fabrics each, for 4 paths.)

This will cause the script to issue a Nagios CRITICAL if any multipath-registered LUN shows fewer than the given number of paths.
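The decision logic is simple enough to sketch outside of VBScript. The Python below is not the actual script, just an illustration of the same rule using the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The `evaluate_paths` function and the disk names are hypothetical, and returning UNKNOWN when no multipath LUNs are found is my own assumption about sensible behavior.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_paths(path_counts, required):
    """Map {LUN name: visible path count} to a Nagios status code and message."""
    if not path_counts:
        # Assumption: nothing to check is reported as UNKNOWN rather than OK.
        return UNKNOWN, "UNKNOWN: no multipath LUNs found"
    degraded = {lun: n for lun, n in path_counts.items() if n < required}
    if degraded:
        detail = ", ".join(f"{lun} has {n}" for lun, n in sorted(degraded.items()))
        return CRITICAL, f"CRITICAL: fewer than {required} paths: {detail}"
    return OK, f"OK: all {len(path_counts)} LUNs have at least {required} paths"


# Made-up sample data: one healthy LUN, one that has lost a fabric.
status, message = evaluate_paths({"Disk 2": 4, "Disk 3": 2}, required=4)
print(status, message)
```

In a real plugin the status code would be used as the process exit code, which is how Nagios decides what state to show.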

As usual, you can find the script in the GitHub repository for CheckMpioPaths.

Charting performance data for IBM Midrange Storage Series SANs with PNP4Nagios

If you’ve used IBM SAN products, particularly the DS4000, DS5000 and DS6000 series (which are rebranded LSI), one of the most obnoxious things about them is how you’re pretty much forced to roll your own monitoring tools. Compared to many mainstream vendors (and Sun/Oracle in particular), IBM’s performance monitoring and modelling tools have been lackluster at best and entirely absent at worst. The best tool you’ve got is the SMcli, which doesn’t supply a ton of good information, but at least provides you with a starting point for capacity planning.

I had originally wanted to make something like this for Cacti, which probably has a much broader install base than the pnp4nagios addon, but the Nagios way was just so easy, and I’d like to share it with anyone who doesn’t want to roll their own basic performance aggregator for it.

This tool gets the following statistics:

  • IOPS
  • Throughput
  • Read percentage
  • Cache hit percentage

It gets statistics at the following levels:

  • Logical Unit
  • Physical Array
  • Controller
  • Unit

It’s a little quick-and-dirty, but it works:

check_smcli_io

Like my other projects, it’s hosted on GitHub, so check out the GitHub project for check_smcli_io.
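For anyone wondering how numbers like these end up as graphs: PNP4Nagios charts whatever a plugin prints after the pipe character in its output, in the standard Nagios performance data format of 'label'=value[UOM]. The Python below is a rough sketch of that formatting only. It is not code from check_smcli_io, and the metric labels and sample values are invented for illustration.

```python
def format_perfdata(metrics):
    """Render {label: (value, unit)} as Nagios performance data."""
    # Labels are single-quoted so they stay valid even if they contain spaces.
    return " ".join(
        f"'{label}'={value}{unit}" for label, (value, unit) in metrics.items()
    )


def plugin_output(summary, metrics):
    # Everything after the pipe is treated as performance data by Nagios.
    return f"{summary} | {format_perfdata(metrics)}"


# Invented sample values for one logical unit.
metrics = {
    "iops": (1250, ""),
    "throughput_mb_s": (48.5, ""),   # unit kept in the label; no standard UOM fits MB/s
    "read_pct": (70, "%"),
    "cache_hit_pct": (35, "%"),
}
line = plugin_output("OK: My_LU", metrics)
print(line)
```

Putting the throughput unit in the label instead of the UOM field is a judgment call: the format has a unit for bytes but, as far as I know, nothing for a per-second rate.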