<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>holyhandgrenade.org &#187; Jeff</title>
	<atom:link href="http://holyhandgrenade.org/blog/author/admin/feed/" rel="self" type="application/rss+xml" />
	<link>http://holyhandgrenade.org/blog</link>
	<description>System administration from the trenches.</description>
	<lastBuildDate>Wed, 28 Jul 2010 05:31:39 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Update: IBM DS4000/5000 replication on big LUNs works again with hotfix firmware</title>
		<link>http://holyhandgrenade.org/blog/2010/07/update-ibm-ds40005000-replication-on-big-luns-works-again-with-hotfix-firmware/</link>
		<comments>http://holyhandgrenade.org/blog/2010/07/update-ibm-ds40005000-replication-on-big-luns-works-again-with-hotfix-firmware/#comments</comments>
		<pubDate>Wed, 28 Jul 2010 05:21:40 +0000</pubDate>
		<dc:creator>Jeff</dc:creator>
				<category><![CDATA[Sysadmin]]></category>
		<category><![CDATA[ibm]]></category>
		<category><![CDATA[san]]></category>

		<guid isPermaLink="false">http://holyhandgrenade.org/blog/?p=684</guid>
		<description><![CDATA[A couple of weeks ago, I posted about my issues with replication of &#62;2TB LUNs on IBM SANs not working correctly using Enhanced Remote Mirroring. Well, IBM got me to install some hotfix firmware (version 07.60.40.00), and the problem appears to be resolved, though I&#8217;m still having issues with Flash Copies of one of the [...]]]></description>
			<content:encoded><![CDATA[<p>A couple of weeks ago, I posted about my issues with <a href="http://holyhandgrenade.org/blog/2010/07/recovering-a-deleted-logical-drive-on-an-ibm-midrange-storage-san/">replication of &gt;2TB LUNs on IBM SANs not working correctly</a> using Enhanced Remote Mirroring. Well, IBM got me to install some hotfix firmware (<a href="http://download2.boulder.ibm.com/sar/CMA/SDA/00yjs/0/ibm_fw_ds4kfc_07604000_anyos_anycpu.chg">version 07.60.40.00</a>), and the problem appears to be resolved, though I&#8217;m still having issues with Flash Copies of one of the affected mirror LUNs showing up to Windows as an empty, uninitialized disk. I&#8217;m getting married in a week and am too busy polishing documentation before I take 2 weeks off to open yet another case with IBM. C&#8217;est la vie.</p>
<p>They&#8217;re probably going to kill me for calling this &#8220;hotfix firmware,&#8221; since I was assured this firmware was GA but not uploaded to the website because of some release engineering red tape. (Whatever, guys, I can&#8217;t download it without calling you, so it&#8217;s a hotfix as far as I&#8217;m concerned.)</p>
<p>Anyway, if you&#8217;re having this issue or are planning on replicating large LUNs with IBM Enhanced Remote Mirroring, contact your IBM support engineers and request that they send you firmware &gt;=07.60.40.00.</p>
]]></content:encoded>
			<wfw:commentRss>http://holyhandgrenade.org/blog/2010/07/update-ibm-ds40005000-replication-on-big-luns-works-again-with-hotfix-firmware/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Recovering a deleted logical drive on an IBM Midrange Storage SAN</title>
		<link>http://holyhandgrenade.org/blog/2010/07/recovering-a-deleted-logical-drive-on-an-ibm-midrange-storage-san/</link>
		<comments>http://holyhandgrenade.org/blog/2010/07/recovering-a-deleted-logical-drive-on-an-ibm-midrange-storage-san/#comments</comments>
		<pubDate>Mon, 26 Jul 2010 16:17:53 +0000</pubDate>
		<dc:creator>Jeff</dc:creator>
				<category><![CDATA[Sysadmin]]></category>
		<category><![CDATA[ibm]]></category>
		<category><![CDATA[san]]></category>

		<guid isPermaLink="false">http://holyhandgrenade.org/blog/?p=655</guid>
		<description><![CDATA[First, some keyword spam so this turns up to people who need it: this should apply to all IBM Midrange Series Storage SANs including the DS3200, DS3300, DS3400, DS3950, DS4000, DS4100, DS4200, DS4300, DS4400, DS4500, DS4700, DS4800, DS5020, DS5100, and DS5300. (Whew.) SANs are important, mission-critical pieces of storage hardware, and as we all know, [...]]]></description>
			<content:encoded><![CDATA[<p>First, some keyword spam so this turns up to people who need it: this should apply to all IBM Midrange Series Storage SANs including the DS3200, DS3300, DS3400, DS3950, DS4000, DS4100, DS4200, DS4300, DS4400, DS4500, DS4700, DS4800, DS5020, DS5100, and DS5300. (Whew.)</p>
<p>SANs are important, mission-critical pieces of storage hardware, and as we all know, it&#8217;s important to manage change in the environment. However, sometimes mistakes happen &#8212; sooner or later, someone is going to delete the wrong LUN. IBM doesn&#8217;t really make clear how to recover this without technical support involved, and I can understand why &#8212; it&#8217;s an important thing to get right.</p>
<p>However, especially late at night when IBM&#8217;s Remote Support Center runs on a skeleton crew and can take a few hours to turn around a ticket, we can&#8217;t always rely on a timely response from IBM support in order to recover the disk. Since this is a largely undocumented procedure, I&#8217;m going to put it out there in the hopes that it helps someone else.</p>
<p>When doing advanced work, I tend to work from the command line using the SMcli utility. However, you can also run scripts in the graphical Storage Manager application. The functionality is oddly hidden in the root window of the DS Storage Manager 10 client, on the screen where you choose your SAN to manage. To access it, right-click your SAN and click &#8220;Execute Script.&#8221; The script editor window will open. (It would make a lot more sense to put this functionality into the Advanced menu of one of the managed SANs.)</p>
<p>The <a href="ftp://ftp.software.ibm.com/systems/support/system_x_pdf/59y7286.pdf">command-line reference guide</a> for IBM Midrange Storage (LSI) SANs makes mention of the <strong>recover logicalDrive</strong> command:</p>
<pre>recover logicalDrive (drive=(<em>enclosureID,drawerID,slotID</em>) |
drives=(<em>enclosureID1,drawerID1,slotID1 ... enclosureIDn,drawerIDn,slotIDn</em>) |
array=<em>ArrayName</em>)
[newArray=<em>arrayName</em>]
userLabel="<em>logicalDriveName</em>"
capacity=<em>logicalDriveCapacity</em>
offset=<em>offsetValue</em>
raidLevel=(0 | 1 | 3 | 5 | 6)
segmentSize=<em>segmentSizeValue</em>
[owner=(a | b)
cacheReadPrefetch=(TRUE | FALSE)]</pre>
<p>However, it doesn&#8217;t tell you where to get the LUN sizes, segment sizes, offsets and other numbers that you need for a successful recovery. Luckily, there are a couple of places where you can turn them up.</p>
<p>If you&#8217;ve collected support data recently, you can look inside the support bundle .zip and locate a file called <strong>recoveryProfile.csv</strong>. If you don&#8217;t have a support bundle handy, you might still be in luck &#8212; the DS Storage Manager application keeps a copy in its program directory, and you can usually find it at <strong>C:\Program Files\IBM_DS\client\data\recovery</strong>, ending in <strong>_Recovery_Profile.csv</strong> and named for the SAN you&#8217;re managing. Look at all the lines beginning with <strong>Volume</strong>, and locate the one that contains the LU name that you&#8217;re looking for. It should look like this:</p>
<pre>Volume,600A0B80006E09620000BC914BF14835,My_LU,600A0B800047F5F20000BC914BF146C2,512,805306368000,393216000,65536,1,1</pre>
<p>As far as I can tell, the fields are:</p>
<ul>
<li>Object type (volume, volume group, etc.)</li>
<li>Volume NAA ID</li>
<li>Volume name</li>
<li>Owning array NAA ID</li>
<li>Block size (typically 512; this might be 4096 on SSD or high-capacity disks with 4k blocks, but I have none of these to test with)</li>
<li>LUN size in bytes</li>
<li>Starting offset; on this LUN the unit appears to be (bytes / 2048) but I can&#8217;t figure out why</li>
<li>Segment size in bytes</li>
<li>Two integers/booleans I haven&#8217;t identified</li>
</ul>
<p>You can take this information and feed it right back into that <strong>recover logicalDrive</strong> command from the guide:</p>
<pre>SMcli -n My_SAN -p My_Password -c 'recover logicalDrive array=My_Array userLabel="My_LU" capacity=805306368000 offset=393216000 raidLevel=5 segmentSize=64;'</pre>
<p>Note that the segment size needs to be converted from bytes into kilobytes.</p>
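<p>The lookup-and-convert step above can be sketched in Python. This is only an illustration under stated assumptions: the field positions follow the guessed layout listed above, the function name <strong>build_recover_command</strong> is made up for this example, and the array name and RAID level have to be supplied by hand because they are not read from the Volume row here.</p>

```python
import csv


def build_recover_command(profile_line, array_name, raid_level):
    """Build a 'recover logicalDrive' script command from one Volume row.

    Field positions are assumptions taken from the guessed layout above.
    """
    # The CSV may start with a byte-order mark; drop it before parsing.
    row = next(csv.reader([profile_line.lstrip("\ufeff")]))
    if row[0] != "Volume":
        raise ValueError("not a Volume row")
    name = row[2]
    capacity_bytes = int(row[5])
    offset = int(row[6])
    # The profile stores the segment size in bytes; the command wants KB.
    segment_kb = int(row[7]) // 1024
    return (
        f'recover logicalDrive array={array_name} userLabel="{name}" '
        f"capacity={capacity_bytes} offset={offset} "
        f"raidLevel={raid_level} segmentSize={segment_kb};"
    )


line = (
    "Volume,600A0B80006E09620000BC914BF14835,My_LU,"
    "600A0B800047F5F20000BC914BF146C2,512,805306368000,393216000,65536,1,1"
)
print(build_recover_command(line, "My_Array", 5))
```

<p>With the sample row, this prints the same script command that is passed to SMcli above. Check the result against your own recovery profile before running it on a real SAN.</p>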
<p>One thing I haven&#8217;t figured out is how to preserve the old NAA ID on the LUNs, if this is at all possible. This generally isn&#8217;t important, but notably can cause problems with <a href="http://holyhandgrenade.org/blog/2010/07/practical-vmfs-signatures/">signaturing in VMware</a>.</p>
<p>Expect a follow-up post on restoring an entire physical array.</p>
]]></content:encoded>
			<wfw:commentRss>http://holyhandgrenade.org/blog/2010/07/recovering-a-deleted-logical-drive-on-an-ibm-midrange-storage-san/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Replication of LUNs &gt;2TB on IBM DS4000/DS5000 SANs flat-out doesn&#8217;t work</title>
		<link>http://holyhandgrenade.org/blog/2010/07/replication-of-luns-2tb-on-ibm-ds4000ds5000-sans-flat-out-doesnt-work/</link>
		<comments>http://holyhandgrenade.org/blog/2010/07/replication-of-luns-2tb-on-ibm-ds4000ds5000-sans-flat-out-doesnt-work/#comments</comments>
		<pubDate>Thu, 15 Jul 2010 16:40:51 +0000</pubDate>
		<dc:creator>Jeff</dc:creator>
				<category><![CDATA[Sysadmin]]></category>
		<category><![CDATA[ibm]]></category>
		<category><![CDATA[san]]></category>

		<guid isPermaLink="false">http://holyhandgrenade.org/blog/?p=645</guid>
		<description><![CDATA[&#8230;but it says it does. It even reports that the mirroring completed successfully and that the volume status is &#8220;Synchronized&#8221; when the remote end in fact contains nothing but garbage data. This is the result of what was described to me as a regression in a bad firmware release, but it&#8217;s unclear to me from [...]]]></description>
			<content:encoded><![CDATA[<p>&#8230;but it says it does. It even reports that the mirroring completed successfully and that the volume status is &#8220;Synchronized&#8221; when the remote end in fact contains nothing but garbage data.</p>
<p>This is the result of what was described to me as a regression in a bad firmware release, but it&#8217;s unclear to me from my discussions with IBM exactly how far back this issue goes. I&#8217;m grateful that we didn&#8217;t find this in the middle of a production DR failover, but it&#8217;s completely ridiculous that an enterprise storage vendor allows such a serious data loss issue into a real release.</p>
<p>This is supposed to be fixed in a firmware update already GA&#8217;d but not on the website yet, but I&#8217;m awfully hesitant to actually use these large LUNs until IBM hashes out their support for them a little further. I&#8217;m not looking to be burned with the exact same thing with a different premium feature.</p>
]]></content:encoded>
			<wfw:commentRss>http://holyhandgrenade.org/blog/2010/07/replication-of-luns-2tb-on-ibm-ds4000ds5000-sans-flat-out-doesnt-work/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Practical VMFS signatures</title>
		<link>http://holyhandgrenade.org/blog/2010/07/practical-vmfs-signatures/</link>
		<comments>http://holyhandgrenade.org/blog/2010/07/practical-vmfs-signatures/#comments</comments>
		<pubDate>Thu, 15 Jul 2010 16:31:57 +0000</pubDate>
		<dc:creator>Jeff</dc:creator>
				<category><![CDATA[Sysadmin]]></category>
		<category><![CDATA[vmware]]></category>

		<guid isPermaLink="false">http://holyhandgrenade.org/blog/?p=635</guid>
		<description><![CDATA[VMFS is a pretty cool (if relatively undocumented) filesystem, and VMFS volumes were designed with one particular quirk that&#8217;s both a blessing and a curse &#8212; when you create the volume, ESX writes a few pieces of information to the disk signature on the volume that helps it identify the volume and figure out what [...]]]></description>
			<content:encoded><![CDATA[<p>VMFS is a pretty cool (if relatively undocumented) filesystem, and VMFS volumes were designed with one particular quirk that&#8217;s both a blessing and a curse &#8212; when you create the volume, ESX writes a few pieces of information to the disk signature on the volume that helps it identify the volume and figure out what to do with it. Each volume contains a UUID used to uniquely identify it when multiple volumes with the same name are presented, but each volume also registers one of three unique identifiers to the volume &#8212; the Network Addressing Authority (NAA) ID, an Extended Unique Identifier (EUI), or, on storage subsystems that don&#8217;t support either of the other two identifiers, a LUN number. If you&#8217;re interested in the low-level nitty-gritty of it, Ubiquitous Talk published a <a href="http://blog.laspina.ca/ubiquitous/understanding-vmfs-volumes">really great blog entry on VMFS on-disk signatures</a> that you should read.</p>
<p>This decision was made because most people run VMware off of Fibre Channel or iSCSI SAN, where you may do something like taking a copy-on-write snapshot of a VMFS volume and presenting it back to the host as a read-only volume. ESX compares the identifier presented by the LUN to the one written to the disk signature, the one that it expects to see. If they mismatch, it&#8217;s assumed that it&#8217;s not the original LUN and that it&#8217;s a copy. Sometimes this isn&#8217;t the case, and your underlying storage has actually changed, either because you made a copy to another LUN or because you&#8217;re trying to mount a replicated copy on another SAN at your disaster recovery site. In these cases, to mount the volume as a normal writable volume, you need to resignature it, which re-writes all of the above information to the disk. Since this information includes the UUID, and ESX uses that UUID to reference virtual machines in its inventory, you&#8217;ll need to manually re-add all of the virtual machines back to your cluster. This is one of those annoying things that Site Recovery Manager was designed to automate (<a href="http://searchvmware.techtarget.com/tip/0,289483,sid179_gci1506442,00.html">see TechTarget article</a>).</p>
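<p>The comparison and the resignature step can be modeled roughly like this. It is a toy sketch, not how ESX actually stores or checks signatures: the <strong>Signature</strong> class and both function names are invented for the example, and <strong>uuid.uuid4()</strong> only stands in for whatever ESX really generates.</p>

```python
import uuid
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Signature:
    """Toy stand-in for the identifying data written to a VMFS volume."""
    volume_uuid: str
    device_id: str  # NAA ID, EUI, or LUN number recorded at creation time


def looks_like_snapshot(sig, presented_device_id):
    # A mismatch between the recorded ID and the ID the LUN presents
    # is treated as "this is a copy, not the original".
    return sig.device_id != presented_device_id


def resignature(sig, presented_device_id):
    # Resignaturing rewrites the recorded ID and also issues a new UUID,
    # which is why VMs referenced by the old UUID must be re-added.
    return replace(
        sig,
        volume_uuid=str(uuid.uuid4()),
        device_id=presented_device_id,
    )


original = Signature("old-uuid", "naa.600a0b800047f5f20000bc934bf1480e")
replica_id = "naa.600a0b80006e09620000bc914bf14835"

print(looks_like_snapshot(original, replica_id))
fixed = resignature(original, replica_id)
print(looks_like_snapshot(fixed, replica_id), fixed.volume_uuid != original.volume_uuid)
```

<p>The second line of output is the important one: after resignaturing the volume no longer looks like a snapshot, but its UUID has changed.</p>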
<p>ESX 3.5 used to automatically mount and present the volume as a read-only snapshot with a new unique name. If you wanted to resignature the volume, you would set the advanced setting LVM.EnableResignature to 1, and you would rescan for volumes. This had the unfortunate consequence of re-signaturing <em>all</em> volumes, even if you only intended to resignature one of them. A new esxcfg-volume command was added to perform this operation, and VMware changed the default behavior in the GUI so that if a volume is detected as a snapshot, you have to manually add the storage, at which time you&#8217;ll be prompted about whether you want to mount the volume as a snapshot or if you want to resignature it.</p>
<h2>Problem 1: Cloned boot LUNs don&#8217;t boot</h2>
<p>As of ESX 4, the service console resides in a VMDK on a VMFS volume, so you can run into major issues if you boot from SAN and fail your boot LUNs over to your DR site, because the filesystem used to store your service console is subject to the exact same signaturing issues as other VMFS volumes. The boot LUNs on each server need to be manually resignatured, which is <a href="http://getvirtical.blogspot.com/2009_11_01_archive.html">covered by Get VIRTical</a> and <a href="http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&amp;cmd=displayKC&amp;externalId=1012874">VMware KB 1012874</a>.</p>
<h2>Problem 2: Non-contiguous extents don&#8217;t resignature (right now)</h2>
<p>A couple of weeks ago, I ran into a rather significant and nasty regression in vSphere 4. I had taken a 500GB VMFS volume, added a second 500GB extent, and then I had grown each of the two LUNs by 250GB. As a result, the first LUN occupied 0-500GB and 1000-1250GB in the VMFS volume, while the second LUN occupied 500-1000GB and 1250-1500GB. Notice that the second disk&#8217;s start is before the first disk&#8217;s end (and this should be fine in any reasonable logical volume manager).</p>
<p>This worked fine for months. When I failed over to our DR site, the volume was detected as a snapshot and I couldn&#8217;t resignature because esxcfg-volume thought the LUNs overlapped:</p>
<p><code><br />
[root@esx01 ~]# esxcfg-volume -l<br />
VMFS3 UUID/label: 4bc639b4-21bbc059-d77b-e41f132c2a8a/shared-esxdev<br />
Can mount: No (duplicate extents found)<br />
Can resignature: No (duplicate extents found)<br />
Extent name: naa.600a0b800047f5f20000bc934bf1480e:1     range: 0 - 1279487 (MB)<br />
Extent name: naa.600a0b80006e09620000bc914bf14835:1     range: 511744 - 1535487 (MB)<br />
</code></p>
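<p>To see why the output complains, it helps to compare the individual segments with the single start-to-end range reported per extent. The sketch below uses the GB figures from the layout described above. It is only a guess at the kind of check that goes wrong, not VMware&#8217;s actual logic, and all of the names in it are made up.</p>

```python
from itertools import product

# (start, end) segments each LUN occupies in the VMFS volume, in GB.
lun_segments = {
    "lun1": [(0, 500), (1000, 1250)],
    "lun2": [(500, 1000), (1250, 1500)],
}


def bounding_range(segments):
    """Collapse a LUN's segments into one start-to-end range."""
    return (min(s for s, _ in segments), max(e for _, e in segments))


def ranges_overlap(a, b):
    # Half-open ranges overlap when each one ends after the other starts.
    return b[1] > a[0] and a[1] > b[0]


def any_segment_overlap(first, second):
    return any(ranges_overlap(a, b) for a, b in product(first, second))


lun1, lun2 = lun_segments["lun1"], lun_segments["lun2"]
print("per-LUN ranges overlap:", ranges_overlap(bounding_range(lun1), bounding_range(lun2)))
print("real segments overlap:", any_segment_overlap(lun1, lun2))
```

<p>The per-LUN ranges overlap even though no two segments do, which matches the shape of the ranges in the esxcfg-volume output: one range per extent name, with the second starting before the first ends.</p>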
<p>After going back and forth with VMware for a very long time on this issue, they finally determined it to be a bug in 4.0 that prevents 4.0 from resignaturing the volume. Don&#8217;t extend any volumes defined as extents in a VMFS filesystem (VMware&#8217;s recommendation is to not use extents at all unless you absolutely need them to extend a VMFS volume beyond 2TB). If this issue bites you, you can get around it by presenting the volume to an ESX 3.5 host, setting LVM.EnableResignature to 1, rescanning/resignaturing, and then presenting the LUNs back to an ESX 4 host. This should hopefully be fixed by 4.1 U1.</p>
<p>As a final aside: it looks like the open-source VMFS driver has similar problems with non-contiguous extents (it throws back garbage data). I haven&#8217;t reported that as a bug yet.</p>
]]></content:encoded>
			<wfw:commentRss>http://holyhandgrenade.org/blog/2010/07/practical-vmfs-signatures/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Sometimes the cloud cuts costs even if you don&#8217;t use it</title>
		<link>http://holyhandgrenade.org/blog/2010/06/sometimes-the-cloud-cuts-costs-even-if-you-dont-use-it/</link>
		<comments>http://holyhandgrenade.org/blog/2010/06/sometimes-the-cloud-cuts-costs-even-if-you-dont-use-it/#comments</comments>
		<pubDate>Tue, 15 Jun 2010 20:46:46 +0000</pubDate>
		<dc:creator>Jeff</dc:creator>
				<category><![CDATA[Sysadmin]]></category>
		<category><![CDATA[cloud]]></category>
		<category><![CDATA[exchange]]></category>

		<guid isPermaLink="false">http://holyhandgrenade.org/blog/?p=581</guid>
		<description><![CDATA[I was having a discussion with one of our Windows administrators a few weeks ago about Exchange 2010, which makes some pretty substantial departures from how Exchange did things in the past. Since I&#8217;m mostly a Linux and VMware guy I don&#8217;t want to get too much into the product itself (I&#8217;m sure the MS [...]]]></description>
			<content:encoded><![CDATA[<p>I was having a discussion with one of our Windows administrators a few weeks ago about Exchange 2010, which makes some pretty substantial departures from how Exchange did things in the past. Since I&#8217;m mostly a Linux and VMware guy I don&#8217;t want to get too much into the product itself (I&#8217;m sure the <a href="http://msexchangeteam.com/">MS Exchange Team</a> has plenty of that), but the biggest change is that instead of four clustering modes, Exchange 2010 has one, and it doesn&#8217;t require any shared storage. The gist of it is that databases are organized into Database Availability Groups, and the databases that form them are replicated around the DAG in an arrangement rather like a big distributed RAID-10 array, with the difference that databases and not logical blocks are being striped across the replication group. When a server goes down, each database pops back up on another node in the cluster. Because of some back-end database improvements (namely, the elimination of single-instance storage, which stores only a single copy of a message or attachment sent to multiple users in the same database), Exchange 2010 cuts down random disk I/O by a huge amount, making it much simpler to run on commodity direct-attached disk with little to no penalty. Combine this with the removal of the shared storage requirement, and you no longer need a SAN to run clustered Exchange.</p>
<p>(Before any Exchange people chime in: yes, I&#8217;m aware that continuous copy replication/log shipping has been available since Exchange 2007. It just wasn&#8217;t viable for larger environments because you couldn&#8217;t easily distribute where the databases got replicated, meaning you either ran a replication slave for each Exchange server or seriously overspecified/overcommitted your hardware.)</p>
<p>Microsoft minces words pretty frequently to save face with customers (as most corporations do), and they&#8217;re still pushing the opinion that high-end SAN storage is a good idea so as not to rock the boat and upset anyone who already shelled out for high-end SAN hardware to run Exchange. However, the truth of it as far as I can surmise is that Microsoft specifically redesigned Exchange to work on commodity hardware in order to cut operating expenses on their own hosted Exchange offering.</p>
<p>Many applications are seeing a major paradigm shift towards distributed processing like Hadoop and schemaless NoSQL distributed data stores like MongoDB and HBase, and proprietary software vendors are starting to take notice and move towards better use of commodity hardware. When there&#8217;s a lot of engineering effort involved, though, sometimes the best incentive for a company to improve the efficiency of their products is to try to make money on it themselves, and the results can benefit everybody.</p>
]]></content:encoded>
			<wfw:commentRss>http://holyhandgrenade.org/blog/2010/06/sometimes-the-cloud-cuts-costs-even-if-you-dont-use-it/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Show me your tests</title>
		<link>http://holyhandgrenade.org/blog/2010/06/show-me-your-tests/</link>
		<comments>http://holyhandgrenade.org/blog/2010/06/show-me-your-tests/#comments</comments>
		<pubDate>Mon, 07 Jun 2010 16:57:24 +0000</pubDate>
		<dc:creator>Jeff</dc:creator>
				<category><![CDATA[Sysadmin]]></category>
		<category><![CDATA[releng]]></category>
		<category><![CDATA[testing]]></category>

		<guid isPermaLink="false">http://holyhandgrenade.org/blog/?p=617</guid>
		<description><![CDATA[(I just posted this as a comment on one of Chris Siebenmann&#8217;s posts, but it was long enough where I felt it warranted reposting here.) &#8211; For a sysadmin, testing software is really, really hard. We&#8217;re constantly stuck in a cycle where we either patch a piece of software and introduce unwanted bugs and regressions, [...]]]></description>
			<content:encoded><![CDATA[<p>(I just posted this as a comment on <a href="http://utcc.utoronto.ca/~cks/space/blog/sysadmin/SysadminTestingProblem">one of Chris Siebenmann&#8217;s posts</a>, but it was long enough where I felt it warranted reposting here.)</p>
<p>&#8211;</p>
<p>For a sysadmin, testing software is really, really hard. We&#8217;re constantly stuck in a cycle where we either patch a piece of software and introduce unwanted bugs and regressions, or we leave it unpatched and often vulnerable, worse-performing and missing important new features. There are many tricks we&#8217;ve learned over the years to make the process easier, but it&#8217;s still a fundamentally difficult activity.</p>
<p>Since I came to system administration from the development side of the fence, I&#8217;ve always had a keen fascination with the similarities and differences with the way that software developers and system administrators (and the project managers who herd them) go about their jobs. In particular, I find it amazing that neither role generally has a good grasp of how the other functions, and how they can better work together.</p>
<p>I think that an interesting part of this problem is that software developers have a much easier time of testing things than system administrators do. For everyone to understand my viewpoint, I need to qualify it by saying that when a system administrator needs to test a new release of software before deploying it to a production system, it&#8217;s generally not to make sure that any new features introduced in the software are bug-free, because it&#8217;s simple enough to just document the problems and not use those features until they&#8217;ve stabilized. Rather, the issue is that we need to identify regressions in pieces of code that used to work fine and are now broken.</p>
<p>In the software industry, this is what unit testing is for. Unit testing allows developers to provide a comprehensive set of test cases for a particular function, and make sure that the method works properly for all of them and returns the expected result. Many agile developers believe in writing tests first, then code, and aiming for 100% test coverage to minimize unintended regressions from rapid code changes.</p>
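<p>For anyone who has not seen one, a unit test looks something like the following. This is a made-up example using Python&#8217;s standard <strong>unittest</strong> module: the <strong>compare_versions</strong> function and its test cases are hypothetical and are not taken from any real project.</p>

```python
import unittest


def compare_versions(a, b):
    """Return -1, 0, or 1 when comparing dotted version strings numerically."""
    pa = [int(part) for part in a.split(".")]
    pb = [int(part) for part in b.split(".")]
    return (pa > pb) - (pb > pa)


class CompareVersionsTest(unittest.TestCase):
    def test_equal(self):
        self.assertEqual(compare_versions("1.2.3", "1.2.3"), 0)

    def test_leading_zeros_are_ignored(self):
        self.assertEqual(compare_versions("07.60", "7.60"), 0)

    def test_numeric_not_lexical(self):
        # A plain string comparison would wrongly put "1.10" before "1.9".
        self.assertEqual(compare_versions("1.10", "1.9"), 1)

    def test_older(self):
        self.assertEqual(compare_versions("1.2.3", "1.3.0"), -1)
```

<p>Saved to a file, the tests run with <strong>python -m unittest</strong>. If a later change breaks the numeric comparison, the third test fails, and that is exactly the kind of regression signal a sysadmin never gets to see.</p>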
<p>I&#8217;m not recommending that system administrators should automate testing other people&#8217;s software, because there&#8217;s no standardized model for business requirements. However, I do think that a little transparency into the development model of our upstream developers would help us to figure out where testing is and isn&#8217;t necessary.</p>
<p>While it&#8217;s not adopted across all of the software industry, unit testing is very popular in many rapid development scenarios, and has become more or less institutionalized in certain developer communities like CPAN. If you&#8217;re a developer, or at least, if you develop software without gluing together huge numbers of third-party libraries, it&#8217;s pretty simple for you to gauge regressions in your own software, because you know (or can easily find out) what the test coverage is for your own project. If you have really thorough unit test coverage, and your test cases are properly written, you shouldn&#8217;t have any function/method-level regressions slipping into production code when there&#8217;s an update. This doesn&#8217;t give the developers a ton of insight into the complex problems, like integration-level or system-level issues, but at least it provides a basic understanding that no minor and insidious issues are creeping up the chain and causing undetected problems.</p>
<p>The problem with unit testing is that the developers run the tests, and they run them on their own systems. This methodology can lead to some really bothersome problems for other people.</p>
<p>When you&#8217;re a system administrator, and especially if you&#8217;re a system administrator who deals with a lot of proprietary, closed-source software, it becomes very difficult to understand the development methodologies of every single piece of software you plan to update. There&#8217;s a certain amount of trust that goes into your Linux vendor&#8217;s ability to not break things like glibc that aren&#8217;t easily tested. I think the ability to trust a vendor&#8217;s stability track record is a wonderful thing, but it&#8217;s something that shouldn&#8217;t be necessary for system administrators. We should be able to validate the correctness of code on our systems, with our configurations, without fighting the developers for the right to do it.</p>
<p>There&#8217;s a constant impedance mismatch and a constant communication gap between developers and sysadmins that needs to be bridged. Software developers need to understand that most sysadmins aren&#8217;t developers, and we need an easy way to perform basic correctness validation on the software we install, especially if we install it from the distribution&#8217;s or developer&#8217;s packages and aren&#8217;t running a &#8220;make test&#8221; or similar during the install process. We need to understand what&#8217;s being tested, we need to understand the significance of the test coverage, and we need to be able to figure out what does and doesn&#8217;t warrant further testing. As it stands, all the validation that developers are (or aren&#8217;t) doing is lost on us, because we don&#8217;t get a warm-and-fuzzy from tests that someone else is running that we&#8217;ll probably never get to see.</p>
]]></content:encoded>
			<wfw:commentRss>http://holyhandgrenade.org/blog/2010/06/show-me-your-tests/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Monitoring Windows MPIO through Nagios</title>
		<link>http://holyhandgrenade.org/blog/2010/05/monitoring-windows-mpio-through-nagios/</link>
		<comments>http://holyhandgrenade.org/blog/2010/05/monitoring-windows-mpio-through-nagios/#comments</comments>
		<pubDate>Sun, 30 May 2010 18:08:44 +0000</pubDate>
		<dc:creator>Jeff</dc:creator>
				<category><![CDATA[Sysadmin]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[nagios]]></category>
		<category><![CDATA[san]]></category>
		<category><![CDATA[windows]]></category>

		<guid isPermaLink="false">http://holyhandgrenade.org/blog/?p=615</guid>
		<description><![CDATA[Sometimes, we need to do SAN maintenance &#8212; firmware upgrades, disruptive fabric changes, and the like. When these situations come up, it&#8217;s useful to know if anything is in a condition where it will break if it loses its connection to SAN storage, especially if you&#8217;re a lowly storage administrator without admin access to any [...]]]></description>
			<content:encoded><![CDATA[<p>Sometimes, we need to do SAN maintenance &#8212; firmware upgrades, disruptive fabric changes, and the like. When these situations come up, it&#8217;s useful to know if anything is in a condition where it will break if it loses its connection to SAN storage, especially if you&#8217;re a lowly storage administrator without admin access to any of the Windows systems connected up to the SAN.</p>
<p>I poked around, and could not find one single utility or tool for monitoring the Windows MPIO framework, so I whipped up a quick script using VBScript and WMI. The script is called like so:</p>
<p style="padding-left: 30px;">cscript.exe //NoLogo scripts\CheckMpioPaths.vbs /paths 4</p>
<p>(4 paths are used because the server is multipathed on two fabrics, and each of the active/passive controllers is also on each fabric &#8212; the server should see 2 controllers on 2 fabrics each, for 4 paths.)</p>
<p>This will cause the script to issue a Nagios CRITICAL if any multipath-registered LUN shows fewer than the given number of paths.</p>
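<p>The threshold logic is simple enough to sketch outside of VBScript. The Python below is not the actual script: the path counts are hard-coded sample data rather than a WMI query, and the function name and message format are invented. Only the exit codes (0 for OK, 2 for CRITICAL) come from the standard Nagios plugin conventions.</p>

```python
OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def check_paths(path_counts, required):
    """Return (status, message) for a mapping of LUN name to path count."""
    degraded = {
        lun: count
        for lun, count in sorted(path_counts.items())
        if required > count
    }
    if degraded:
        detail = ", ".join(f"{lun}={count}" for lun, count in degraded.items())
        return CRITICAL, f"MPIO CRITICAL - fewer than {required} paths: {detail}"
    return OK, f"MPIO OK - {len(path_counts)} LUNs have at least {required} paths"


# Hypothetical sample data; the real script would query the MPIO framework.
sample = {"Disk1": 4, "Disk2": 2, "Disk3": 4}
status, message = check_paths(sample, required=4)
print(status, message)
```

<p>A real plugin would print the message and then exit with the status code so that Nagios can pick it up.</p>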
<p>As usual, you can find the script in the <a href="http://github.com/jgoldschrafe/CheckMpioPaths">GitHub repository for CheckMpioPaths</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://holyhandgrenade.org/blog/2010/05/monitoring-windows-mpio-through-nagios/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Just use someone else&#8217;s coding convention already</title>
		<link>http://holyhandgrenade.org/blog/2010/05/just-use-someone-elses-coding-convention-already/</link>
		<comments>http://holyhandgrenade.org/blog/2010/05/just-use-someone-elses-coding-convention-already/#comments</comments>
		<pubDate>Tue, 25 May 2010 03:30:35 +0000</pubDate>
		<dc:creator>Jeff</dc:creator>
				<category><![CDATA[Sysadmin]]></category>
		<category><![CDATA[coding]]></category>

		<guid isPermaLink="false">http://holyhandgrenade.org/blog/?p=604</guid>
		<description><![CDATA[If there&#8217;s one thing that&#8217;s bugged me throughout my entire coding career, it&#8217;s the fact that I can&#8217;t seem to stick to a single coding style for a given language. Scope decorators, braces, spacing around parentheses, Hungarian notation, variable and method naming conventions: there are so many stupid and trivial things to think about, with [...]]]></description>
			<content:encoded><![CDATA[<p>If there&#8217;s one thing that&#8217;s bugged me throughout my entire coding career, it&#8217;s the fact that I can&#8217;t seem to stick to a single coding style for a given language. Scope decorators, braces, spacing around parentheses, Hungarian notation, variable and method naming conventions: there are so many stupid and trivial things to think about, with so many exceptions and gotchas, that after a while it seems like you end up putting half as much time into figuring out how you&#8217;re going to write your program as you actually do designing and coding it. Months later, I&#8217;ll have an epiphany, and change my coding style, until months after that, I&#8217;ll have another epiphany and change it back. This is an endless cycle.</p>
<p>Some languages are easier than others. Ironically, I have very little issue with Perl, but C++ gives me this headache every time I try to code something. It never resulted in bad code quality &#8212; I don&#8217;t think that any of the conventions, in and of themselves, were bad &#8212; but I occasionally sort of lost sight of what I was actually supposed to be doing.</p>
<p>Recently, I started work on a small C++ hobby project, wasted a ton of time, and got completely sick of this song and dance. I had again spent so long playing with my damn coding conventions that I failed to actually get work done.</p>
<p>I poked around for a little bit, and I ended up just going with <a href="http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml">Google&#8217;s C++ style guide</a>. I didn&#8217;t love it; I didn&#8217;t even really like it. There are a lot of things I completely hate about it. But Google is telling me to shut the fuck up and write some damn code, and it makes it easier to focus on what actually matters: writing (and finishing) a program that does what it&#8217;s supposed to.</p>
]]></content:encoded>
			<wfw:commentRss>http://holyhandgrenade.org/blog/2010/05/just-use-someone-elses-coding-convention-already/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Resilient infrastructures are only useful if they actually stay resilient</title>
		<link>http://holyhandgrenade.org/blog/2010/05/resilient-infrastructures-are-only-useful-if-they-actually-stay-resilient/</link>
		<comments>http://holyhandgrenade.org/blog/2010/05/resilient-infrastructures-are-only-useful-if-they-actually-stay-resilient/#comments</comments>
		<pubDate>Mon, 24 May 2010 21:32:53 +0000</pubDate>
		<dc:creator>Jeff</dc:creator>
				<category><![CDATA[Sysadmin]]></category>
		<category><![CDATA[ha]]></category>
		<category><![CDATA[monitoring]]></category>

		<guid isPermaLink="false">http://holyhandgrenade.org/blog/?p=591</guid>
		<description><![CDATA[Ask yourself a question: for every piece of resiliency you supposedly have in your network, are you really positive that it&#8217;s not running in a degraded state? Really, really sure? Sometimes, it&#8217;s basic: are you being alerted when any disk array attached to any server suffers a disk failure? Very often, it&#8217;s not: for your [...]]]></description>
			<content:encoded><![CDATA[<p>Ask yourself a question: for every piece of resiliency you supposedly have in your network, are you really positive that it&#8217;s not running in a degraded state? Really, <em>really</em> sure?</p>
<p>Sometimes, it&#8217;s basic: are you being alerted when any disk array attached to any server suffers a disk failure?</p>
<p>Very often, it&#8217;s not: for your SAN-attached systems, are you positive that the multipathing is green? If you&#8217;re connected to two storage processors or controllers, can the server see two paths to each of them? Are you getting alerted if you can&#8217;t?</p>
<p>Are your port channels running over the number of links that they&#8217;re supposed to? How about the ISLs on your FC fabrics?</p>
<p>If you have failover clusters where services run on preferred nodes, are you sure they&#8217;re actually located where they&#8217;re supposed to be? Are you monitoring that services are all running on their preferred nodes?</p>
<p>If you have asymmetric fall-back connections, like a gigabit switch uplink used to back up a 10-gigabit switch uplink, are you notified when it&#8217;s using the backup connection, or do you rely on your users to tell you that things seem to be running slowly?</p>
<p>There&#8217;s a difference between things running, and things running <em>smoothly</em>: making sure that your &#8220;redundant&#8221; equipment and services are actually redundant is the key to keeping issues from turning into problems.</p>
]]></content:encoded>
			<wfw:commentRss>http://holyhandgrenade.org/blog/2010/05/resilient-infrastructures-are-only-useful-if-they-actually-stay-resilient/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Charting performance data for IBM Midrange Storage Series SANs with PNP4Nagios</title>
		<link>http://holyhandgrenade.org/blog/2010/05/charting-performance-data-for-ibm-midrange-storage-series-sans-with-pnpnagios/</link>
		<comments>http://holyhandgrenade.org/blog/2010/05/charting-performance-data-for-ibm-midrange-storage-series-sans-with-pnpnagios/#comments</comments>
		<pubDate>Mon, 24 May 2010 16:08:49 +0000</pubDate>
		<dc:creator>Jeff</dc:creator>
				<category><![CDATA[Sysadmin]]></category>
		<category><![CDATA[ibm]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[nagios]]></category>
		<category><![CDATA[pnp4nagios]]></category>
		<category><![CDATA[san]]></category>

		<guid isPermaLink="false">http://holyhandgrenade.org/blog/?p=579</guid>
		<description><![CDATA[If you&#8217;ve used IBM SAN products, particularly the DS4000 and DS5000 series (which are rebranded LSI), one of the most obnoxious things about them is how you&#8217;re pretty much forced to roll your own monitoring tools. Compared to those of many mainstream vendors (and Sun/Oracle in particular), IBM&#8217;s performance monitoring and modelling tools have been lackluster [...]]]></description>
			<content:encoded><![CDATA[<p>If you&#8217;ve used IBM SAN products, particularly the DS4000 and DS5000 series (which are rebranded LSI), one of the most obnoxious things about them is how you&#8217;re pretty much forced to roll your own monitoring tools. Compared to those of many mainstream vendors (and Sun/Oracle in particular), IBM&#8217;s performance monitoring and modelling tools have been lackluster at best and nonexistent at worst. The best tool you&#8217;ve got is the SMcli, which doesn&#8217;t supply a ton of good information, but at least provides you with a starting point for capacity planning.</p>
<p>I had originally wanted to make something like this for Cacti, which probably has a much broader install base than the pnp4nagios addon, but the Nagios way was just so <em>easy</em>, and I&#8217;d like to share it with anyone who doesn&#8217;t want to roll their own basic performance aggregator for it.</p>
<p>This tool gets the following statistics:</p>
<ul>
<li>IOPS</li>
<li>Throughput</li>
<li>Read percentage</li>
<li>Cache hit percentage</li>
</ul>
<p>It gets statistics at the following levels:</p>
<ul>
<li>Logical Unit</li>
<li>Physical Array</li>
<li>Controller</li>
<li>Unit</li>
</ul>
<p>It&#8217;s a little quick-and-dirty, but it works:</p>
<p><img class="alignnone size-medium wp-image-584" title="check_smcli_io" src="http://holyhandgrenade.org/blog/wp-content/uploads/2010/05/check_smcli_io-300x122.png" alt="check_smcli_io" width="300" height="122" /></p>
<p>Like my other projects, it&#8217;s hosted on GitHub, so check out the <a href="http://github.com/jgoldschrafe/check_smcli_io">GitHub project for check_smcli_io</a>.</p>
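<p>For anyone curious how numbers like these reach pnp4nagios: a Nagios plugin prints its status text, a pipe character, and then a list of label=value pairs, and pnp4nagios graphs those pairs. The Python sketch below only illustrates that output format. It is not the code from the repository, and the function name, labels, and sample values are assumptions made up for the example.</p>

```python
# Hypothetical sketch: turn one device's statistics into a Nagios plugin
# output line with performance data. Labels and sample values are made up.


def format_perfdata(device, iops, kb_per_sec, read_pct, cache_hit_pct):
    """Build 'status text | perfdata' in the standard Nagios plugin format."""
    perfdata = " ".join([
        "iops={0:.1f}".format(iops),
        "throughput_kbps={0:.1f}".format(kb_per_sec),
        # Percentages carry the % unit plus min and max, so graphs scale 0-100.
        "read_pct={0:.1f}%;;;0;100".format(read_pct),
        "cache_hit_pct={0:.1f}%;;;0;100".format(cache_hit_pct),
    ])
    return "OK: {0} statistics collected | {1}".format(device, perfdata)


line = format_perfdata("Logical Drive data01", 412.5, 18230.0, 71.3, 64.0)
print(line)
```

<p>The empty fields between the semicolons are the warning and critical thresholds, left blank here because the sketch only reports values and does not alert on them.</p>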
]]></content:encoded>
			<wfw:commentRss>http://holyhandgrenade.org/blog/2010/05/charting-performance-data-for-ibm-midrange-storage-series-sans-with-pnpnagios/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
