Disk Performance, Part 2: RAID Layouts and Stripe Sizing

In Part 1, I discussed how storage performance is typically measured in random IOPS, and talked about how to calculate them for a single spinning disk and a RAID array. Today, I’m going to get into the nitty-gritty of striping in RAID-5 and RAID-6, and discuss how to determine the optimal stripe width for your server configuration.

For a lot of workloads, this will be premature optimization. I’d advise you not to think too hard about your storage subsystem unless you’re actually worried that you will be I/O-constrained. Most of these considerations, implemented appropriately, will cut down on your total number of disk operations, but won’t make things faster on an undersubscribed system, where rotational latency and seek times are probably your only pertinent bottlenecks. It’s a better idea to invest your time elsewhere, like finding ways to make your systems easier to manage.

Also note that this article won’t tell you how to get all the numbers you need to properly size your array — I plan on getting to that in the near future — but I hope to give you an understanding of what to watch out for, as well as a starting point for figuring out how to profile your own applications.

Revisiting nomenclature

From Part 1:

  • Segment size: The amount of data written to a single disk within a RAID stripe.
  • Stripe width: The amount of data contained in a single RAID stripe (segment size × number of data-bearing disks).
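To make the arithmetic concrete, here’s a quick sketch in Python, assuming a hypothetical five-disk RAID-5 array (four data-bearing disks) with 64 KiB segments:

    # Hypothetical example: a 5-disk RAID-5 array, so 4 disks carry data in each stripe.
    data_disks = 4
    segment_size_kib = 64                             # data written to one disk per stripe
    stripe_width_kib = segment_size_kib * data_disks  # total data in one full stripe
    print(stripe_width_kib)                           # 256 (KiB)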

Understand your application

If there’s one thing I need to hammer on over and over and over, it’s that you need to understand your application in order to make storage decisions. In particular, there are a few details you should pay attention to, and in the next few days I’ll cover how to find them.

First, if you’re running a commercially-supported application, your vendor probably has some advice on how your RAID array should be configured. That should be your starting point. If you can’t find any specific recommendations, you may still be able to dig up useful information about how the software does its I/O.

For now, though, I’ll cover a few basic things you should be asking.

  • Type of I/O: Is your workload predominantly sequential or random? What’s your percentage of reads to writes?
  • Size of I/O: What is the typical read or write size in your application? How much data does it read or write at once, and how much of that gets buffered before it goes to disk?
  • Coalescing: Does your application batch writes in order to cut down on the number of discrete I/O operations before sending them to disk? Does your OS? Does your filesystem?
  • Alignment: It doesn’t matter if your application requests data in nice, even, stripe-sized chunks if those don’t line up perfectly with your data on disk. Much of the time, despite the best efforts of application developers, the underlying filesystem, volume manager, or partition table can introduce unwanted alignment problems that split your I/O over disks or between RAID stripes. I’ll be covering this more in Part 3.

Understand your vendor

For the remainder of this post, I’m basically going to ignore caching. It’s incredibly important — maybe more important to your performance than all the disk-level recommendations in here combined — but each vendor does it so differently that it’s impossible to make useful generalizations. The important thing is that your controller has a battery-backed write-back cache, that the battery is installed, and that your cache is working.

Please, don’t take anything I say here as gospel. There are huge variances in the way things are implemented between RAID controllers. Certain optimizations that work on one type of card may not work on another. Certain controllers, interfaces, or storage networks flat-out might not perform well in certain configurations.

Bottom line: read your documentation, and consider your vendor’s recommendations.

Mixing workload types: don’t

In my first draft of this post, I forgot this. It’s important.

There are two main types of workloads: sequential and random. Do not mix these on the same array because your random I/O will screw up your sequential I/O by making your drives seek all over the place.

If possible, keep your reads and writes separate as well — this generally reduces contention. For example, if you’re running a database with a separate transaction log, like Microsoft SQL Server or Oracle, keep it on a separate volume. If you’re running an XFS filesystem that’s doing a lot of random I/O, you can keep the journal device on another array for better performance. (Note that this may add another point of failure for your volume, and that may not be acceptable.)

If you’re using a SAN that allows you to create multiple LUNs backed by the same physical array, keep in mind that those LUNs share the same set of disks, and from a disk performance perspective it makes almost no difference whether one LUN or a hundred are being written to. Contention on the array is contention on the array, regardless of whether it’s the same filesystem or not.

Segment sizing

Segment sizes have different impacts, and are arrived at in different ways, depending on what type of array you’re using.

Striping without parity (RAID-0, RAID-0+1)

Because you’re not calculating parity, stripe width is literally irrelevant. That makes calculating your ideal segment size a whole lot easier. I’m going to go over the four main kinds of I/O, and my recommendations for how to deal with them.

Sequential reads: If your workload requests very large I/O sizes for long periods of time, like processing very large files with very large reads, you’ll benefit from a smaller segment size so you can stream off of multiple disks at once — aim for the largest segment size that still lets you saturate all of your disks. If your workload instead synchronously asks for small pieces of data at a time, you’ll get better concurrency if you use larger segment sizes and leave your other disks free to service requests from other processes/threads.

To arrive at a number, start with a large segment size and incrementally decrease it until the next step down doesn’t get any faster. If you’re not sure, 128k-512k is usually a good range.
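If you want to experiment, here’s a rough sketch of that process in Python. The file path is a placeholder for a large pre-created test file, and on Linux you’d want to drop the page cache (or use direct I/O) between runs so you measure the disks rather than RAM:

    import time

    TEST_FILE = "/data/stripe-test.bin"   # hypothetical large test file
    BLOCK_SIZES = [512 * 1024, 256 * 1024, 128 * 1024, 64 * 1024]

    for bs in BLOCK_SIZES:
        start = time.monotonic()
        total = 0
        with open(TEST_FILE, "rb", buffering=0) as f:   # unbuffered sequential read
            while True:
                chunk = f.read(bs)
                if not chunk:
                    break
                total += len(chunk)
        elapsed = time.monotonic() - start
        print(f"{bs // 1024:>4} KiB blocks: {total / elapsed / 1e6:.1f} MB/s")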

Sequential writes: As with sequential reads, if your application streams sequential data to disk very quickly, set your segment size a bit smaller so your controller will be able to saturate multiple disks at once. If it doesn’t issue very large writes to the controller, you’ll benefit from a much larger segment size; this will help keep your other disks free while one disk is being written at a time.

The same measurement approach applies: start with a large segment size and step down until the next decrease doesn’t get any faster. If you’re not sure, 128k-512k is usually a good range.

Random reads: The important detail with random reads is that each read operation should come from as few disks as possible. You want to set your segments to be at least as large as your average read size to minimize the number of disks needed for any particular read. For a huge majority of applications, setting it too large won’t have nearly as much of an impact as setting it too small.

Profiling your application will get your ideal numbers, but anything smaller than 32k generally isn’t recommended — in addition to the disk I/O penalties, segment sizes this small tend to overburden the controller and cause latency problems. If you’re not sure, 64k-128k will get you good all-around performance with most applications that are heavy on small random reads. If your random reads are larger and pseudo-sequential, like in Microsoft Exchange 2010, you may want to go as high as 256k.

Random writes: As with reads, each write should go to as few disks as possible; sizing your segments too small causes unnecessary seeks and latency. Your software documentation should help you determine the best size for random writes. If you’re not sure, a 64k-128k segment size usually works very well, with some vendors recommending 256k or even higher. Again, run your own benchmarks and draw your own conclusions.

Striping with parity (RAID-5, RAID-6)

With RAID-5 and RAID-6 and mixed read/write workloads, you should typically determine your optimal stripe width, and then use that number to calculate the appropriate segment size. This can be complicated to do correctly, so it will take me the next few sections to completely explain.

How does RAID-5 really work?

Warning: there be math and binary numbers ahead.

RAID-5 uses parity, a sort of binary checksum, to facilitate drive rebuilds.

Consider the following programming problem, which I’ve been asked a few times at job interviews:

You’re given a list of 99 integers from 1 to 100 inclusive. Each integer in the list can occur only one time. Find which integer is missing from the list.

If you’re a math nerd, this should be very straightforward: take the sum of the numbers from 1 to 100 (which is 5050), subtract the sum of all the numbers in the list, and you’ll end up with the one that’s missing. After all, we know from middle school algebra that if 5050 – x = 5011, there can only be one value for x.
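Here’s that approach as a few lines of Python (the sample list, with 39 missing, is arbitrary):

    def find_missing(numbers):
        expected = 100 * 101 // 2      # sum of 1..100 = 5050
        return expected - sum(numbers)

    sample = [n for n in range(1, 101) if n != 39]
    print(find_missing(sample))        # 39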

RAID-5 works on the same principle, but instead of plain addition and subtraction, it uses a special binary operation called XOR (exclusive or). XOR has the following truth table:

XOR   0   1
 0    0   1
 1    1   0

One way to think of it is that for X XOR Y, if Y is 1 then you flip the value of X.

Wikipedia notes the following important property of exclusive or operations.

If using binary values for true (1) and false (0), then exclusive or works exactly like addition modulo 2.

This is exactly what we’re doing: we’re taking the sum of each bit position across the stripe, and throwing out everything except the least-significant bit.

So let’s start with an over-simplified case: a handful of blocks in a single RAID stripe. In real life, each block would be several kilobytes, but I don’t have room for that in a table, so we’ll pretend each block is one byte instead. The algorithm stays the same.

To calculate the parity for a block, you simply XOR each byte together. In the following chart, Old Parity is the running total up to this point (i.e. each block’s Old Parity is the previous block’s New Parity), and New Parity is the value after XORing each data block into the parity block.

RAID Block     Example Byte   Old Parity   New Parity
Block 1        10101010       00000000     10101010
Block 2        11001100       10101010     01100110
Block 3        11011011       01100110     10111101
Block 4        00010001       10111101     10101100
Parity Block   10101100
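Here’s the same running calculation as a few lines of Python, using the example bytes from the table:

    blocks = [0b10101010, 0b11001100, 0b11011011, 0b00010001]

    parity = 0
    for b in blocks:
        parity ^= b            # the "New Parity" column, one block at a time
    print(f"{parity:08b}")     # 10101100, matching the parity block above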

Then a disk fails, and we don’t know the contents of one block in the stripe:

RAID Block     Example Byte
Block 1        10101010
Block 2        ?
Block 3        11011011
Block 4        00010001
Parity Block   10101100

We just reverse the process and XOR all the remaining numbers together to get our disk’s contents back:

RAID Block     Example Byte   Block 2 Before XOR   Block 2 After XOR
Block 1        10101010       00000000             10101010
Block 3        11011011       10101010             01110001
Block 4        00010001       01110001             01100000
Parity Block   10101100       01100000             11001100
Block 2        11001100
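In Python, the rebuild is the same loop in reverse, starting from the parity block:

    surviving = [0b10101010, 0b11011011, 0b00010001]   # blocks 1, 3, and 4
    parity = 0b10101100

    missing = parity
    for b in surviving:
        missing ^= b           # XOR the survivors back out of the parity
    print(f"{missing:08b}")    # 11001100 -- block 2's original contents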

The way that parity is calculated can put all sorts of extra strain on your disks.

Stripe widths and the performance impact of parity

A common misconception among many system administrators is that because most hardware RAID cards perform these XOR operations in hardware using specialized accelerator chips, RAID-5 writes should be fast. This isn’t true; there’s actually substantial disk-level slowdown involved with parity calculations, and those performance hits will never go away.

The layout of data on a RAID array

Recapping the above, a five-disk RAID-5 array might look like this:

           Disk 1   Disk 2   Disk 3   Disk 4   Disk 5
Stripe A   A1       A2       A3       A4       Ap
Stripe B   B1       B2       B3       Bp       B4
Stripe C   C1       C2       Cp       C3       C4
Stripe D   D1       Dp       D2       D3       D4
Stripe E   Ep       E1       E2       E3       E4
Stripe F   F1       F2       F3       F4       Fp
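As a sketch, here are a few lines of Python that generate this kind of rotating-parity layout. This is one common arrangement (the one shown in the table above); real controllers may rotate parity differently:

    def layout(n_disks, n_stripes):
        """Parity starts on the last disk and shifts one disk left per stripe."""
        rows = []
        for s in range(n_stripes):
            parity_disk = (n_disks - 1) - (s % n_disks)
            row, block = [], 1
            for d in range(n_disks):
                if d == parity_disk:
                    row.append("P")
                else:
                    row.append(str(block))
                    block += 1
            rows.append(row)
        return rows

    for name, row in zip("ABCDEF", layout(5, 6)):
        print("Stripe " + name, " ".join(row))   # e.g. "Stripe B 1 2 3 P 4"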

Boundary crossings

Every line separating cells in the table above is a boundary — if a single read or write touches more than one block, it’s making a boundary crossing. Each of these boundary crossings (inter-disk or inter-stripe) incurs a different performance hit, which I’ll describe momentarily.

The parity block of a given stripe has to always be consistent with the data in it. This means that it’s recalculated, updated, and stored again on every single write to the stripe.

Now, recall from the above section that the parity of a stripe is calculated by XORing the blocks together to get a unique, reversible result. To update it for a write, we first back the old data out of the parity block by XORing the parity block with the written block’s old value, then XOR the new value in. To do this, though, we need to know the value of the block we’re replacing. In other words: in order to update the parity, we need to read each block in the RAID stripe before it’s overwritten, plus the parity block itself. Ouch.
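In code, the update is just two XORs. Reusing the bytes from the tables above, and a made-up new value for block 2:

    old_parity = 0b10101100
    old_data   = 0b11001100    # block 2's current contents (must be read first)
    new_data   = 0b00001111    # hypothetical new contents being written

    new_parity = old_parity ^ old_data ^ new_data   # back out the old value, fold in the new
    print(f"{new_parity:08b}")                      # 01101111, same as recomputing from scratch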

Let’s say you have a 5-disk RAID-5 array (4 data blocks and 1 parity block), with a 32 KiB segment size on each disk, giving you a 128 KiB stripe width. You then write 64 KiB of data to the beginning of the stripe, which is enough to completely overwrite the first two blocks, but not enough to overwrite the entire RAID stripe.

Because you need to read every block you’re writing (plus the old parity), your three apparent disk operations (2 data writes and 1 parity write) become six operations instead (2 data reads, 1 parity read, 2 data writes, and 1 parity write). You’re literally doubling the amount of I/O to facilitate a single operation. You’re cutting your write performance in half because of the disk I/O overhead in updating the parity block.

(Sidebar: A good RAID implementation will never need to read more than half the disks in a stripe to calculate parity. You can either read the disks you’re writing, and adjust the parity block accordingly, or you can read the disks you aren’t writing and just calculate a new parity block from scratch.)
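Here’s a small sketch of that accounting. The function names are my own, and the counts ignore caching and coalescing entirely:

    def rmw_ops(blocks_written):
        """Read-modify-write: read old data + old parity, write new data + new parity."""
        return (blocks_written + 1) + (blocks_written + 1)

    def reconstruct_ops(blocks_written, data_disks):
        """Reconstruct-write: read the untouched data blocks, write new data + new parity."""
        return (data_disks - blocks_written) + (blocks_written + 1)

    # The example above: 5-disk RAID-5 (4 data disks), 2 of the 4 data blocks written.
    print(rmw_ops(2))              # 6 physical operations
    print(reconstruct_ops(2, 4))   # 5 -- reading the untouched blocks is cheaper here
    print(reconstruct_ops(4, 4))   # 5 -- a full-stripe write needs no reads at all

A good controller will pick whichever strategy needs fewer reads for a given write.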

It’s much easier to throw out the parity information altogether and calculate it from scratch, which we can do when we make full-stripe writes. When writing an entire stripe to disk, the controller already has all of that stripe’s data in memory, and can just calculate the parity without performing any extra reads. Being able to perform nothing but full-stripe writes is the holy grail of RAID-5 write performance, but it can hurt your read performance.

Array sizing for full-stripe write performance

Most performance-sensitive database applications will write blocks or pages that are 2^n bytes large, e.g. 4k, 8k, 32k, and so forth. In an ideal scenario, you want your stripe width to match your write size in order to eliminate stripe boundary crossings and take advantage of full-stripe writes. If you can’t do that because the writes are too small, you want your per-disk segment size to match your write size in order to limit the number of disks that have to be read when re-calculating parity.

In order to maximize those full-stripe writes, you have to carefully consider the number of drives in your array, and not just the segment size. If your main application writes randomly in 32k chunks, a 6-disk RAID-5 (with 5 data blocks per stripe) will never be able to have a 32k stripe width. A 5-disk RAID-5 (with 4 data blocks per stripe) can achieve this easily, though, with an 8k segment size.

In order to properly size and stripe your array, you need to do the following things:

  1. Profile your server’s workload to determine your typical write size
  2. Calculate your target stripe width, which should generally be 2^n, based on your typical write size
  3. Figure out what segment size and disk count will get you to that number

This is an ideal. For lots of applications, write sizes are unpredictable — it’s a fact of life. With a bit of luck, a well-designed application, and a good filesystem, you can minimize these variances.
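As a sketch of step 3, here’s a short loop that, for a target full-stripe write size, lists which disk counts and segment sizes line up. Real controllers only offer specific segment sizes, so treat this as a starting point rather than a rule:

    TARGET_STRIPE_KIB = 32        # typical write size found by profiling

    for data_disks in range(2, 9):
        if TARGET_STRIPE_KIB % data_disks == 0:
            segment = TARGET_STRIPE_KIB // data_disks
            print(f"{data_disks + 1}-disk RAID-5: "
                  f"{data_disks} x {segment} KiB segments = {TARGET_STRIPE_KIB} KiB stripe")

The output includes the 5-disk, 8 KiB-segment case from the example above.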

The big tradeoff

You’ve probably figured out by now that RAID-5 tends to have better write performance when stripe widths are small (but not so small that they cause latency issues on the controller), and better read performance when stripe widths are large. You will never, ever get great performance at both. Don’t even try. But hopefully, the several thousand useless words I’ve just spit out on RAID-5 will get you good enough app performance that you won’t want to hang yourself.

Next steps

I’m hoping that Part 3 will cover disk alignment, and Part 4 will cover how to profile your applications on Linux, Solaris and Windows.
