When we as computer users think of disk performance, we usually think about streaming, sequential performance, otherwise known as throughput. Desktop operating systems have trained us to think in this way, because the most prominent display of disk speed that your average person sees is an Explorer or Finder window showing file copy progress — we know that our music collection is being copied at 25 MB per second, for example. This measurement is a good fit for the task, because it gives us the best approximation of how long it will take until the file copy is finished.
In the server world, though, this generally isn’t how disk performance is measured. Servers are shared resources that do much more complicated things with data than typical desktop systems. Most database access is highly random — you pull a record here, a record there, and piece them together in the application. Rows in a MySQL table are usually no more than a couple of kilobytes each, and the rows you need to join together to service one complex query typically live all over the disk. For most other server-side applications, small files are accessed a lot, and large files are accessed infrequently. So instead of throughput, which is measured in bytes/sec, we typically work with a different measurement called IOPS (pronounced i-ops).
IOPS stands for I/O Operations Per Second, and it refers to the average number of random small reads and writes that a disk drive can perform in one second. Let’s start looking at some numbers and calculating something useful.
IOPS for a single [spinning] disk
Even though SSDs are gaining a lot of traction on servers because they supply a lot of random IOPS, I’m going to ignore them here. Why? Because their electronics are too sophisticated to model like spinning disks. To get the IOPS numbers for an SSD, your best bet is to look up some independent benchmarks. If those are unavailable, run your own. If that’s not doable, well, you’ll just have to trust the numbers that your vendor gives you.
To calculate for a single spinning disk, though, you just need two numbers: the disk’s rotational speed, and its seek time. Below is a table I’ve shamelessly lifted from Ronny Egner:
| Quantity | Formula | 7200 RPM | 10,000 RPM | 15,000 RPM |
|---|---|---|---|---|
| RPM (revolutions per minute) | See vendor data sheet | 7200 | 10000 | 15000 |
| RPS (revolutions per second) | RPM / 60 seconds per minute | 120 | 166.67 | 250 |
| RPms (revolutions per millisecond) | RPM / 60,000 milliseconds per minute | 0.12 | 0.17 | 0.25 |
| Full rotation time in ms | 1 / RPms | 8.33 | 6 | 4 |
| Avg. rotational latency in ms | ½ full rotation time | 4.17 | 3 | 2 |
| Avg. seek time in ms | See vendor data sheet | 10 | 5 | 4 |
| IO time in ms | Avg. rotational latency + avg. seek time | 14.17 | 8 | 6 |
| IOPS | (1 / IO time) × 1000 | 70.59 | 125 | 166.67 |
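The whole table boils down to one small formula, which you can sketch in a few lines of Python (the function name and the sample inputs are mine, taken from the table above, not from any vendor tool):

```python
def disk_iops(rpm, avg_seek_ms):
    """Estimate random IOPS for a single spinning disk.

    rpm:         spindle speed, from the vendor data sheet
    avg_seek_ms: average seek time in milliseconds, from the data sheet
    """
    full_rotation_ms = 60000 / rpm              # one revolution, in milliseconds
    avg_rotational_latency_ms = full_rotation_ms / 2
    io_time_ms = avg_rotational_latency_ms + avg_seek_ms
    return 1000 / io_time_ms                    # operations per second

print(round(disk_iops(7200, 10), 2))   # 70.59
print(round(disk_iops(15000, 4), 2))   # 166.67
```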
As you can see, an average 7200 RPM disk will net you about 70 IOPS, while a 15,000 RPM disk will give you closer to 170. These numbers can be affected by NCQ, caching, and other drive features. (In a RAID array, many of these drive features are disabled, because the controller does better when performance is more deterministic.)
IOPS for a RAID array
I’m going to ignore caching for simplicity.
Things get complicated here because of the huge variance in the ways that RAID arrays work — I’m sure some people will disagree with the observations I’ve made here. I’m going to assume you have a passing familiarity with the most common RAID levels — if not, Wikipedia has a decent reference. However, I’m going to clarify a handful of definitions I’m going to be using, because the storage industry can’t agree on nomenclature:
- Segment size: The amount of data written to a single disk within a RAID stripe.
- Stripe width: The amount of data contained in a single RAID stripe (segment size × number of data-bearing disks).
When calculating stripe width for RAID-4/5/6, do not include disks used for storing parity.
RAID-0 (striping without mirroring)
RAID-0 doesn’t need to do any special calculations, and it doesn’t need to write anything twice. As a result, all of the disks can be used at the same time for random reads and writes with no penalties. All stripe width calculations should be made with regard to sequential I/O, rather than random I/O.
IOPS: (IOPS per disk × number of disks)
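As a sketch, the RAID-0 formula is just multiplication (the per-disk IOPS figure comes from the single-disk table earlier; the function name is mine):

```python
def raid0_iops(per_disk_iops, n_disks):
    # RAID-0: every disk serves random reads and writes independently,
    # so the array scales linearly with the number of disks.
    return per_disk_iops * n_disks

print(raid0_iops(125, 8))  # 1000
```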
RAID-0+1 (striping with mirroring)
I’m folding RAID-1 in here.
Most controllers will interleave reads between the mirrored disks in the array, which will double your random read performance versus a single disk. Note that this only speeds up random access; sequential I/O is bottlenecked by rotational speed rather than seek time, so mirroring doesn’t help it. Write performance is slightly worse than that of a single disk, because the same data needs to be written to two disks: for a given write, whichever drive has the longer seek time slows you down. Caching usually makes this irrelevant.
Read IOPS: (IOPS per disk × number of disks)
Write IOPS: (IOPS per disk × (number of disks / 2))
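Those two formulas can be sketched together, assuming `n_disks` counts every spindle in the array, mirrors included (the function name and example numbers are mine):

```python
def raid10_iops(per_disk_iops, n_disks):
    """Rough read/write IOPS for RAID-1 or RAID-0+1.

    n_disks includes the mirror disks, so a 4-disk RAID-0+1
    holds only 2 disks' worth of data.
    """
    read_iops = per_disk_iops * n_disks           # reads interleave across mirrors
    write_iops = per_disk_iops * (n_disks / 2)    # every write lands on two disks
    return read_iops, write_iops

print(raid10_iops(125, 4))  # (500, 250.0)
```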
RAID-5 (striping with distributed parity)
RAID-5 is a really complicated case that’s heavily reliant on the relationship between your stripe width and your average I/O size, and that complication is why most major database vendors like Oracle will recommend never running on RAID-5.
Read IOPS: (IOPS per disk × (number of disks – 1))
Write IOPS: (IOPS per disk × (number of disks – 1) × RAID-5 write penalty)
The write penalty is a scaling factor that varies depending on how well your workload matches your array configuration. At best, there’s virtually no penalty at all. At worst, you might be getting 20% of your expected disk performance. I’m going to go over why that is, and how to optimize your RAID-5 arrays, in Part 2.
(Duncan Epping over at Yellow Bricks has a post where he uses some constant ratios for his RAID write penalties. I have a philosophical disagreement with this idea, but I’ll link to it because my answer of “it depends” isn’t really constructive either.)
RAID-6 (striping with double parity)
Performance characteristics are almost identical to RAID-5, with two significant differences. First, the parity calculations take an order of magnitude more processing power, because the algorithms are much more sophisticated. Second, two disks in each stripe are reserved for parity data — these disks will not contribute to your IOPS.
Read IOPS: (IOPS per disk × (number of disks – 2))
Write IOPS: (IOPS per disk × (number of disks – 2) × RAID-6 write penalty)
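Since the RAID-5 and RAID-6 formulas differ only in the number of disks lost to parity, both can be sketched with one function. The `write_penalty` argument is the workload-dependent factor from above (near 1.0 when your I/O lines up well with the stripe, as low as 0.2 at worst); the function name and example numbers are mine:

```python
def parity_raid_iops(per_disk_iops, n_disks, parity_disks, write_penalty):
    """Rough IOPS for RAID-5 (parity_disks=1) or RAID-6 (parity_disks=2).

    write_penalty scales writes only; "it depends" on how well the
    workload's I/O size matches the array's stripe width.
    """
    data_iops = per_disk_iops * (n_disks - parity_disks)
    return data_iops, data_iops * write_penalty

# Six 10k-RPM disks (~125 IOPS each) in RAID-6, with a pessimistic penalty
print(parity_raid_iops(125, 6, 2, 0.2))  # (500, 100.0)
```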
Aggregating I/O profiles
One final thing to note: if you have enough concurrent sequential I/O tasks running at the same time, your I/O profile turns from sequential to random. The array is trying to keep these requests from being starved, and slow data is usually better than no data at all, so it starts seeking all over the place instead of streaming a nice, even line of consecutive blocks off the disk. Keep this very much in mind when deciding whether IOPS or sequential throughput is the more useful measurement for the workload you’re trying to size.
In Part 2, I’ll go over the impact of stripe sizing and how almost everybody does it completely wrong.