
The silent revolution: the point where flash storage takes over for HDD

For years, the old paradigm held true: if you wanted fast, you bought flash, but it was going to cost you. If you wanted cheap, large HDD-based servers were the go-to. The old standby for reasonably fast block storage, the 2TB x 24 chassis, was ubiquitous, and for years it looked like flash would be relegated to performance tiers. But is this actually true? I’ve been suspicious for some time, but a few days ago I did a comparison for an internal project, and what I saw surprised even me.

Flash storage technology and comparative cost

Flash storage has developed at a breakneck pace in recent years. Not only have devices become more resilient and faster, but there is also a new interface to consider. Traditional SSDs are SATA, or in relatively rare cases, SAS based. This limits the performance envelope of the devices severely. SATA SSDs top out at about 550MB/s maximum throughput, and offer around 50k small file input/output operations per second (IOPS) regardless of the speed of the actual chips inside the device. 

This limitation is due to the data transfer speed of the bus and the need to translate the storage access request to a disk based protocol (SATA/SAS) and, inside the SSD, back to a memory protocol. The same thing happens on the way out when data is being read. 
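The ~550MB/s ceiling falls straight out of the link arithmetic. A minimal sketch (the overhead percentage is an assumed round figure, and the per-lane PCIe number is the commonly cited usable bandwidth):

```python
# Rough arithmetic behind the SATA SSD performance ceiling (illustrative figures).
SATA3_LINE_RATE_GBIT = 6.0       # SATA III line rate: 6 Gbit/s
ENCODING_EFFICIENCY = 8 / 10     # 8b/10b encoding: 10 line bits per 8 data bits

# Payload bandwidth of the link itself:
raw_mb_s = SATA3_LINE_RATE_GBIT * 1000 / 8 * ENCODING_EFFICIENCY   # 600 MB/s
# Command and protocol overhead eats roughly another 8% in practice
# (assumed figure), leaving the familiar ~550 MB/s plateau:
practical_mb_s = raw_mb_s * 0.92

# Contrast with an x4 PCIe 3.0 NVMe link (~985 MB/s usable per lane):
nvme_x4_mb_s = 4 * 985

print(f"SATA III payload bandwidth: {raw_mb_s:.0f} MB/s")
print(f"SATA III practical ceiling: {practical_mb_s:.0f} MB/s")
print(f"NVMe x4 (PCIe 3.0) link:    {nvme_x4_mb_s} MB/s")
```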

Enter Non-Volatile Memory express (NVMe). This ‘interface’ is essentially a direct connection of the flash storage to PCIe lanes. A configuration of 4 lanes per NVMe is common, though technology exists to multiplex NVMes so more devices can be attached than there are PCIe lanes available. 

NVMe devices typically top out above 2GB/s, and can offer several hundred thousand IOPS – theoretically. They also consume a lot more CPU when operating in a software defined storage environment, which limits performance somewhat. However, in practical application they are still much faster than traditional SSDs – at what is usually a very moderate cost delta. 

If the performance of SATA SSDs is insufficient for a specific use case, moving to SAS SSDs is usually not worth the expense. NVMe devices offer much better performance and usually cost no more than their SAS counterparts, so moving directly to NVMe is preferable.

One more note: if NVMe devices operate with the same number of CPU cores as SATA SSDs, they are still somewhat faster and financially very comparable. The calculations below include additional CPU cores for NVMe in performance-oriented configurations.

Let’s look at how the numbers work out for different situations.

Small Environments

Let’s have a look at a 100TB environment with increasing performance requirements. Consider the following table that looks at HDDs, SSDs, and NVMe. Street prices are in US$x1000, and IOPS are rough estimates:

100TB cluster (all costs in US$ x1000):

IOPS  | HDD 6TB 12/4U cost | HDD 2TB 20/2U cost | SSD layout     | SSD cost | NVMe layout    | NVMe cost
10k   | 132                | 135                | 5x 10x 7.68TB  | 91       | 5x 4x 15.36TB  | 102
30k   | 345                | 271                | 5x 10x 7.68TB  | 91       | 5x 4x 15.36TB  | 102
50k   | 559                | 441                | 5x 10x 7.68TB  | 91       | 5x 4x 15.36TB  | 102
100k  | 1,117              | 883                | 5x 10x 7.68TB  | 91       | 5x 4x 15.36TB  | 102
200k  | 2,206              | 1,767              | 7x 14x 3.84TB  | 113      | 5x 10x 7.68TB  | 113
500k  | 5,530              | 4,419              | 14x 14x 1.92TB | 168      | 7x 14x 3.84TB  | 133
1000k | 11,034             | 8,804              | 42x 14x 1.92TB | 470      | 13x 14x 2TB    | 168

In this relatively small cluster, as expected, HDDs are no longer viable. The more IOPS required, the more extra capacity must be purchased to provide enough spindles. This culminates in a completely absurd $11 million for a 1000K IOPS cluster built on 6TB hard disks.
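The spindle-count effect is easy to reproduce. A back-of-the-envelope sketch, assuming a rough 150 random IOPS per HDD spindle (an assumed figure; real numbers depend heavily on workload):

```python
# Why IOPS requirements blow up HDD cost: you buy spindles, not capacity.
# Illustrative figure: one HDD delivers very roughly 150 random IOPS.
def hdds_needed(target_iops, capacity_tb, hdd_tb=6, iops_per_hdd=150):
    by_iops = -(-target_iops // iops_per_hdd)   # ceiling division
    by_capacity = -(-capacity_tb // hdd_tb)
    return max(by_iops, by_capacity)

for iops in (10_000, 100_000, 1_000_000):
    n = hdds_needed(iops, capacity_tb=100)
    print(f"{iops:>9} IOPS -> {n} x 6TB drives ({n * 6} TB raw for a 100TB need)")
```

Past a modest IOPS target, the drive count is set entirely by the IOPS column, and the capacity you end up buying is pure dead weight.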

Middle of the Road

Of course, we all know that larger amounts of SSD storage are more expensive, so let’s quadruple storage requirements and see where we get. HDDs should become more viable, wouldn’t you think?

400TB cluster (all costs in US$ x1000):

IOPS  | HDD 6TB 12/4U cost | HDD 2TB 20/2U cost | SSD layout     | SSD cost | NVMe layout     | NVMe cost
10k   | 250                | 510                | 14x 14x 7.68TB | 326      | 7x 14x 15.36TB  | 348
30k   | 405                | 510                | 14x 14x 7.68TB | 326      | 7x 14x 15.36TB  | 348
50k   | 655                | 510                | 14x 14x 7.68TB | 326      | 7x 14x 15.36TB  | 348
100k  | 1,311              | 883                | 14x 14x 7.68TB | 326      | 7x 14x 15.36TB  | 348
200k  | 2,593              | 1,767              | 14x 14x 7.68TB | 326      | 7x 14x 15.36TB  | 348
500k  | 6,495              | 4,419              | 14x 14x 7.68TB | 326      | 7x 14x 15.36TB  | 348
1000k | 12,961             | 8,804              | 27x 14x 3.84TB | 437      | 14x 14x 7.68TB  | 413

Surprise! Again we find that HDD is only viable for the slower speed requirements of archival storage. Note that the 15.36TB NVMe solution is not much more expensive than the SSD solution!

A note about chassis: To get good performance out of NVMe devices, a lot more CPU cores are needed than in HDD based solutions. Four OSDs per NVMe and 2 cores per OSD are a rule of thumb. This means that stuffing 24 NVMes into a 2U chassis and calling it a day is not going to provide exceptional performance.  We recommend 1U chassis with 5-8 NVMe devices to reduce bottlenecking on the OSD code itself. (I’m also assuming that the network connectivity is up to transporting the enormous amount of data traffic.)
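The rule of thumb above translates into core counts as follows (a sketch using the four-OSDs-per-NVMe, two-cores-per-OSD figures from the text):

```python
# Rule-of-thumb CPU sizing for NVMe OSD nodes:
# 4 OSDs per NVMe device, 2 CPU cores per OSD.
OSDS_PER_NVME = 4
CORES_PER_OSD = 2

def cores_for(nvme_count):
    """CPU cores consumed by OSD work for a node with this many NVMe devices."""
    return nvme_count * OSDS_PER_NVME * CORES_PER_OSD

for n in (5, 8, 24):
    print(f"{n:>2} NVMe devices -> {cores_for(n)} cores of OSD work")
```

A 24-device 2U box would need 192 cores just for OSDs, which is why the 1U, 5-8 device layout (40-64 cores) is the more balanced design.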

Petabyte Scale

If we enter petabyte scale, hard disks become slightly more viable, but at this scale (we are talking 64 4U nodes) the sheer physical size of the hard disk based cluster can become a problem:

1PB cluster (all costs in US$ x1000):

IOPS  | HDD 6TB 12/4U cost | HDD 2TB 20/2U cost | SSD layout     | SSD cost | NVMe layout      | NVMe cost
10k   | 453                | 1,257              | 34x 14x 7.68TB | 789      | 17x 14x 15.36TB  | 871
30k   | 453                | 1,257              | 34x 14x 7.68TB | 789      | 17x 14x 15.36TB  | 871
50k   | 488                | 1,257              | 34x 14x 7.68TB | 789      | 17x 14x 15.36TB  | 871
100k  | 1,850              | 1,257              | 34x 14x 7.68TB | 789      | 17x 14x 15.36TB  | 871
200k  | 2,101              | 1,767              | 34x 14x 7.68TB | 789      | 17x 14x 15.36TB  | 871
500k  | 4,619              | 4,365              | 34x 14x 7.68TB | 789      | 17x 14x 15.36TB  | 871
1000k | 10,465             | 8,720              | 34x 14x 7.68TB | 789      | 17x 14x 15.36TB  | 871

Note: The performance data for SSD and NVMe OSDs is estimated conservatively. Depending on the use case performance will vary.

So what do we learn from all this? 

The days of HDD are numbered. 

For most use cases even today SSD is superior. Also, SSD and NVMe are still nosediving in terms of cost/unit. SSD/NVMe based nodes also make for much more compact installations and are a lot less vulnerable to vibration, dust and heat. 

The health question

Of course, cost isn’t the only issue. SSDs do wear. The current crop is far more resilient over the long term than SSDs from a couple of years ago, but they will still eventually wear out. On the other hand, unlike HDDs, SSDs are not prone to sudden catastrophic failure triggered by a mechanical event or marginal manufacturing tolerances.

The good news is that, in almost all cases, SSDs do not fail suddenly. They develop bad blocks, which for a time are replaced with fresh blocks from an invisible reserve capacity on the device; wear leveling does all this automatically. You will not see capacity degradation until the reserve runs out of blocks.

You can check the health of the SSDs by using SMART (smartmontools on Linux), which will show how many blocks have been relocated, and the relative health of the drive as a percentage of the overall reserve capacity.  
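On Linux this data comes from `smartctl -A /dev/sdX` (smartmontools). Here is a minimal parsing sketch; the sample output and attribute names are hypothetical, since vendors report wear attributes differently:

```python
# Illustrative: extract a wear indicator from `smartctl -A` style output.
# SAMPLE_SMARTCTL_OUTPUT is a hypothetical excerpt; real attribute names
# and columns vary by drive vendor.
SAMPLE_SMARTCTL_OUTPUT = """\
177 Wear_Leveling_Count     0x0013   097   097   005    Pre-fail  Always       -       52
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
"""

def normalized_value(attr_name, text):
    """Return the normalized SMART value (100 = fully healthy) for an attribute."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) > 3 and fields[1] == attr_name:
            return int(fields[3])
    return None

print("wear leveling health:", normalized_value("Wear_Leveling_Count",
                                                SAMPLE_SMARTCTL_OUTPUT))
```

A value trending down from 100 toward the threshold column is the drive telling you to plan a replacement before the reserve is exhausted.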

Bonus round: SSD vs 10krpm

In the world of low latency and high IOPS, the answer for HDD manufacturers is to bump up rotation speed of the drives. Unfortunately, while this does make them faster, it also makes them more mechanically complex, more thermally stressed and in a word: expensive.

SSDs are naturally faster and mechanically simple. They also — traditionally at least — were more expensive than the 10krpm disks, which is why storage providers have still been selling NASes and SANs with 10 or 15krpm disks. (I know this from experience, as I used to run high performance environments for a web content provider.)

Now have a look at this:

Device type                  | Cost [US$] | Cost/GB [US$]
HDD 1.8TB SAS 10krpm Seagate | 370        | 0.21
SSD 1.92TB Micron SATA       | 335        | 0.17
NVMe 2.0TB Intel             | 399        | 0.20
HDD 0.9TB Seagate            | 349        | 0.39
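The cost/GB column is straightforward to reproduce from the street prices (same figures as the table above):

```python
# Reproducing the cost/GB column from the street prices in the table.
drives = {
    "HDD 1.8TB SAS 10krpm": (370, 1800),   # (price US$, capacity GB)
    "SSD 1.92TB SATA":      (335, 1920),
    "NVMe 2.0TB":           (399, 2000),
    "HDD 0.9TB":            (349, 900),
}

for name, (price, gb) in drives.items():
    print(f"{name:<22} ${price}  ${price / gb:.2f}/GB")
```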

In other words, 10krpm drives are obsolete not only from the cost/performance ratio, but even from the cost/capacity ratio! The 15krpm drives are even worse. The hard disks in this sector have no redeeming qualities; they are more expensive, drastically slower, more mechanically complex, and cost enormous amounts of money to run.

So why is there so much resistance to moving beyond them? I have heard two main arguments against SSDs:

Lifespan: With today’s wear leveling, this issue has largely evaporated. Yes, it is possible to wear out an SSD, but have a look at the math: a read-optimized SSD is rated for about one Device Write Per Day (DWPD), that is, one write of the whole capacity of the device per day, over 5 years. Let’s compare this with a 1.8TB 10krpm HDD. With a workload that averages out at 70MB/s (a mix of small and large operations) and a 70/30 read/write ratio, this 10krpm HDD writes about 21MB/s, or 1.81TB/day.

In other words, you won’t wear out the SSD under the same conditions within 5 years. If you want to step up to 3 DWPD (mixed use), the drive, at about US$350, still costs less than the HDD, and you will have enough endurance even for very write-heavy workloads.
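The endurance arithmetic, written out (figures from the text; the 1.92TB capacity is assumed for the comparison SSD):

```python
# DWPD arithmetic: does this HDD-class workload wear out a 1 DWPD SSD?
throughput_mb_s = 70        # sustained mixed workload
write_fraction = 0.30       # 70/30 read/write ratio
seconds_per_day = 86_400

writes_tb_per_day = throughput_mb_s * write_fraction * seconds_per_day / 1_000_000
print(f"daily writes: {writes_tb_per_day:.2f} TB")      # ~1.81 TB/day

# A read-optimized 1.92TB SSD at 1 DWPD absorbs 1.92 TB/day for 5 years,
# so this workload sits just inside its endurance budget.
ssd_capacity_tb = 1.92
dwpd_needed = writes_tb_per_day / ssd_capacity_tb
print(f"required DWPD: {dwpd_needed:.2f}")
```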

TCO: It is true that an SSD uses more power as throughput increases; most top out at about 3x the power consumption of a comparable HDD when driven hard. But they will also provide ~10x the throughput and >100x the small-block IOPS of the HDD. If the SSD is ambling along at the sedate pace of a 10krpm HDD, it will actually consume less power than the HDD. And if you stress the SSD’s performance envelope, you would need a large number of HDDs to match a single SSD, which would not even be in the same ballpark in either initial cost or TCO.

In other words, imagine having to put up a whole node with 20 HDDs to match the performance of a single $350 mixed-use SSD that consumes 20W at full tilt. You would have to buy a $4,000 server with twenty $370 HDDs, which would, by the way, consume an average of maybe 300W.
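A back-of-the-envelope power-cost comparison under those wattages (the electricity price is an assumption):

```python
# 5-year electricity cost: single SSD vs the 20-HDD node it replaces.
KWH_PRICE = 0.12            # US$/kWh (assumed electricity price)
HOURS_5Y = 24 * 365 * 5

def energy_cost(watts):
    """5-year electricity cost in US$ for a constant draw of `watts`."""
    return watts / 1000 * HOURS_5Y * KWH_PRICE

ssd_power_w = 20            # one mixed-use SSD at full tilt
hdd_node_w = 300            # 20-HDD server, average draw

print(f"SSD, 5 years:      ${energy_cost(ssd_power_w):,.0f}")
print(f"HDD node, 5 years: ${energy_cost(hdd_node_w):,.0f}")
```

Even before counting the $4,000 server and the drives themselves, the HDD node burns an order of magnitude more power money.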

So as you can see, an SSD is the better deal, even from a purely financial perspective, whether you drive it hard or not.

Of course there are always edge cases. Ask us about your specific use case, and we can do a tailored head-to-head comparison for your specific application.

So what’s next?

We are already nearing the point where NVMe will supersede the SATA or SAS interface in SSDs. So the SSDs, which came out on top when we started this discussion, are already on their way out again.

NVMe has the advantage of being an interface created specifically for flash memory. It does not pretend to be an HDD, as SAS and SATA do, so it does not need to translate an HDD access protocol into a flash memory access protocol internally and translate back on the way out. You can see the difference by looking at the performance envelope of the devices.

New flash memory technologies push the performance envelope and the interface increasingly hampers performance, so the shift from SAS/SATA to NVMe is imminent. NVMe even comes in multiple form factors, with one closely resembling 2.5” HDDs for hotswap purposes, and one (m.2, which resembles a memory module) for internal storage that does not need hot swap capability. Intel’s ruler design and Supermicro’s m.2 carriers will further increase storage density with NVMe devices.

On the horizon, new technologies such as Intel Optane increase performance and wear resilience yet again, though currently at a much higher cost than traditional flash modules.

Maybe a few years from now everything is going to be nonvolatile memory and we can simply cut power to the devices. Either way, we will see further increases in density, performance, and reliability, and further decreases in cost.

Welcome to the future of storage!

Christian Huebner (@ossarchitect on Twitter)

With an MS in Electrical Engineering and a passion for software, Christian was drawn into IT right out of university. Roles as a technical instructor, senior support and systems engineer, developer, and architect followed, almost always in mission-critical Unix or Linux based IT environments. Each role contributed understanding of additional technologies and their impact on key properties such as reliability, performance, scalability, and ease of use.

As principal architect at Mirantis, technology is still at the heart of everything Christian does, but he feels the important part is to understand, manage, and optimize the relationship between business needs and the technologies that best support them. Thus, a good part of his time is spent determining business impact on project design and technology influence on business decisions.
