For years the old paradigm has held true: if you want fast, buy flash, but it is going to cost you; if you want cheap, large HDD-based servers are the go-to. The old standby for reasonably fast block storage, the 2TBx24 chassis, was ubiquitous. For years, it looked like flash would be relegated to performance tiers. But is this actually true? I’ve been suspicious for some time, but a few days ago I did a comparison for an internal project, and what I saw surprised even me.
Flash storage technology and comparative cost
Flash storage has developed at a breakneck pace in recent years. Not only have devices become more resilient and faster, but there is also a new interface to consider. Traditional SSDs are SATA, or in relatively rare cases, SAS based. This limits the performance envelope of the devices severely. SATA SSDs top out at about 550MB/s maximum throughput, and offer around 50k small file input/output operations per second (IOPS) regardless of the speed of the actual chips inside the device.
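The ~550MB/s ceiling follows directly from the SATA 3 link itself. A minimal back-of-envelope sketch (the encoding efficiency figure comes from SATA's 8b/10b line encoding):

```python
# Back-of-envelope: where the ~550 MB/s SATA SSD ceiling comes from.
SATA3_LINK_GBPS = 6.0         # SATA 3.x line rate, gigabits per second
ENCODING_EFFICIENCY = 8 / 10  # 8b/10b encoding: 10 line bits carry 8 data bits

payload_mb_s = SATA3_LINK_GBPS * 1e9 * ENCODING_EFFICIENCY / 8 / 1e6
print(f"SATA 3 theoretical payload ceiling: ~{payload_mb_s:.0f} MB/s")  # ~600 MB/s
# Protocol overhead shaves off a bit more, leaving the familiar ~550 MB/s.
```

No matter how fast the flash chips are, the device cannot exceed this bus limit.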
This limitation is due to the data transfer speed of the bus and the need to translate the storage access request to a disk based protocol (SATA/SAS) and, inside the SSD, back to a memory protocol. The same thing happens on the way out when data is being read.
Enter Non-Volatile Memory express (NVMe). This ‘interface’ is essentially a direct connection of the flash storage to PCIe lanes. A configuration of 4 lanes per NVMe is common, though technology exists to multiplex NVMes so more devices can be attached than there are PCIe lanes available.
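The bandwidth headroom of that 4-lane attachment is easy to estimate. The per-lane figure below assumes PCIe 3.0 and is the commonly cited usable throughput after 128b/130b encoding:

```python
# Rough bandwidth budget for an NVMe device attached via four PCIe 3.0 lanes.
PCIE3_LANE_MB_S = 985  # ~usable MB/s per PCIe 3.0 lane (after 128b/130b encoding)
LANES = 4

print(f"PCIe 3.0 x{LANES}: ~{PCIE3_LANE_MB_S * LANES / 1000:.1f} GB/s")  # ~3.9 GB/s
```

Compare that ~3.9GB/s budget with SATA's ~600MB/s: the interface is no longer the bottleneck, the device is.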
NVMe devices typically top out above 2GB/s, and can offer several hundred thousand IOPS – theoretically. They also consume a lot more CPU when operating in a software defined storage environment, which limits performance somewhat. However, in practical application they are still much faster than traditional SSDs – at what is usually a very moderate cost delta.
If the performance of SATA SSDs is insufficient for a specific use case, moving to SAS SSDs is usually not worth the expense: NVMe devices offer much better performance and are usually no more expensive than their SAS counterparts, so moving directly to NVMe is preferable.
One more note: If NVMes operate with the same number of CPU cores as SSDs, they are still somewhat faster and very comparable financially. The calculations below are designed to include more CPU cores for NVMe for performance applications.
Let’s look at how the numbers work out for different situations.
Let’s have a look at a 100TB environment with increasing performance requirements. Consider the following table that looks at HDDs, SSDs, and NVMe. Street prices are in US$x1000, and IOPS are rough estimates:
| 100TB | HDD 6TB 12/4U cost [x1000 US$] | HDD 2TB 20/2U cost [x1000 US$] | SSD config | SSD cost [x1000 US$] | NVMe config | NVMe cost [x1000 US$] |
|---|---|---|---|---|---|---|
| 10k IOPS | 132 | 135 | 5x 10x 7.68TB | 91 | 5x 4x 15.36TB | 102 |
| 30k IOPS | 345 | 271 | 5x 10x 7.68TB | 91 | 5x 4x 15.36TB | 102 |
| 50k IOPS | 559 | 441 | 5x 10x 7.68TB | 91 | 5x 4x 15.36TB | 102 |
| 100k IOPS | 1,117 | 883 | 5x 10x 7.68TB | 91 | 5x 4x 15.36TB | 102 |
| 200k IOPS | 2,206 | 1,767 | 7x 14x 3.84TB | 113 | 5x 10x 7.68TB | 113 |
| 500k IOPS | 5,530 | 4,419 | 14x 14x 1.92TB | 168 | 7x 14x 3.84TB | 133 |
| 1000k IOPS | 11,034 | 8,804 | 42x 14x 1.92TB | 470 | 13x 14x 2TB | 168 |
In this relatively small cluster, as expected, HDDs are no longer viable. The more IOPS required, the more excess capacity must be purchased just to provide enough spindles. This culminates in a completely absurd $11 million for a 1000k IOPS cluster built on 6TB hard disks.
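The spindle effect is worth spelling out. A rough sketch, assuming ~150 random IOPS per 7,200rpm spindle (a common rule of thumb; replication and other overheads are ignored for simplicity):

```python
# Why HDD clusters get absurd at high IOPS: you end up buying spindles, not capacity.
IOPS_PER_SPINDLE = 150  # assumed random IOPS per 7,200rpm drive
DRIVE_TB = 6
NEEDED_TB = 100

for target_iops in (10_000, 100_000, 1_000_000):
    spindles = -(-target_iops // IOPS_PER_SPINDLE)  # ceiling division
    print(f"{target_iops:>9,} IOPS -> {spindles:>5,} x {DRIVE_TB}TB drives "
          f"= {spindles * DRIVE_TB:,} TB raw (only {NEEDED_TB} TB needed)")
```

At 1000k IOPS you would be buying thousands of drives, and tens of petabytes of raw capacity, just to get the spindle count.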
Middle of the Road
Of course, we all know that larger amounts of SSD storage are more expensive, so let’s quadruple storage requirements and see where we get. HDDs should become more viable, wouldn’t you think?
| 400TB | HDD 6TB 12/4U cost [x1000 US$] | HDD 2TB 20/2U cost [x1000 US$] | SSD config | SSD cost [x1000 US$] | NVMe config | NVMe cost [x1000 US$] |
|---|---|---|---|---|---|---|
| 10k IOPS | 250 | 510 | 14x 14x 7.68TB | 326 | 7x 14x 15.36TB | 348 |
| 30k IOPS | 405 | 510 | 14x 14x 7.68TB | 326 | 7x 14x 15.36TB | 348 |
| 50k IOPS | 655 | 510 | 14x 14x 7.68TB | 326 | 7x 14x 15.36TB | 348 |
| 100k IOPS | 1,311 | 883 | 14x 14x 7.68TB | 326 | 7x 14x 15.36TB | 348 |
| 200k IOPS | 2,593 | 1,767 | 14x 14x 7.68TB | 326 | 7x 14x 15.36TB | 348 |
| 500k IOPS | 6,495 | 4,419 | 14x 14x 7.68TB | 326 | 7x 14x 15.36TB | 348 |
| 1000k IOPS | 12,961 | 8,804 | 27x 14x 3.84TB | 437 | 14x 14x 7.68TB | 413 |
Surprise! Again we find that HDD is only viable for the slower speed requirements of archival storage. Note that the 15.36TB NVMes are not much more expensive than the SSD solution!
A note about chassis: To get good performance out of NVMe devices, a lot more CPU cores are needed than in HDD based solutions. Four OSDs per NVMe and 2 cores per OSD are a rule of thumb. This means that stuffing 24 NVMes into a 2U chassis and calling it a day is not going to provide exceptional performance. We recommend 1U chassis with 5-8 NVMe devices to reduce bottlenecking on the OSD code itself. (I’m also assuming that the network connectivity is up to transporting the enormous amount of data traffic.)
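Applying the rule of thumb above makes the chassis recommendation concrete:

```python
# CPU sizing from the rule of thumb: 4 OSDs per NVMe device, 2 cores per OSD.
OSDS_PER_NVME = 4
CORES_PER_OSD = 2

for nvmes_per_node in (5, 8, 24):
    cores = nvmes_per_node * OSDS_PER_NVME * CORES_PER_OSD
    print(f"{nvmes_per_node:>2} NVMe devices/node -> {cores:>3} cores wanted")
```

A 5-8 device 1U node wants 40-64 cores, which is attainable; a 24-device 2U node would want 192 cores, which is why dense chassis end up bottlenecked on the OSD code.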
If we enter petabyte scale, hard disks become slightly more viable, but at this scale (we are talking 64 4U nodes) the sheer physical size of the hard disk based cluster can become a problem:
| 1PB | HDD 6TB 12/4U cost [x1000 US$] | HDD 2TB 20/2U cost [x1000 US$] | SSD config | SSD cost [x1000 US$] | NVMe config | NVMe cost [x1000 US$] |
|---|---|---|---|---|---|---|
| 10k IOPS | 453 | 1,257 | 34x 14x 7.68TB | 789 | 17x 14x 15.36TB | 871 |
| 30k IOPS | 453 | 1,257 | 34x 14x 7.68TB | 789 | 17x 14x 15.36TB | 871 |
| 50k IOPS | 488 | 1,257 | 34x 14x 7.68TB | 789 | 17x 14x 15.36TB | 871 |
| 100k IOPS | 1,850 | 1,257 | 34x 14x 7.68TB | 789 | 17x 14x 15.36TB | 871 |
| 200k IOPS | 2,101 | 1,767 | 34x 14x 7.68TB | 789 | 17x 14x 15.36TB | 871 |
| 500k IOPS | 4,619 | 4,365 | 34x 14x 7.68TB | 789 | 17x 14x 15.36TB | 871 |
| 1000k IOPS | 10,465 | 8,720 | 34x 14x 7.68TB | 789 | 17x 14x 15.36TB | 871 |
Note: The performance data for SSD and NVMe OSDs is estimated conservatively. Depending on the use case performance will vary.
So what do we learn from all this?
The days of HDD are numbered.
For most use cases, SSD is already superior today. SSD and NVMe prices are also still nosediving in terms of cost per unit. SSD/NVMe-based nodes also make for much more compact installations and are a lot less vulnerable to vibration, dust, and heat.
The health question
Of course, cost isn’t the only issue. SSDs do wear. The current crop is far more resilient over the long term than SSDs from a couple of years ago, but they will still eventually wear out. On the other hand, unlike HDDs, SSDs are not prone to sudden catastrophic failure triggered by a mechanical event or marginal manufacturing tolerances.
The good news, then, is that in almost all cases SSDs do not fail suddenly. They develop bad blocks, which for a time are replaced with fresh blocks from an invisible capacity reserve on the device; wear leveling handles all of this automatically. You will not see capacity degradation until that reserve runs out of blocks.
You can check the health of your SSDs using SMART (smartmontools on Linux), which will show how many blocks have been relocated and the relative health of the drive as a percentage of the overall reserve capacity.
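For example, with smartmontools installed the checks look roughly like this (device paths are placeholders; SATA attribute names vary by vendor):

```shell
# SATA/SAS SSD: dump the SMART attribute table; look for vendor attributes
# reporting reallocated blocks and remaining reserve/lifetime.
sudo smartctl -A /dev/sda

# NVMe: the standard health log reports wear directly,
# e.g. "Percentage Used" and "Available Spare".
sudo smartctl -a /dev/nvme0
```

Monitoring these values over time gives you plenty of warning before the reserve is exhausted.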
Bonus round: SSD vs 10krpm
In the world of low latency and high IOPS, the answer for HDD manufacturers is to bump up rotation speed of the drives. Unfortunately, while this does make them faster, it also makes them more mechanically complex, more thermally stressed and in a word: expensive.
SSDs are naturally faster and mechanically simple. They also — traditionally at least — were more expensive than the 10krpm disks, which is why storage providers have still been selling NASes and SANs with 10 or 15krpm disks. (I know this from experience, as I used to run high performance environments for a web content provider.)
Now have a look at this:
| Device type | Cost [US$] | Cost/GB [US$] |
|---|---|---|
| HDD 1.8TB SAS 10krpm Seagate | 370 | 0.21 |
| SSD 1.92TB Micron SATA | 335 | 0.17 |
| NVMe 2.0TB Intel | 399 | 0.20 |
| HDD 0.9TB Seagate | 349 | 0.39 |
In other words, 10krpm drives are obsolete not only on cost/performance, but even on cost/capacity! The 15krpm drives are even worse. The hard disks in this sector have no redeeming qualities: they cost more to buy, are drastically slower, are more mechanically complex, and are enormously expensive to run.
So why is there so much resistance to moving beyond them? I have heard two main arguments against SSDs:
Lifespan: With today’s wear leveling, this issue has largely evaporated. Yes, it is possible to wear out an SSD, but have a look at the math: a read-optimized SSD is rated for about one Device Write Per Day (DWPD), that is, one write of the whole capacity of the device per day, over 5 years. Let’s compare this with a 1.8TB 10krpm HDD. With a workload that averages out at 70MB/s (a mix of small and large requests) at a 70/30 read/write ratio, this 10krpm HDD writes about 1.81TB/day.
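The arithmetic, spelled out for a 1.92TB read-optimized SSD rated at 1 DWPD:

```python
# Endurance check: 70 MB/s average throughput, 70/30 read/write ratio,
# versus a 1.92TB SSD rated for 1 DWPD over 5 years.
SSD_TB = 1.92
avg_mb_s = 70
write_share = 0.30
SECONDS_PER_DAY = 86_400

written_tb_per_day = avg_mb_s * write_share * SECONDS_PER_DAY / 1e6
print(f"{written_tb_per_day:.2f} TB written per day")   # ~1.81 TB/day
print(f"Rated budget: {SSD_TB:.2f} TB/day (1 DWPD)")    # just above the workload
```

Even this fairly write-heavy workload stays just inside the 1 DWPD budget of the cheapest endurance class.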
In other words, you won’t wear out the SSD under the same conditions within 5 years. If you want to step up to 3x DWPD (mixed use), the SSD (about US$350) still costs less than the HDD, and you will have enough resilience even for very write-heavy workloads.
TCO: It is true that an SSD uses more power as throughput increases. Most top out at about 3x the power consumption of a comparable HDD when driven hard, but they also provide ~10x the throughput and >100x the small-block IOPS of the HDD. If the SSD is ambling along at the sedate pace of a 10krpm HDD, it will consume less power than the HDD. If you stress the SSD’s performance envelope, you would need a large number of HDDs to match that single SSD, which would not even be in the same ballpark in either initial cost or TCO.
In other words, imagine having to put up a whole node with 20 HDDs to match the performance of this single $350 mixed use SSD that consumes 20W at full tilt operation. You would have to buy a $4000 server with 20 $370 HDDs — which would, by the way, consume an average of maybe 300W.
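A quick energy comparison for that scenario (the $0.12/kWh electricity price is an assumption for illustration):

```python
# Annual energy use: one mixed-use SSD at full load (~20W)
# versus a 20-HDD node averaging ~300W.
KWH_PRICE_USD = 0.12  # assumed electricity price
HOURS_PER_YEAR = 24 * 365

for name, watts in (("single SSD, full tilt", 20), ("20-HDD node", 300)):
    kwh_year = watts * HOURS_PER_YEAR / 1000
    print(f"{name}: {kwh_year:.0f} kWh/yr, ~${kwh_year * KWH_PRICE_USD:.0f}/yr")
```

The power bill alone runs an order of magnitude apart, before you even count the $4000 server and the 20 drives.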
So as you can see, an SSD is the better deal, even from a purely financial perspective, whether you drive it hard or not.
Of course there are always edge cases. Ask us about your specific use case, and we can do a tailored head-to-head comparison for your specific application.
So what’s next?
We are already nearing the point where NVMe will supersede the SATA or SAS interface in SSDs. So the SSDs, which came out on top when we started this discussion, are already on their way out again.
NVMe has the advantage of being an interface created specifically for flash memory. It does not pretend to be an HDD, as SAS and SATA do, so it does not need to translate the HDD access protocol into a flash memory access protocol internally and translate back on the way out. You can see the difference by looking at the performance envelope of the devices.
New flash memory technologies keep pushing the performance envelope, and the interface increasingly hampers performance, so the shift from SAS/SATA to NVMe is imminent. NVMe even comes in multiple form factors: one closely resembling 2.5” HDDs for hot-swap purposes, and one (M.2, which resembles a memory module) for internal storage that does not need hot-swap capability. Intel’s ruler design and Supermicro’s M.2 carriers will further increase storage density with NVMe devices.
On the horizon, new technologies such as Intel Optane increase performance and resilience to wear yet again, though currently at a much higher cost than traditional flash.
Maybe a few years from now everything will be nonvolatile memory and we can simply cut power to the devices. Either way, we will see further increases in density, performance, and reliability, and further decreases in cost.
Welcome to the future of storage!