In a traditional cloud environment, four node types are common: controllers, compute nodes, storage nodes, and network nodes. This affords the design some flexibility, but on the surface it looks more complex than a hyperconverged design, where compute nodes also provide storage and networking services. In other words, in a hyperconverged infrastructure (HCI), we slap Nova Compute, Ceph, and some type of distributed virtual routing all onto a single node.
I will leave the networking piece to a future post, but just looking at storage and compute, you can begin to see where issues can emerge.
Why go Hyperconverged?
The most common reason people choose a hyperconverged infrastructure is the cost and space savings that come from using fewer types of hardware and fewer servers. This idea is supplemented by the notion that ‘just putting some storage on the computes’ shouldn’t make much of a difference in complexity or performance. After all, hard drives are slow and shouldn’t need much in terms of resources, right? Besides, our cloud is not running at 100% anyway, so we are making some of those free resources work for us.
Not so fast…
While this design looks tempting on the surface, there are a number of things to consider.
Scalability is touted as a strength of hyperconverged, and that is true if the required scale-out ratio for storage and compute happens to match the original design expectations. Unfortunately, that is rarely the case. Furthermore, this ratio needs to take into account not just capacity, but also performance on the storage side.
Let’s look at an actual example here. Let’s say we were going to build a cloud with 20 compute nodes with only boot drives, and 10 storage nodes with 20 drives each. If we were to convert this to a hyperconverged infrastructure, we could simply install 10 drives into each compute node instead of adding the storage nodes. If we do this, however, we now are locked into the “10 drives per compute node” ratio.
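As a quick sanity check, the conversion arithmetic can be sketched in a few lines (all numbers are the ones from the example above; nothing else is assumed):

```python
# Original split design: 20 compute nodes (boot drives only),
# plus 10 storage nodes with 20 data drives each.
compute_nodes = 20
storage_nodes = 10
drives_per_storage_node = 20

total_drives = storage_nodes * drives_per_storage_node  # 200 data drives

# Hyperconverged conversion: the same drives, spread across the compute nodes.
drives_per_compute_node = total_drives // compute_nodes

print(f"{drives_per_compute_node} drives per compute node")  # -> 10
# Every compute node added from now on implicitly carries
# 10 drives' worth of storage -- that ratio is now baked in.
```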
If it turns out that the storage capacity is sufficient, but we need to double our compute capacity, we have two choices. We can stick with HCI, but are going to end up with a lot of extra storage capacity. On the other hand, we can always ditch the HCI paradigm and add 20 more compute nodes.
Congratulations, we have just opened an entirely new can of worms by adding dissimilar infrastructure nodes.
Let’s make it even more fun: a new project comes along, which all of a sudden needs our storage scaled out to 4x capacity. Now we have the rather unappealing options of:
- Adding drives to the 20 non-converged nodes we added earlier, plus another 40 nodes of unneeded compute capacity, to satisfy the storage requirement and end up with an HCI design again. A rather costly option, and it requires reconfiguring existing and active compute nodes.
- Adding drives to the 20 compute nodes we added earlier, and adding 20 storage-only nodes. Now our HCI design is broken the other way, with standalone storage nodes.
- Adding 30 storage-only nodes, thus breaking the HCI design even more, as we now have compute-storage, compute-only, and storage-only nodes. In other words, in this case we’d have been better off with separate compute and storage in the first place.
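To make the trade-off concrete, here is the same scenario worked through in a few lines of Python (only the node and drive counts from the example above are used):

```python
# After doubling compute the non-HCI way: 40 compute nodes, but only the
# original 20 carry drives (10 each), so 200 data drives total.
current_drives = 20 * 10
drives_needed = 4 * current_drives              # project wants 4x capacity

# Option 1: stay pure HCI at 10 drives per compute node.
hci_nodes_total = drives_needed // 10           # 80 converged nodes in total
extra_compute_nodes = hci_nodes_total - 40      # nodes of unneeded compute

# Option 2: retrofit the 20 newer computes to 10 drives each,
# then cover the rest with 20-drive storage-only nodes.
shortfall = drives_needed - 40 * 10             # drives still missing
storage_only_opt2 = shortfall // 20

# Option 3: leave the computes alone; storage-only nodes for everything.
storage_only_opt3 = (drives_needed - current_drives) // 20

print(extra_compute_nodes, storage_only_opt2, storage_only_opt3)  # -> 40 20 30
```

With real hardware the arithmetic is messier, but the pattern is the same: each option either buys capacity nobody asked for or abandons the uniform-node premise.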
Of course, your environment might grow just the right way. Or it might not grow at all. Or you might migrate off it before you ever have to scale out. Either way, I recommend considering this factor very thoroughly before committing to HCI.
Then there is cost. The argument boils down to “HCI uses fewer servers, so it must be cheaper.”
Again, not so fast.
In order to make HCI work, you must dedicate additional resources on each node to the storage infrastructure. This means you have less compute capacity, which you can mitigate either by adding more compute nodes, or by adding more CPU and memory to the existing nodes.
For example, say your cloud with 20 compute nodes is designed for 400 instances. Now you are adding disks to these nodes, and the storage services eat up 20% of the compute capacity of each node. To get back to 400 instances, the remaining 80% has to grow by a quarter, so you either spec processors with 25% more cores or add 25% more nodes.
As you were diligent and specified CPUs with the best cost/performance ratio, stepping up to ‘hotter’ CPUs is going to cost quite a bit more than 25% extra. Adding 25% more compute nodes also comes with added cost. In many cases, storage chassis with low-spec CPUs turn out cheaper than either option.
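As a sketch of that comparison (the prices below are made-up placeholders for illustration, not real quotes; the ranking depends entirely on the numbers your vendor gives you):

```python
# Made-up list prices -- placeholders only, not real quotes.
BASE_COMPUTE_NODE = 10_000      # node with the cost-optimal CPU SKU
HOT_CPU_PREMIUM_PCT = 35        # jumping a CPU tier rarely costs a linear premium
STORAGE_CHASSIS = 3_500         # storage box with a low-spec CPU

nodes = 20
lost_pct = 20                   # compute capacity eaten by storage duties

# Losing 20% means the remaining 80% must grow by a quarter.
nodes_needed = nodes * 100 // (100 - lost_pct)          # 25 nodes

# Option A: same node count, hotter CPUs to win the capacity back.
cost_hot_cpus = nodes * BASE_COMPUTE_NODE * (100 + HOT_CPU_PREMIUM_PCT) // 100

# Option B: more nodes of the original, cost-optimal spec.
cost_more_nodes = nodes_needed * BASE_COMPUTE_NODE

# Option C: no HCI -- 20 plain compute nodes plus 10 dedicated storage chassis.
cost_separate = nodes * BASE_COMPUTE_NODE + 10 * STORAGE_CHASSIS

print(cost_hot_cpus, cost_more_nodes, cost_separate)    # -> 270000 250000 235000
```

With these placeholder prices, the separate-storage option comes out cheapest, but the point is the method, not the verdict: run the model with your own quotes before deciding.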
Now imagine you are operating more than one cloud. Or a cloud and a container environment. Or baremetal. Having a hyperconverged infrastructure won’t stop you from sharing the storage infrastructure of your existing cloud with the newcomers, but you are adding a layer of vulnerability: if a compute node with added storage (and accessibility from the outside) is compromised, so is that storage, even if it is being used by a separate cloud. This doesn’t happen in a traditional design, where the separate storage nodes would have to be compromised specifically.
Also, using storage inside of one cloud to provide resources to another cloud is not a very clean or easy-to-operate design. You can achieve the same goal using a separate storage network and a storage cluster that is outside of all environments with proper separation in order to implement a clean design.
When to go Hyperconverged?
That’s not to say that an HCI environment is never appropriate. For example, some situations where you’d want to seriously consider hyperconverged infrastructure include those where:
- You’re subject to space constraints, especially in satellite locations.
- Given your specific requirements, HCI actually does turn out to be cheaper, and you can live with the scalability and flexibility drawbacks.
- You only need a very small storage cluster, and the number of storage nodes required to build a stable Ceph cluster would significantly add to the cost and would far exceed the storage capacity required.
So … what do we learn from all this?
The most important thing to remember is that you need to examine your use cases closely. Don’t fall for hype, but don’t reject hyperconverged infrastructure just because it is new, either. Make comparable models, ensure you understand the implications, and select the appropriate design.
In other words, resist the pressure from outside telling you to do one or the other ‘because it is clearly the better way.’