
Dive into Liquid Cooling for the GPU Era


Keeping Cool: How it Became a Data Center Challenge

Data centers have two fundamental problems: how to get power to all the equipment, and how to dispose of the heat that power is inevitably converted into.

In older data centers, power availability was often the bottleneck. At a former employer, we had to buy twice the square footage we needed in order to get the amount of power we required for our server farm. While it is nice to have ample space to maneuver around hardware, expanding facility space just to access more power is not the best use of data center square footage.

Over time, power distribution improved, which brought up a new problem: with denser rows of racks, far more heat was generated. The bottleneck shifted to the other end, to cooling, especially in regular open-air cages. We moved from open air to hot aisle - cold aisle layouts, which helped remove the bottleneck. Mind you, this was for regular servers that drew a few hundred watts per machine.


Fast forward to today. CPUs have improved tremendously over the last five years, with higher core counts and faster clock speeds, but the tradeoff is that a single modern CPU can draw as much power as a whole server of old. And that is nothing compared to the massive power consumption of the GPUs being rapidly adopted worldwide for artificial intelligence. A single server stuffed with GPUs and two high performance CPUs can consume most of the output of a 10kW power distribution unit (PDU) all by itself. This poses an entirely new density problem: you are blowing your entire “rack power budget” on a single server! So your options become: (1) live with a single server (or perhaps two) in a rack, or (2) increase power density to each rack, which leaves the question of how to get the enormous amount of heat from these space heaters out of the data center.
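To put numbers on the density problem, here is a minimal back-of-the-envelope sketch. All wattages are illustrative assumptions rather than the specifications of any particular hardware:

```python
# Back-of-the-envelope rack power budget.
# All wattages are illustrative assumptions, not vendor specifications.

PDU_CAPACITY_W = 10_000     # one 10kW PDU serving the rack

gpu_count = 8
gpu_tdp_w = 700             # assumed TDP of a high-end accelerator
cpu_count = 2
cpu_tdp_w = 350             # assumed TDP of a high performance CPU
overhead_w = 1_500          # memory, NICs, fans, PSU losses (assumed)

server_draw_w = gpu_count * gpu_tdp_w + cpu_count * cpu_tdp_w + overhead_w
print(f"One GPU server draws ~{server_draw_w / 1000:.1f} kW")
print(f"Servers per 10kW rack: {PDU_CAPACITY_W // server_draw_w}")
# ~7.8 kW per server: a single machine consumes most of the rack budget.
```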

Methods for Cooling Servers

Air cooling

The traditional way to cool servers is inexpensive and not particularly complicated, but also not particularly efficient. Accompanied by the howling of arrays of 40mm fans, hot air is pumped out of the servers, ideally aided by shrouds that direct the airflow over the hottest components.

Even once the hot air is pumped out of the server, we are still nowhere near a solution. The traditional approach of simply pumping room air through the heat exchangers of air conditioners is quite inefficient. The aforementioned hot aisle - cold aisle concept improves the situation somewhat by drawing hotter air from the hot aisles through the heat exchanger and pumping the cooled air into the cold aisles. Since the air throughput of the air conditioners does not change much under this principle, more energy is dissipated, because the hotter air carries more thermal energy per volume.
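The physics behind this is the sensible-heat relation Q = ṁ × c_p × ΔT: at a fixed airflow, the heat an air stream can carry scales linearly with the temperature difference across the heat exchanger. A minimal sketch using rounded textbook properties of air (the airflow figure is an assumption for illustration):

```python
# Sensible heat carried by an air stream: Q = m_dot * c_p * dT.
# Rounded textbook properties of air near room conditions.
AIR_DENSITY = 1.2   # kg/m^3
AIR_CP = 1005       # J/(kg*K)

def heat_removed_w(airflow_m3_s: float, delta_t_k: float) -> float:
    """Heat (in W) an air stream carries for a given temperature rise."""
    mass_flow = airflow_m3_s * AIR_DENSITY      # kg/s
    return mass_flow * AIR_CP * delta_t_k

airflow = 0.5                     # m^3/s through the heat exchanger (assumed)
for dt in (5, 15):                # mixed room air vs. hot aisle return air
    print(f"dT = {dt:2d} K -> {heat_removed_w(airflow, dt) / 1000:.1f} kW")
# Tripling the temperature difference triples the heat removed at the
# same airflow, which is why separating the aisles pays off.
```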

Liquid cooling

Most data centers are already liquid cooled in the sense that the air conditioners use liquid refrigerant and indoor heat exchangers to cool the room air. The servers, of course, are air cooled, with fans blowing air through the server front to back. The hot air rises by convection on the other side, and the air conditioners then use yet another air-to-metal-to-liquid interface to carry the heat outside.

The biggest drawback of this system is that air does not conduct heat well. The CPU uses a thermal interface material to transfer heat to the heat sink, but the heat sink can only dissipate a certain amount of heat to the air; if more heat enters the metal of the heat sink than the air can carry away, the temperature rises in the heat sink and thus also in the CPU die. Furthermore, there are limits to how much heat can be dissipated: beyond a certain point (the thermal saturation point), increasing the air throughput no longer proportionally increases the amount of heat dissipated.

Conversely, a liquid serving as the interface between the metal of the heat sink and the outside coolant loop can carry many times more heat than air can at its thermal saturation point.
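The difference is easy to quantify with volumetric heat capacity (density times specific heat): per degree of temperature rise, a given volume of water absorbs thousands of times more heat than the same volume of air. A quick comparison using rounded textbook values:

```python
# Volumetric heat capacity (J per m^3 per K) = density * specific heat.
# Rounded textbook values near room temperature.
air_vhc = 1.2 * 1005     # air:   ~1.2 kJ/(m^3*K)
water_vhc = 997 * 4186   # water: ~4,170 kJ/(m^3*K)

print(f"Air:   {air_vhc / 1000:7.1f} kJ/(m^3*K)")
print(f"Water: {water_vhc / 1000:7.1f} kJ/(m^3*K)")
print(f"Water carries ~{water_vhc / air_vhc:,.0f}x more heat per unit volume")
# ~3,460x: a modest coolant flow replaces a gale of fan-driven air.
```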

As with air cooling, there are multiple ways to cool servers with liquid. 

  • The simplest way is to keep using air cooled servers, but to move the cooling closer to the servers via liquid cooled in-row cooling units that can more effectively cool the hot air from the hot aisle, and ferry away the heat through coolant to the outside of the data center. This helps to lower the intake temperatures and improve the cooling of the servers. This works with unmodified air cooled servers and brings cooling to the row level.

  • Rear Door Heat Exchangers (RDHx) cool the hot air at the rear door of the rack by transferring the heat to coolant, which is provided by a coolant distribution unit (CDU), so the ambient air is not heated up. The coolant distribution unit contains a liquid-to-liquid heat exchanger, transferring the heat to another coolant circuit that can be connected to existing liquid cooling or to its own cooling towers or heat exchangers outside. This still works with unmodified servers, but brings the cooling to the individual rack level.

  • For “Direct to Chip” cooling, the heat sinks on the targeted semiconductors are replaced by liquid cooling heat exchangers (cold plates). External coolers pump the coolant through these heat exchangers. This system also works with CDUs, with the coolant pumped directly through the heat exchangers on the semiconductors to be cooled. As air is eliminated from the equation, this system no longer depends on ambient temperature or on airflow inside the data center. All servers to be cooled this way must be modified or prebuilt for liquid cooling, with a liquid cooling heat sink on each of the chips to be cooled. (A rough flow-rate sketch follows this list.)

  • Immersion cooling works by placing the entire server into a non-conductive liquid. Typically, large tanks are used that hold a large number of servers. The tanks cannot be stacked vertically like racks, as the servers near the top would be cooled much less than the ones at the bottom. Instead, the servers are installed vertically in the tanks.
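Returning to direct-to-chip cooling: the coolant flow a cold plate needs follows from the same sensible-heat relation as before, now with liquid properties. A rough sizing sketch, assuming a water-based coolant and an allowed 10 K temperature rise across the cold plate (both assumptions; real loops use vendor-specified coolants and limits):

```python
# Coolant flow needed to absorb a heat load: m_dot = Q / (c_p * dT).
# Assumes a water-like coolant and a 10 K allowed temperature rise
# across the cold plate -- both illustrative assumptions.
WATER_CP = 4186        # J/(kg*K)
WATER_DENSITY = 997    # kg/m^3

def flow_l_per_min(heat_w: float, delta_t_k: float = 10.0) -> float:
    """Volumetric coolant flow (L/min) needed to absorb heat_w watts."""
    mass_flow = heat_w / (WATER_CP * delta_t_k)     # kg/s
    return mass_flow / WATER_DENSITY * 1000 * 60    # m^3/s -> L/min

for load_w in (350, 700, 7800):   # CPU, GPU, whole GPU server (assumed)
    print(f"{load_w:5d} W -> {flow_l_per_min(load_w):5.2f} L/min")
# Even a ~7.8 kW server needs only about 11 L/min of water-based coolant.
```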

Comparison of Data Center Cooling Methods: Strengths and Weaknesses

Each of the systems listed has advantages and disadvantages, which must be taken into consideration when architecting a cooling system. 

| Method | Cooling Capacity | Initial Cost | Timeframe | Operations |
|---|---|---|---|---|
| Standard air cooling | Lowest | - | - | No extra effort |
| Hot Aisle - Cold Aisle | Low: lower intake temperature, better air circulation | Low: curtains to separate aisles, some reracking, air conditioning duct work | Days to weeks | No extra effort |
| In-Row | Medium: considerably lower intake temperature | Medium: installation of cooling units plus Hot Aisle - Cold Aisle | Days to weeks | Maintenance of CDU and fan wall |
| RDHx | Medium-High: heat is transferred outside more effectively | Medium-High: each rack must be upgraded with an RDHx; installation of CDU | Weeks | Maintenance of CDU and RDHx |
| Direct-to-Chip | High: heat passes directly from silicon to metal to coolant | High: every server must be equipped with liquid cooling heat sinks and coolant lines; installation of CDU | Multiple weeks | Server maintenance complicated by coolant lines and heat sinks; maintenance of CDU |
| Immersion | Highest: all components of the server are cooled | High: usually requires construction work on the data center; racks must be replaced with tanks | Months | Complex: servers must be removed from the tank, drained, and cleaned before maintenance |

However, in many cases it makes sense to combine cooling methodologies. Different technologies can be employed for different use cases: for example, Hot Aisle - Cold Aisle for regular compute nodes and immersion for GPU (AI/ML) nodes.

Furthermore, the cost must be seen in context. Immersion cooling is expensive, but so are the resources it cools, so cooling's share of the total cost may not be excessive, and it can be offset by the higher density and thus lower OpEx over time.

Architecture Considerations for Data Center Cooling

There are many factors to consider when selecting the right power and coolant technologies for your data center. 

  • Construction of the current data center - For instance, immersion tanks are extremely heavy because they are filled entirely with a dielectric liquid. The floor load limits of the data center must be taken into account, so the tank won't end up in the basement. (A rough weight estimate follows this list.)

  • The relationship between power availability and cooling - It's useless to improve cooling if there isn't enough power available to make use of the larger cooling envelope. Both factors must be increased proportionally before you can raise the density of compute resources.

  • Existing vs. new hardware - Direct-to-chip cooling, for instance, requires servers that are equipped with the appropriate heat sinks and coolant lines. With existing hardware, the modifications are inordinately expensive if available at all, so in most cases it’s better to choose a different technology.

  • The scale of the deployment - If the whole installation consists of only a few servers, even if these servers are power hungry and require considerable cooling, it may make more sense to lease more floor space and live with the low density than to spend a lot of money on improving the density.

  • Complexity - It's considerably more time consuming to maintain servers immersed in coolant, and it requires additional skills that may not be available, especially if you're leasing space in a data center and relying on its staff for operations.

  • Separation from other customers - If only a few servers are to be immersion cooled, you may still need to lease a whole immersion tank, as it may not be legally possible to intermingle your servers with those of other colocation customers, for security reasons.
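To illustrate the floor-load point above, here is a rough weight estimate for an immersion tank. Fluid volume, fluid density, hardware weight, and tank footprint are all assumptions for illustration; use vendor data sheets and your facility's rated floor load for real planning:

```python
# Rough floor load of an immersion tank.
# All figures are illustrative assumptions, not vendor data.
fluid_volume_m3 = 1.5       # dielectric fluid in the tank (assumed)
fluid_density = 850         # kg/m^3, typical of mineral-oil-type fluids
tank_and_servers_kg = 900   # empty tank plus immersed hardware (assumed)
footprint_m2 = 1.2 * 0.8    # tank footprint (assumed)

total_kg = fluid_volume_m3 * fluid_density + tank_and_servers_kg
print(f"Total weight: ~{total_kg:.0f} kg")
print(f"Floor load:   ~{total_kg / footprint_m2:.0f} kg/m^2")
# Well over 2,000 kg/m^2: more than many raised floors are rated for,
# so the structural check has to come before the purchase order.
```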

Conclusion

In the past, racks were forklifted into rooms, and the room's air conditioning had to suffice to cool the systems. Since then, workloads have evolved to require increasingly power hungry (and thus heat producing) equipment, which makes cooling a major factor in maintaining or improving server density and thus conserving costly data center space.

Liquid cooling has opened an avenue to keep up with the ever increasing heat production of compute equipment, especially in the AI and High Performance Computing (HPC) field, but like most innovations, it also comes at a cost.

More data centers will offer liquid cooling, or make it the default, in the near future, retaining air cooling mainly for smaller-scale, edge, or niche use cases. So it's time to ready yourself for the shift to liquid cooling, particularly if you are scaling out, adding AI/ML to your corporate portfolio, or building a new platform. If you need expert guidance on your data center cooling systems, Mirantis architects are available to discuss your plans and offer advice. Contact us to schedule a call.

Additional Reading

Note: The articles listed below are not recommendations or endorsements by Mirantis, but may serve as helpful resources for further research into cooling options for your upcoming project.

Hot Aisle - Cold Aisle design

Open Compute Project: Door Heat Exchanger

Lawrence Berkeley National Laboratory on RDHx

Direct to Chip Cooling by DataCenter Knowledge

Liquid Cooling Options by Vertiv

Christian Huebner

Christian Huebner is Director of Architecture Services at Mirantis, with a focus on AI infrastructure and storage. Coming from conventional storage architecture, Christian moved into cloud storage before joining Mirantis, and later into general cloud architecture. He is currently spearheading AI infrastructure architecture projects for Mirantis customers, with a focus on providing reference architectures and technical assistance for a wide range of AI infrastructure technologies. In addition to AI infrastructure and storage, Christian provides architectural guidance, implementation consulting, and subject matter expertise for a wide variety of customer OpenStack cloud projects.
