The shift towards Liquid Cooling in data centres - the necessity for AI and sustainability

The world of data centres is evolving rapidly, and one of the most significant shifts in recent years has been the growing adoption of liquid cooling technologies. Traditionally, servers from original equipment manufacturers (OEMs) and original design manufacturers (ODMs) have relied on air cooling methods, using fans and heat sinks to dissipate heat from CPUs and GPUs. While effective for standard computing tasks, these conventional systems are no longer sufficient to meet the demands of modern high-performance workloads, particularly those powered by AI chips.

AI hardware, especially modern GPUs and specialised chips like TPUs, generates tremendous amounts of heat because of its high power density. In fact, some of the newest AI chips won’t even power up unless cooled by liquid, pushing cloud service providers (CSPs) into a race to adopt liquid cooling so that their infrastructure can handle the heat, quite literally.

However, liquid cooling is not a one-size-fits-all solution. There are different approaches, each with its own benefits and drawbacks. Let’s explore the main types of liquid cooling and how they compare.

Types of Liquid Cooling

Direct-to-Chip Cooling (Cold Plate Cooling)

Direct-to-chip cooling, often called cold plate cooling, is a popular form of liquid cooling used in data centres. In this method, coolant is pumped through pipes to the cold plates attached directly to the heat-generating components, such as CPUs and GPUs. The coolant absorbs heat and is then circulated away to be cooled and recirculated.
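To make the heat-removal step concrete, here is a minimal sketch of the standard sensible-heat relation Q = m * c_p * dT, rearranged to estimate how much coolant a cold plate loop needs. The function name and the default fluid properties (plain water at roughly room temperature) are assumptions for illustration; real loops often use water-glycol mixtures with different properties, and the heat load and temperature rise in the example are made-up figures.

```python
def required_flow_l_per_min(heat_load_w, delta_t_k,
                            cp_j_per_kg_k=4186.0,
                            density_kg_per_m3=997.0):
    """Estimate the coolant flow needed to carry away a given heat load.

    Uses Q = m_dot * c_p * delta_T. The defaults approximate plain water
    (an assumption); pass the real fluid's properties for other coolants.
    """
    if heat_load_w < 0:
        raise ValueError("heat_load_w must be non-negative")
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    mass_flow_kg_per_s = heat_load_w / (cp_j_per_kg_k * delta_t_k)
    volume_flow_m3_per_s = mass_flow_kg_per_s / density_kg_per_m3
    # Convert cubic metres per second to litres per minute.
    return volume_flow_m3_per_s * 1000.0 * 60.0


# Illustrative figures only: a 1,000 W chip with a 10 K coolant temperature rise.
print(f"{required_flow_l_per_min(1000.0, 10.0):.2f} L/min")
```

Allowing a larger temperature rise across the cold plate reduces the flow required, which is one of the trade-offs behind pump sizing.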

Pros

Direct-to-chip cooling targets the components that generate the most heat, ensuring efficient thermal management. It is also often easier to integrate into existing data centre architectures than immersion cooling systems.

Cons

It requires more complex plumbing within the server infrastructure to ensure the coolant reaches every chip effectively. While more efficient than air cooling, direct-to-chip (DTC) cooling still requires energy to pump the liquid and cool it afterwards, which adds to operational costs. DTC may also struggle in ultra-high-density environments that immersion cooling handles more comfortably. Finally, this technique doesn’t cool other heat-generating components in the server, such as network cards or power supplies, which typically still rely on air cooling.

Liquid Immersion Cooling (Single-Phase and Two-Phase)

Liquid immersion cooling submerges the entire server, or a portion of the server, in a dielectric (non-conductive) liquid. There are two main types: single-phase and two-phase immersion cooling.

Single-phase immersion cooling

In this method, servers are fully immersed in coolant that circulates to absorb heat. The liquid doesn’t change its state; it remains a liquid throughout the cooling process.

Two-phase immersion cooling

In this method, the liquid changes state (from liquid to vapour) when it absorbs heat from the servers. The vapour rises, is condensed back into liquid, and is then recirculated. This method is less popular because of system design complexity, fluid loss through evaporation, and the fluid itself, whose vapour could be harmful when inhaled.
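In a two-phase system the heat is carried away mainly by the phase change itself, which follows the latent-heat relation Q = m * h_fg. The sketch below estimates how much fluid must vaporise, and then be condensed, to absorb a given heat load. The function name is hypothetical, and the latent heat used in the example is a placeholder rather than a figure for any particular dielectric fluid; the real value should come from the fluid's datasheet.

```python
def vapour_mass_rate_kg_per_s(heat_load_w, latent_heat_j_per_kg):
    """Mass of fluid that must boil off per second to absorb heat_load_w.

    Uses Q = m_dot * h_fg and ignores any sensible heating of the liquid,
    which is a simplifying assumption.
    """
    if heat_load_w < 0:
        raise ValueError("heat_load_w must be non-negative")
    if latent_heat_j_per_kg <= 0:
        raise ValueError("latent_heat_j_per_kg must be positive")
    return heat_load_w / latent_heat_j_per_kg


# Placeholder latent heat of 100 kJ/kg, chosen only to make the arithmetic easy.
rate = vapour_mass_rate_kg_per_s(1000.0, 100_000.0)
print(f"{rate * 1000:.1f} g/s of fluid vaporised")
```

The condenser has to turn the same mass of vapour back into liquid every second, and any vapour that escapes the tank is lost fluid.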

Pros

Immersion cooling can handle extremely high power densities, making it ideal for AI and other compute-heavy tasks. This cooling technique eliminates the need for server fans and air conditioners, significantly cutting energy consumption. Since components are submerged in a stable environment without airflow, they experience less wear from dust and environmental factors, potentially increasing their longevity. Single-phase immersion cooling also offers almost silent operation.

Cons

Immersion systems tend to be more expensive to set up than direct-to-chip cooling and can make servers harder to access for maintenance, requiring more specialised skills and equipment. Depending on tank size, this cooling method can also take up more floor space, which could be a constraint in high-density environments.

Hybrid Cooling Systems

Some data centres use a combination of air, direct-to-chip, and immersion cooling to optimise for different workloads and hardware configurations. For example, low-density servers might still use air cooling, while high-performance AI workloads could rely on immersion cooling for maximum efficiency.
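The per-workload choice described above can be sketched as a simple rule keyed on rack power density. Everything here is hypothetical: the names, the idea of using rack kilowatts as the only input, and especially the threshold values in the example, which are placeholders and not industry limits. A real decision would also weigh cost, floor space, and maintenance access.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoolingThresholds:
    """Operator-chosen limits in kW per rack (hypothetical parameters)."""
    air_max_kw: float
    dtc_max_kw: float

    def __post_init__(self):
        if not 0 <= self.air_max_kw <= self.dtc_max_kw:
            raise ValueError("expected 0 <= air_max_kw <= dtc_max_kw")


def choose_cooling(rack_kw, thresholds):
    """Return 'air', 'direct-to-chip', or 'immersion' for one rack."""
    if rack_kw < 0:
        raise ValueError("rack_kw must be non-negative")
    if rack_kw <= thresholds.air_max_kw:
        return "air"
    if rack_kw <= thresholds.dtc_max_kw:
        return "direct-to-chip"
    return "immersion"


# Placeholder thresholds for illustration only.
limits = CoolingThresholds(air_max_kw=20.0, dtc_max_kw=80.0)
for kw in (8.0, 45.0, 120.0):
    print(kw, "kW ->", choose_cooling(kw, limits))
```

Keeping the thresholds in one place means they can be revised as hardware changes without touching the selection logic.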

Pros

Hybrid systems allow for a mix of cooling methods based on the specific needs of the data centre, optimising cost and performance. Hybrid cooling can grow as the needs of the data centre evolve, allowing for flexibility in the types of hardware deployed.

Cons

Integrating multiple cooling systems increases the complexity of data centre design, monitoring, and maintenance. While efficient, the combination of multiple cooling methods can lead to higher upfront capital expenditure and operational expenses.

The Necessity of Liquid Cooling for AI and Sustainability

AI workloads are among the driving forces behind the shift towards liquid cooling. AI chips, particularly GPUs and TPUs, generate immense heat due to their power-hungry operations. In some cases, these chips cannot even power on without liquid cooling, because traditional air cooling cannot remove their heat fast enough. As AI becomes more embedded in applications from cloud computing to autonomous vehicles, the demand for liquid cooling is only going to increase.

Liquid cooling not only enables higher-density computing but also plays a significant role in improving the sustainability of data centres. By significantly reducing the need for energy-hungry fans and traditional air conditioning systems, liquid cooling methods can reduce a data centre’s overall power consumption. This is crucial as the world increasingly focuses on energy efficiency and carbon reduction in data centres.
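One common way to quantify this effect is power usage effectiveness (PUE): total facility power divided by IT equipment power, where 1.0 would mean no overhead at all. The sketch below compares two scenarios with different cooling overheads. The function name and every figure in the example are made up for illustration and do not describe any real facility.

```python
def pue(it_power_kw, cooling_power_kw, other_overhead_kw=0.0):
    """Power usage effectiveness: total facility power / IT power."""
    if it_power_kw <= 0:
        raise ValueError("it_power_kw must be positive")
    if cooling_power_kw < 0 or other_overhead_kw < 0:
        raise ValueError("overheads must be non-negative")
    total_kw = it_power_kw + cooling_power_kw + other_overhead_kw
    return total_kw / it_power_kw


# Illustrative numbers only: same IT load, different cooling overhead.
air_cooled = pue(it_power_kw=1000.0, cooling_power_kw=400.0, other_overhead_kw=100.0)
liquid_cooled = pue(it_power_kw=1000.0, cooling_power_kw=100.0, other_overhead_kw=100.0)
print(f"air-cooled PUE:    {air_cooled:.2f}")
print(f"liquid-cooled PUE: {liquid_cooled:.2f}")
```

Note that PUE only captures facility overhead. Power saved by removing fans inside the servers shows up as lower IT power rather than as a better PUE, so it is worth tracking total consumption alongside the ratio.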

Conclusion

As AI continues to push the boundaries of what data centres can achieve, liquid cooling has gone from a niche technology to a critical part of modern infrastructure. Whether it’s direct-to-chip or full immersion systems, the race for liquid cooling adoption is on, and the winners will be the CSPs that can provide high performance while minimising energy use and maximising cooling efficiency. For cloud providers already operating at scale, the shift to liquid cooling is inevitable if they hope to stay competitive in a future where AI workloads are only becoming more intensive.