A Guide to Liquid Cooling in the Age of AI
With widespread investment in artificial intelligence and rapid advances in GPU processing power, liquid cooling is quickly becoming the most practical and sustainable model of data center cooling.
Demanding Workloads Call for Innovative Cooling Solutions
As the world embraces artificial intelligence (AI), liquid cooling for high-performance computing (HPC) is fast becoming a necessity for many deployments. Many of today’s AI applications (especially large language models, or LLMs) must be developed and hosted on HPC servers, which require high power densities. Where traditional IT deployments draw, on average, 8-10 kW of power per rack, deployments dedicated to AI applications can draw up to 132 kW per cabinet.
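To put those figures in perspective, a quick back-of-the-envelope comparison (using the rack power numbers cited above, which are averages and upper bounds rather than guarantees for any specific deployment):

```python
# Rough comparison of the rack power figures cited above: a dedicated
# AI cabinet can draw on the order of 13-16x the power of a
# traditional IT rack.
traditional_rack_kw = (8, 10)   # typical range for traditional IT racks
ai_rack_kw = 132                # high-density AI/HPC cabinet

low = ai_rack_kw / traditional_rack_kw[1]   # vs. a 10 kW rack
high = ai_rack_kw / traditional_rack_kw[0]  # vs. an 8 kW rack
print(f"{low:.1f}x to {high:.1f}x")  # → 13.2x to 16.5x
```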
But with these higher power draws, servers dedicated to AI applications also generate higher heat loads. Without adequate cooling, these servers will quickly overheat and shut down. We’ve reached the point where air cooling by itself is no longer the most energy-efficient way to remove the massive amounts of heat generated by the processors that power AI applications.
It also takes considerable power for data centers to continuously operate mechanical cooling and fans. A study by the American Council for an Energy-Efficient Economy found that chillers and HVAC systems account for 25% of data center energy use. Today, that power is sorely needed for HPC and AI application infrastructure.
It’s well known that liquid is far more effective at absorbing and conducting heat than air. In addition to safe and effective heat removal, today’s liquid cooling systems offer several advantages. Unlike air cooling, which must be set up to chill an entire room, liquid cooling can be installed on specific IT deployments. And because liquid cooling systems require less operational power, that power can be redirected to servers dedicated to LLMs or other AI applications, reducing operational costs for data centers.
In this article, we’ll take a closer look at available options for liquid cooling, including the benefits and drawbacks of each.
Direct-to-Chip Liquid Cooling
In direct-to-chip cooling, a cold plate sits directly on top of a CPU or GPU chip inside a server. As liquid coolant is pumped through the cold plate, it absorbs heat directly from the chip. The heated liquid then exits the server through sealed tubing and flows to a heat exchanger, typically located in a coolant distribution unit (CDU), that discharges the heat to a facility system.
But direct-to-chip cooling is still a hybrid system. The liquid coolant can only remove about 80% of the heat load. Air cooling provided by fans is still needed to remove the other 20%, which can drive up energy usage and cost. Also, installing the piping system for a direct-to-chip liquid cooling system in a legacy data center can be expensive.
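The hybrid nature of direct-to-chip cooling is easy to sketch with the 80/20 split described above (the rack power used here is a hypothetical example value, not a figure from any particular deployment):

```python
# Illustrative estimate of the residual air-cooled heat load in a
# direct-to-chip deployment. The 80/20 liquid/air split is the rough
# figure cited above; the rack power is a hypothetical example.

def residual_air_load_kw(rack_power_kw: float, liquid_fraction: float = 0.80) -> float:
    """Heat (kW) that fans and room air handling must still remove."""
    return rack_power_kw * (1.0 - liquid_fraction)

# A hypothetical 100 kW AI rack: liquid removes ~80 kW, while fans
# must still handle the remaining ~20 kW.
print(round(residual_air_load_kw(100.0), 1))  # → 20.0
```

Even at 80% capture, that residual load is why direct-to-chip systems still budget for fan power and room air handling.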
Immersion Liquid Cooling
In immersion cooling, HPC components are completely submerged in a dielectric (non-electrically conductive) cooling liquid. The forms of immersion cooling include:
Single-Phase Immersion Cooling
In single-phase immersion cooling, electrical components are immersed in a hydrocarbon-based coolant that removes heat from the chips. The coolant is then pumped to a heat exchanger, where a cooler water circuit removes the heat from the liquid. During this process, the liquid remains a liquid – it doesn’t change its phase or form. No air cooling fans are required, which eliminates their energy usage and operational costs.
Immersion tanks must be deployed with adequate space between them, on a reinforced floor designed to support their weight. Maintenance also becomes messy: each time a server requires servicing or parts replacement, the tank it sits in must be drained and the server components must be lifted out of the oily liquid. What’s more, there’s the looming concern of material compatibility between the fluid and the components immersed in it.
Two-Phase Immersion Cooling
In two-phase immersion cooling, HPC components are submerged in a specially-engineered fluorocarbon liquid with a low boiling point (e.g. under 120°F/50°C). The heat from the chips boils the liquid coolant, turning it into a heated gas, which rises to a condenser coil that removes the heat. The cooled gas then condenses and returns to the larger liquid volume. Two-phase coolants can support extreme rack densities of up to 250 kW per rack!
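The steady-state physics of two-phase cooling can be sketched with one relation: the chip heat load equals the coolant vaporization rate times the fluid’s latent heat of vaporization. The latent heat value below is an assumed, order-of-magnitude figure for a fluorocarbon coolant, not a specific product specification:

```python
# Back-of-the-envelope: coolant mass vaporized per second in a
# two-phase immersion tank. In steady state, heat load (kW) equals
# vaporization rate (kg/s) times latent heat (kJ/kg). The ~90 kJ/kg
# latent heat is an assumed, order-of-magnitude value for a
# fluorocarbon coolant, not a vendor figure.

def vaporization_rate_kg_per_s(heat_load_kw: float, latent_heat_kj_per_kg: float = 90.0) -> float:
    """Coolant mass boiled off per second to absorb the given heat load."""
    return heat_load_kw / latent_heat_kj_per_kg

# At an extreme 250 kW rack density, roughly 2.8 kg of coolant boils
# every second, all of which the condenser must return to the tank.
rate = vaporization_rate_kg_per_s(250.0)
print(round(rate, 2))  # → 2.78
```

That continuous boil-and-condense cycle is also why coolant loss matters so much: any vapor that escapes an open tank is expensive fluid that never condenses back.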
But two-phase cooling is a more complex system and therefore has higher up-front costs. The fluorocarbon-based liquid coolant is expensive, especially if you are using, say, 2,000 gallons for 300 HPC servers. Each time the tanks are drained for maintenance, any evaporated coolant must be replaced when the tanks are refilled, at significant cost. Additionally, the coolant is easily lost during server maintenance, escaping the tank whenever it’s opened.
Flooded Chassis Immersion Cooling
This is a form of single-phase or two-phase immersion cooling in which components are submerged in dielectric fluid inside a leak-proof aluminum chassis. Again, this makes for effective heat removal, but the fluid piping system carries a higher capital cost. Also, the number of providers offering this solution is limited, so supply chain issues may be a problem.
Looking at Liquid Cooling Options with Data Center Providers
There’s no “one right solution” for liquid cooling. The method you choose depends on your HPC servers and AI needs. But if you’re looking to deploy your HPC infrastructure in a facility owned by a data center provider, you should look at what the provider offers in terms of liquid cooling options.
Forward-thinking data center developers will be building facilities with the power and infrastructure to support and cool servers for any AI application. A provider should offer built-to-suit options for their clients and be willing to build out your HPC infrastructure using whichever liquid cooling option you prefer: direct-to-chip cooling, or single- or two-phase immersion cooling. If you’re not sure which form of liquid cooling to use, the provider should make recommendations based on your HPC needs. Additionally, your provider, like Sabey, should have people on their team dedicated to tenant improvements/fit-outs that include liquid-cooled solutions (ask us about our Customer Solutions Specialists).
Finally, providers should have the experience and operational maturity to handle the installation and maintenance of infrastructure of this complexity and cost. Couple that with a provider who fosters long-term relationships with their partners, vendors, and customers, and you can drastically reduce the risk of construction, installation, or supply chain delays.
Contact Sabey Data Centers today to ask about liquid cooling options for AI deployments.