Data Center Cooling: Machine Learning is the Problem and the Solution

The solution to cooling high-density machine learning workloads may be more machine learning.

It’s no secret that rack kW is steadily increasing in the data center, nor is it any wonder why. Processing power is greater than ever and there’s only one direction for it to go: up.

However, the massive, sustained computational power required by machine learning workloads is anything but business as usual. Most data center operators can grapple with gradual increases in IT footprint, but high-density GPU clusters for machine learning raise the stakes, particularly where cooling is concerned.

Perhaps newer data centers, especially those using containment strategies, have the infrastructure to adequately cool what, in some cases, amounts to 30 kW per rack or more. Most older data centers, though, aren’t ready to sustain these requirements. This could prove problematic as artificial intelligence, machine learning and deep learning workloads become more commonplace.

Indeed, some colocation providers that operate older raised-floor data centers without hot aisle containment already serve customers that want to load up their cabinets but lack the ability to cool their desired densities. But the next wave of customers will have an even bigger ask: cooling infrastructure that can support machine learning workloads.

How can this be done efficiently and cost-effectively?

Fighting Fire with Fire

If there’s one thing we’ve learned from Google in the past year or so, it’s that the solution to cooling high-density machine learning workloads may be more machine learning. The Mountain View giant spent several years testing an algorithm that can learn how to best adjust cooling infrastructure. As a result, Google achieved a 40-percent reduction in the amount of energy used for cooling. Phase two of that deployment is to put the algorithm on autopilot rather than having it make recommendations to human operators.

Clearly, machine learning can be, and has been, used to achieve greater data center cooling efficiency. While most data centers are not yet equipped to do the same, the theory behind how machine learning can optimize cooling efficiency is fairly well understood.

It starts with a PID (proportional-integral-derivative) loop. This tried-and-true method helps an industrial system (cooling infrastructure in this case) make real-time adjustments to cooling equipment by comparing the actual temperature of the data center to the desired temperature and calculating the error between the two. It then uses that error to make a course correction that brings the temperature back to the desired value, ideally with as little electricity consumption as the loop's tuning allows.
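To make the idea concrete, here is a minimal sketch of a PID loop driving a toy cooling model. Everything in it is an assumption made for illustration: the `PID` class, the `simulate` helper, the gain values, and the heating and cooling rates are hypothetical and not taken from any real building management system.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook PID controller (illustrative; no anti-windup logic)."""
    kp: float  # proportional gain
    ki: float  # integral gain
    kd: float  # derivative gain
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, setpoint: float, measured: float, dt: float) -> float:
        """Return a cooling output between 0.0 (off) and 1.0 (full)."""
        # For cooling, a room hotter than the setpoint is a positive error.
        error = measured - setpoint
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        raw = self.kp * error + self.ki * self.integral + self.kd * derivative
        return min(1.0, max(0.0, raw))


def simulate(pid: PID, setpoint: float = 24.0, start_temp: float = 30.0,
             steps: int = 600, dt: float = 1.0) -> tuple[float, float]:
    """Run the loop against a toy room model; return (final temp, energy)."""
    temp = start_temp
    energy = 0.0
    for _ in range(steps):
        cooling = pid.update(setpoint, temp, dt)
        # Assumed toy physics: IT load adds 0.05 degrees C per second,
        # full cooling removes 0.2 degrees C per second.
        temp += (0.05 - 0.2 * cooling) * dt
        energy += cooling * dt  # crude proxy for electricity used
    return temp, energy


if __name__ == "__main__":
    final_temp, energy_used = simulate(PID(kp=0.5, ki=0.01, kd=0.0))
    print(f"final temperature: {final_temp:.2f} C, energy proxy: {energy_used:.1f}")
```

The gains `kp`, `ki` and `kd` are fixed here. How well the loop behaves depends on how well those fixed values match the conditions it actually encounters.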

PID loops work well; however, they optimize based on a finite set of conditions, and when it comes to data center cooling, there are many conditions that are constantly in flux. This is where machine learning comes into play. Rather than tasking a person with optimizing and re-optimizing based on shifting conditions, an algorithm can monitor PID loops and constantly adjust as needed.

In other words, the PID loops are continually reconfigured based on the changing factors that influence cooling infrastructure efficiency. Everything from internal humidity, to external weather, to utilization fluctuations within the facility, to interactions between different elements within the cooling infrastructure can influence temperature stability in a high-density data center, and also how efficiently the desired temperature is achieved. It is impractical and costly for a human to constantly re-tune PID loops to ensure the most efficient configuration is always in place.

But a machine learning algorithm can. In theory, it can learn the optimal settings for each individual circumstance and apply these adjustments automatically, without human intervention, based on real-time external and internal conditions. Think of it as autopilot for data center cooling.
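The supervisory idea can be sketched in a few lines. The example below is deliberately simplistic and is not Google's method: it stands in for "learning" with a plain grid search against a toy simulator, then looks up the best-known settings for whatever condition is observed. The function names (`run_episode`, `learn_settings`, `settings_for`), the single `heat_rate` condition, the cost weights, and the thermal model are all assumptions made for illustration. A production system would work from many sensor inputs and a trained model rather than a lookup table.

```python
import itertools


def run_episode(kp: float, ki: float, heat_rate: float,
                setpoint: float = 24.0, steps: int = 300,
                dt: float = 1.0) -> float:
    """Simulate a toy room under PI control; return a cost (lower is better)."""
    temp, integral, cost = setpoint + 3.0, 0.0, 0.0
    for _ in range(steps):
        error = temp - setpoint
        integral += error * dt
        cooling = min(1.0, max(0.0, kp * error + ki * integral))
        # Assumed toy physics: full cooling removes 0.2 degrees C per second.
        temp += (heat_rate - 0.2 * cooling) * dt
        # Penalise temperature deviation and cooling effort
        # (cooling effort is a crude proxy for electricity).
        cost += (error ** 2 + 0.5 * cooling) * dt
    return cost


def learn_settings(heat_rates, kp_grid, ki_grid):
    """For each condition, keep the (kp, ki) pair with the lowest cost."""
    table = {}
    for heat_rate in heat_rates:
        table[heat_rate] = min(
            itertools.product(kp_grid, ki_grid),
            key=lambda gains: run_episode(gains[0], gains[1], heat_rate),
        )
    return table


def settings_for(table, observed_heat_rate):
    """Return the learned gains for the closest known condition."""
    nearest = min(table, key=lambda known: abs(known - observed_heat_rate))
    return table[nearest]


if __name__ == "__main__":
    table = learn_settings(
        heat_rates=[0.02, 0.05, 0.1],   # hypothetical light/medium/heavy load
        kp_grid=[0.1, 0.5, 1.0],
        ki_grid=[0.0, 0.01, 0.05],
    )
    print(settings_for(table, observed_heat_rate=0.08))
```

The point of the sketch is the division of labour: the PI loop still does the second-by-second control, while a slower outer process decides which settings the loop should be running under current conditions.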

Turning Concept into Reality

Google building an application like this is one thing, but what about other data center operators?

The development and implementation of the type of application we’re describing could come with a colossal upfront cost, one that’s hard to justify for most data center operators – even those with many data centers.

However, developing and training software to act in this way could be a competitive advantage for forward-thinking controls companies. Arguably, in the future, it will be table stakes. Of course, even with such advanced cooling controls, the data center’s physical infrastructure is still important. Legacy data centers using raised floors and inefficient cooling infrastructure have lower ceilings on capacity and efficiency – regardless of how smart the controls program is.

The sense of urgency for this type of system is only beginning to build. But we know with certainty that the majority of data center operators (67 percent, according to AFCOM) are seeing increasing densities. We also know that machine learning’s power requirements have the potential to spur this growth on at a blistering pace in the years ahead.

While we don’t yet know how we’ll handle this transformation, I suspect that the solution is already right under our noses.
