Sustainable by Design: Innovation for Energy Efficiency in Artificial Intelligence, Part 1 | Microsoft Cloud Blog

Read about how we are progressing toward our sustainability commitments through the Sustainable by Design blog series, starting with Sustainable by Design: Advancing the sustainability of AI.

Earlier this summer, my colleague Noel Walsh published a blog detailing how we are working to conserve water in our data center operations, Sustainable by Design: Transforming data center water efficiency, as part of our commitment to our sustainability goals to become carbon negative, water positive, and zero waste, and to protect biodiversity.

At Microsoft, we design, build, and operate cloud computing infrastructure that spans the entire stack, from data centers to servers to custom silicon. This creates unique opportunities to orchestrate how these elements work together to improve performance and efficiency. We see work on power and energy efficiency as a critical path to meeting our commitment to become carbon negative by 2030, alongside our work to advance carbon-free electricity and decarbonization.


The rapid growth in demand for AI innovation to power the next frontiers of discovery has given us an opportunity to redesign our infrastructure systems, from data centers to servers to silicon, with efficiency and sustainability at the forefront. In addition to procuring carbon-free electricity, we innovate at every level of the stack to reduce the energy intensity and power requirements of cloud and AI workloads. Even before electrons enter our data centers, our teams are focused on maximizing the computing power we can generate from each kilowatt-hour (kWh) of electricity.

In this blog, I’d like to share examples of how we are advancing the power and energy efficiency of AI. This includes a whole-systems approach to efficiency and the application of AI, particularly machine learning, to manage cloud and AI workloads.

Driving efficiency from data centers to servers to silicon

Maximize hardware utilization through intelligent workload management

True to our roots as a software company, one of the ways we increase energy efficiency in our data centers is through software that enables real-time workload scheduling, so we can maximize the use of existing hardware to meet cloud service demand. For example, we may see more demand as people start their workday in one part of the world, and less in parts of the world where people are winding down for the evening. In many cases, we can align available capacity with internal resource needs, such as running AI training workloads during off-peak hours on existing hardware that would otherwise sit idle. This also helps us reduce power consumption.
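As a simplified illustration of the idea above, deferrable work can be gated on a region's off-peak window while latency-sensitive services always run. The window boundaries, job fields, and function names here are hypothetical, not Microsoft's actual scheduler:

```python
from datetime import time

# Assumed local off-peak window (hypothetical boundaries for illustration).
OFF_PEAK_START = time(22, 0)
OFF_PEAK_END = time(6, 0)

def is_off_peak(now: time) -> bool:
    """True if `now` falls in the overnight off-peak window."""
    return now >= OFF_PEAK_START or now < OFF_PEAK_END

def schedule(jobs: list, now: time) -> list:
    """Run latency-sensitive jobs immediately; run deferrable
    training jobs only when the region is off-peak."""
    runnable = []
    for job in jobs:
        if not job["deferrable"] or is_off_peak(now):
            runnable.append(job["name"])
    return runnable

jobs = [
    {"name": "web-frontend", "deferrable": False},
    {"name": "ai-training", "deferrable": True},
]
print(schedule(jobs, time(12, 0)))  # midday: ['web-frontend']
print(schedule(jobs, time(23, 0)))  # night: ['web-frontend', 'ai-training']
```

A real scheduler would weigh forecast demand, job priorities, and preemption cost rather than a fixed clock window, but the shape of the decision is the same.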

We use the power of software to increase energy efficiency at every level of the infrastructure stack, from data centers to servers to silicon.

Historically, across the industry, running AI and cloud computing workloads has relied on allocating central processing units (CPUs), graphics processing units (GPUs), and processing power to each team or workload, with typical CPU and GPU utilization rates of roughly 50% to 60%. This leaves some CPUs and GPUs underutilized, potential capacity that could ideally serve other workloads. To address this utilization challenge and improve workload management, we’ve consolidated Microsoft’s AI training workloads into a single pool managed by a machine learning technology called Project Forge.

Project Forge’s global scheduler uses machine learning to virtually schedule training and inference workloads into time slots when hardware has spare capacity, improving utilization rates to 80 to 90% at scale.

Currently in production across Microsoft services, this software uses AI to virtually schedule training and inference workloads, along with transparent checkpointing that saves a snapshot of the current state of a program or model so that it can be paused and restarted at any time. Whether running on partner silicon or custom Microsoft silicon like Maia 100, Project Forge has consistently increased utilization on Azure to 80 to 90% at scale.
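The transparent checkpointing described above can be sketched in miniature: snapshot a job's state periodically so a scheduler can preempt it and resume it later on whatever hardware has spare capacity. Project Forge's real mechanism is internal to Microsoft; the file name, state layout, and preemption hook below are purely illustrative:

```python
import os
import pickle

CKPT = "train_state.pkl"  # hypothetical checkpoint location

def save_checkpoint(state, path=CKPT):
    """Snapshot the job's current state to disk."""
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_checkpoint(path=CKPT):
    """Resume from the last snapshot, or start fresh."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0}

def train(total_steps, preempt_at=None):
    """Run training steps; if the scheduler preempts the job,
    checkpoint and yield the hardware immediately."""
    state = load_checkpoint()
    while state["step"] < total_steps:
        if preempt_at is not None and state["step"] == preempt_at:
            save_checkpoint(state)  # scheduler reclaims the hardware here
            return state
        state["step"] += 1          # stand-in for one real training step
    return state
```

Calling `train(10, preempt_at=4)` stops at step 4 and saves; a later `train(10)` picks up from the snapshot and finishes, which is what lets a scheduler stop and restart work without losing progress.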

Safely harvesting unused power across our data center fleet

Another way we improve energy efficiency is through intelligent workload placement within a data center to safely harvest unused power. Power harvesting refers to techniques that let us make the most of our available power: for example, if a workload does not consume all of the power allocated to it, the excess can be borrowed by, or even reallocated to, other workloads. Since 2020, we have harvested approximately 800 megawatts of power from existing data centers, enough to power approximately 2.8 million miles of driving in an electric vehicle.1

Over the past year, even as our customers’ AI workloads have grown, our energy savings from these practices have doubled. We continue to roll out these best practices across our data center fleet so we can reclaim and reallocate unused capacity without impacting performance or reliability.
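The borrowing logic described above can be sketched as a simple admission check: headroom that allocated workloads are not actually drawing is loaned out, with a hard cap at the rack's power budget. The budget, workload names, and wattages below are invented for illustration; real power harvesting also has to enforce caps in hardware to stay safe:

```python
# Hypothetical rack power budget in watts (illustrative only).
RACK_BUDGET_W = 10_000

def harvestable(allocations, draws):
    """Watts allocated to workloads but not currently drawn,
    which can be safely loaned to other work."""
    return sum(allocations[w] - draws.get(w, 0) for w in allocations)

def admit(allocations, draws, request_w):
    """Admit a new workload if its requested watts fit within
    uncommitted budget plus harvested headroom."""
    committed = sum(allocations.values())
    free = RACK_BUDGET_W - committed + harvestable(allocations, draws)
    return request_w <= free

allocations = {"inference": 6_000, "storage": 3_000}
draws = {"inference": 4_000, "storage": 2_500}
print(harvestable(allocations, draws))        # 2500 W of unused allocation
print(admit(allocations, draws, 3_000))       # fits in headroom: True
print(admit(allocations, draws, 4_000))       # exceeds headroom: False
```

The trade-off this sketch glosses over is that loaned power may need to be clawed back instantly if the original workload ramps up, which is why the blog stresses doing this without impacting performance or reliability.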

Increasing the efficiency of IT hardware through liquid cooling

In addition to managing workload power, we are focused on reducing the energy and water needed to cool chips and the servers that house them. The intensive processing of modern AI workloads generates more heat, and liquid-cooled servers require significantly less power for thermal management than air-cooled servers. The move to liquid cooling also lets us get more performance out of our silicon, as the chips operate more efficiently within their optimal temperature range.

A significant engineering challenge we faced in delivering these solutions was how to retrofit existing data centers designed for air-cooled servers to accommodate the latest advances in liquid cooling. With custom solutions like the “sidekick,” a component that sits adjacent to a rack of servers and circulates liquid like a car radiator, we can bring liquid cooling to existing data centers, reducing the energy required for cooling while increasing rack density. This in turn increases the computing power we can generate from every square foot of our data centers.

Learn more and discover resources for cloud and AI efficiency

Join us as we continue this series, including a look at how we are working to bring promising efficiency research out of the lab and into commercial operations. You can also read more about how we are advancing sustainability through the Sustainable by Design blog series, starting with Sustainable by Design: Advancing the sustainability of AI.

For architects, developers, and IT decision makers who want to learn more about cloud and AI efficiency, we recommend checking out the Azure Architecture Sustainability Guide. This documentation set aligns with the Green Software Foundation’s design principles and is intended to help customers plan for and meet evolving sustainability requirements and regulations around the development, deployment, and operation of IT capabilities.


1The equivalence assumes an electric vehicle can travel an average of about 3.5 miles per kilowatt-hour (kWh); 800 megawatts sustained for one hour equals 800,000 kWh, or roughly 2.8 million miles.
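The footnote's arithmetic checks out directly; the 3.5 miles-per-kWh figure is the assumption the footnote itself states:

```python
# Verify the footnote's EV-miles equivalence.
megawatt_hours = 800                 # 800 MW sustained for one hour
kwh = megawatt_hours * 1000          # = 800,000 kWh
miles = kwh * 3.5                    # assumed EV efficiency, mi/kWh
print(f"{miles / 1e6:.1f} million miles")  # → 2.8 million miles
```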
