Agile AI Architectures: Building Fungible Data Centers for the AI Era
Artificial Intelligence (AI) is rapidly transforming fields from healthcare to software engineering. Innovations like Google’s Magic Cue on the Pixel 10, Nano Banana (Gemini 2.5 Flash image generation), Code Assist, and DeepMind’s AlphaFold highlight the advancements made in just the past year. These breakthroughs are powered by equally impressive developments in computing infrastructure.
The exponential growth in AI adoption presents significant challenges for data center design and management. At Google I/O, it was revealed that Gemini models process nearly a quadrillion tokens monthly, with AI accelerator consumption increasing 15-fold in the last 24 months. This explosive growth necessitates a new approach to data center architecture, emphasizing agility and fungibility to manage volatility and heterogeneity effectively.
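To put the 15-fold figure in planning terms, the short Python sketch below converts it into an implied monthly growth rate and doubling time. It assumes growth was smooth and compounding at a constant rate over the 24 months, which the source does not state; the function name implied_growth is hypothetical.

```python
import math


def implied_growth(multiple: float, months: float) -> tuple[float, float]:
    """Return (monthly growth rate, doubling time in months).

    Assumes constant compound growth, which is a simplification:
    real demand is likely lumpier than this.
    """
    monthly_factor = multiple ** (1.0 / months)
    doubling_months = months * math.log(2) / math.log(multiple)
    return monthly_factor - 1.0, doubling_months


# 15-fold growth over 24 months, as quoted in the article.
rate, doubling = implied_growth(15.0, 24.0)
print(f"monthly growth: {rate:.1%}, doubling time: {doubling:.1f} months")
```

Under that assumption, accelerator consumption grows by roughly 12% per month and doubles about every six months, far faster than a multi-year facility planning cycle can track.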
Addressing the Challenges of AI Growth
Traditional data center planning involves long lead times that struggle to keep pace with the dynamic demands of AI. Each new generation of AI hardware, such as TPUs and GPUs, introduces unique power, cooling, and networking requirements. This rapid evolution increases the complexity of designing, deploying, and maintaining data centers. Furthermore, the need to support various data center facilities, from hyperscale environments to colocation providers across multiple regions, adds another layer of complexity.
To address these challenges, Google, in collaboration with the Open Compute Project (OCP), advocates for designing data centers with fungibility and agility as core principles. Modular architectures, interoperable components, and the ability to late-bind facilities and systems are essential. Standard interfaces across all data center components—power delivery, cooling, compute, storage, and networking—are also crucial.
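As a rough illustration of why standard interfaces enable late binding, the sketch below models a rack and a data hall slot as plain records and checks compatibility only at placement time. All class names, fields, and example values (RackSpec, HallSlot, late_bind, the power and voltage figures) are hypothetical and are not taken from any OCP specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RackSpec:
    """What a rack needs from the facility (hypothetical fields)."""
    name: str
    power_kw: float
    cooling: str       # e.g. "air" or "liquid"
    bus_voltage: str   # e.g. "48Vdc" or "+/-400Vdc"


@dataclass(frozen=True)
class HallSlot:
    """What a facility slot can provide (hypothetical fields)."""
    name: str
    max_power_kw: float
    cooling: frozenset[str]
    bus_voltages: frozenset[str]


def compatible(rack: RackSpec, slot: HallSlot) -> bool:
    """A rack fits a slot if every interface requirement is met."""
    return (
        rack.power_kw <= slot.max_power_kw
        and rack.cooling in slot.cooling
        and rack.bus_voltage in slot.bus_voltages
    )


def late_bind(rack: RackSpec, slots: list[HallSlot]) -> list[str]:
    """Return the names of all slots that could host the rack."""
    return [slot.name for slot in slots if compatible(rack, slot)]


# Illustrative values only.
slots = [
    HallSlot("hall-A", 30.0, frozenset({"air"}), frozenset({"48Vdc"})),
    HallSlot("hall-B", 150.0, frozenset({"air", "liquid"}),
             frozenset({"48Vdc", "+/-400Vdc"})),
]
rack = RackSpec("accel-rack", 120.0, "liquid", "+/-400Vdc")
placements = late_bind(rack, slots)
print(placements)
```

The point of the sketch is that neither record refers to a specific hardware generation: as long as both sides describe themselves through the same small set of interface fields, the placement decision can be deferred until the hardware actually arrives.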
Power and Cooling Innovations
Achieving agility in power management requires standardizing power delivery and building a resilient ecosystem with common interfaces at the rack level. OCP is developing technologies like ±400 Vdc designs and disaggregated solutions using side-car power. Emerging technologies such as low-voltage DC power and solid-state transformers promise fully integrated data center solutions in the future.
Data centers are also being reimagined as potential suppliers to the grid, using battery storage and microgrids. These solutions help manage the “spikiness” of AI training workloads and improve power efficiency. Cooling is evolving as well: Google has contributed Project Deschutes, a state-of-the-art liquid cooling solution, to the OCP community. Companies like Boyd, CoolerMaster, Delta, Envicool, Nidec, nVent, and Vertiv are showcasing liquid cooling demos, highlighting the industry’s enthusiasm.
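How battery storage can absorb the spikiness of a training workload can be sketched with a simple greedy peak-shaving loop. This is an idealized model, not a description of any deployed system: it assumes a lossless battery with no charge or discharge power limit, a fixed grid cap, and made-up load values; the function name shave_peaks is hypothetical.

```python
def shave_peaks(load_kw, grid_cap_kw, battery_kwh, step_h=1.0):
    """Return the grid draw per time step after greedy peak shaving.

    Above the cap, the battery discharges to cover the excess (as far
    as stored energy allows). Below the cap, spare grid headroom is
    used to recharge. Assumes a lossless battery that starts full.
    """
    charge_kwh = battery_kwh
    grid_kw = []
    for load in load_kw:
        if load > grid_cap_kw:
            discharge = min(load - grid_cap_kw, charge_kwh / step_h)
            charge_kwh -= discharge * step_h
            grid_kw.append(load - discharge)
        else:
            recharge = min(grid_cap_kw - load,
                           (battery_kwh - charge_kwh) / step_h)
            charge_kwh += recharge * step_h
            grid_kw.append(load + recharge)
    return grid_kw


# Illustrative spiky load: alternating 60 kW and 100 kW hours.
load = [60, 100, 60, 100, 60]
grid = shave_peaks(load, grid_cap_kw=80, battery_kwh=20)
print(grid)  # [60, 80, 80, 80, 80]
```

In this toy case the load swings between 60 kW and 100 kW, but the grid never sees more than the 80 kW cap, because the battery discharges during each spike and recharges in the following trough.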
Standardization and Open Standards
Integrating compute, networking, and storage in the server hall requires standardization of physical attributes like rack height, width, and weight, as well as aisle layouts and network interfaces. Standards for telemetry and mechatronics are also necessary for building and maintaining future data centers. OCP is standardizing telemetry integration for third-party data centers, establishing best practices, and developing common naming conventions and security protocols.
Beyond physical infrastructure, collaborations are focusing on open standards for scalable and secure systems:
- Resilience: Expanding manageability, reliability, and serviceability efforts from GPUs to include CPU firmware updates.
- Security: Caliptra 2.0, an open-source hardware root of trust, defends against threats with post-quantum cryptography, while OCP S.A.F.E. streamlines security audits.
- Storage: OCP L.O.C.K. provides an open-source key management solution for storage devices, building on Caliptra’s foundation.
- Networking: Congestion Signaling (CSIG) has been standardized, improving load balancing. Advancements in SONiC and efforts to standardize Optical Circuit Switching are also underway.
Sustainability Initiatives
Sustainability is a key focus. Google has developed a methodology for measuring the environmental impact of AI workloads, demonstrating that a typical Gemini Apps text prompt consumes minimal water and energy. This data-driven approach informs collaborations within OCP on embodied carbon disclosure, green concrete, clean backup power, and reduced manufacturing emissions.
Community-Driven Innovation
Google emphasizes the power of community collaborations and invites participation in the new OCP Open Data Center for AI Strategic Initiative. This initiative focuses on common standards and optimizations for agile and fungible data centers.
Looking ahead, leveraging AI to optimize data center design and operations is crucial. DeepMind’s AlphaChip, which uses AI to accelerate chip design, exemplifies this approach. AI-enhanced optimizations across hardware, firmware, software, and testing will drive the next wave of improvements in data center performance, agility, reliability, and sustainability.
The future of data centers in the AI era depends on community-driven innovation and the adoption of agile, fungible architectures. By standardizing interfaces, promoting open collaboration, and prioritizing sustainability, the industry can meet the growing demands of AI while minimizing environmental impact. These efforts will unlock new possibilities and drive further advancements in AI and computing infrastructure.
Source: Cloud Blog

