    Agile AI: Google’s Fungible Data Centers for the AI Era

    Agile AI Architectures: A Fungible Data Center for the Intelligent Era

    Artificial intelligence (AI) is rapidly transforming every aspect of our lives, from healthcare to software engineering. Google has been at the forefront of these advancements, showcasing developments like Magic Cue on the Pixel 10, Nano Banana (Gemini 2.5 Flash image generation), Code Assist, and AlphaFold. These breakthroughs are powered by equally impressive advancements in computing infrastructure. However, the increasing demands of AI services require a new approach to data center design.

    The Challenge of Dynamic Growth and Heterogeneity

    The growth in AI is staggering. Google reported a nearly 50X annual growth in monthly tokens processed by Gemini models, reaching 480 trillion tokens per month, and has since seen an additional 2X growth, hitting nearly a quadrillion monthly tokens. AI accelerator consumption has grown 15X in the last 24 months, and Hyperdisk ML data has grown 37X since its general availability (GA). Moreover, Google serves more than 5 billion AI-powered retail search queries per month. This rapid growth presents significant challenges for data center planning and system design.

    Traditional data center planning involves long lead times, but AI demand projections now change dynamically and dramatically, creating a mismatch between supply and demand. Furthermore, each generation of AI hardware, such as TPUs and GPUs, introduces new features, functionalities, and requirements for power, rack space, networking, and cooling. These new generations are also arriving at an increasing rate, which complicates the creation of a coherent end-to-end system. Changes in form factors, board densities, networking topologies, power architectures, and liquid cooling solutions further compound this heterogeneity, increasing the complexity of designing, deploying, and maintaining systems and data centers. Designs must also span a spectrum of data center facilities, from hyperscale sites to colocation providers, across multiple geographic regions.

    The Solution: Agility and Fungibility

    To address these challenges, Google proposes designing data centers with fungibility and agility as primary considerations. Architectures need to be modular, allowing components to be designed and deployed independently and be interoperable across different vendors or generations. They should support the ability to late-bind the facility and systems to handle dynamically changing requirements. Data centers should be built on agreed-upon standard interfaces, so investments can be reused across multiple customer segments. These principles need to be applied holistically across all components of the data center, including power delivery, cooling, server hall design, compute, storage, and networking.
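
The modularity and late-binding ideas above can be made concrete with a small sketch. The Python below (with hypothetical class names, fields, and numbers) models racks from different vendors or generations behind one agreed-upon interface, so a facility can defer its hardware choice until demand is actually known:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class RackSpec:
    """Standardized rack-level requirements (hypothetical fields)."""
    power_kw: float
    cooling: str          # e.g. "air" or "liquid"
    network_ports: int


class RackModule(Protocol):
    """Any accelerator generation or vendor that meets the
    agreed-upon interface can be slotted in interchangeably."""
    def spec(self) -> RackSpec: ...


class TPURack:
    def spec(self) -> RackSpec:
        return RackSpec(power_kw=100.0, cooling="liquid", network_ports=64)


class GPURack:
    def spec(self) -> RackSpec:
        return RackSpec(power_kw=120.0, cooling="liquid", network_ports=32)


def provision(demand_kw: float, candidates: list[RackModule]) -> list[RackSpec]:
    """Late-bind racks to a facility: the hardware choice is deferred
    until demand is known, because every module exposes the same
    standard interface."""
    chosen, total = [], 0.0
    for rack in candidates:
        if total >= demand_kw:
            break
        s = rack.spec()
        chosen.append(s)
        total += s.power_kw
    return chosen
```

Because `provision` depends only on the shared `RackModule` interface, a new hardware generation can be added without touching the planning logic, which is the essence of fungibility.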

    Power Management

    To achieve agility and fungibility in power, Google emphasizes standardizing power delivery and management to build a resilient end-to-end power ecosystem, including common interfaces at the rack power level. Collaborating with the Open Compute Project (OCP), Google introduced new technologies around +/-400Vdc designs and an approach for transitioning from monolithic to disaggregated solutions using side-car power (Mt. Diablo). Promising technologies like low-voltage DC power combined with solid-state transformers will enable these systems to transition to future fully integrated data center solutions.
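
A back-of-the-envelope way to see why higher-voltage DC distribution matters is Ohm's law: for the same power, raising the bus voltage cuts conductor current proportionally. The numbers below are purely illustrative, not Google's specifications:

```python
def bus_current_amps(power_kw: float, volts: float) -> float:
    """Ohm's-law illustration: I = P / V. The same rack power needs
    far less conductor current at a higher distribution voltage."""
    return power_kw * 1000.0 / volts


# Hypothetical 1 MW of rack power:
#   at 48 Vdc  -> ~20,833 A of bus current
#   at 400 Vdc ->   2,500 A of bus current
```

Lower current means thinner busbars, lower resistive losses, and simpler rack-level power interfaces, which is part of the motivation behind the +/-400Vdc work.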

    Google is also evaluating solutions that allow data centers to act as suppliers to the grid, not just consumers, with corresponding standardization around battery energy storage and microgrids. These solutions are already used to manage the “spikiness” of AI training workloads and to improve power efficiency and grid power usage.
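
The “spikiness” management described above can be sketched as a simple battery buffer: cap what the facility draws from the grid, discharge the battery during training spikes, and recharge during lulls. This is a toy model with made-up numbers, not Google's implementation:

```python
def smooth_grid_draw(load_kw, grid_cap_kw, battery_kwh, step_h=1.0):
    """Illustrative battery buffer: the grid supplies at most
    grid_cap_kw; load above the cap is served from the battery,
    and headroom below the cap recharges it."""
    soc = battery_kwh                  # state of charge, start full
    grid = []
    for load in load_kw:
        if load > grid_cap_kw:         # spike: discharge the battery
            draw = grid_cap_kw
            soc = max(0.0, soc - (load - grid_cap_kw) * step_h)
        else:                          # lull: recharge from headroom
            recharge = min(grid_cap_kw - load, (battery_kwh - soc) / step_h)
            draw = load + recharge
            soc += recharge * step_h
        grid.append(draw)
    return grid, soc
```

The effect is that the grid sees a flat, predictable draw while the workload itself oscillates, which is exactly what makes spiky AI training loads easier for utilities to host.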

    Data Center Cooling

    Data center cooling is also being reimagined for the AI era. Google announced Project Deschutes, a state-of-the-art liquid cooling solution contributed to the Open Compute community. Liquid cooling suppliers like Boyd, Cooler Master, Delta, Envicool, Nidec, nVent, and Vertiv are showcasing demos at major events. Further collaboration is needed on industry-standard cooling interfaces, new components like rear-door heat exchangers, and reliability. Standardizing layouts and fit-out scopes across colocation facilities and third-party data centers is particularly important to enable more fungibility.

    Server Hall Design

    Bringing together compute, networking, and storage in the server hall requires standardization of physical attributes such as rack height, width, depth, weight, aisle widths, layouts, rack and network interfaces, and standards for telemetry and mechatronics. Google and its OCP partners are standardizing telemetry integration for third-party data centers, including establishing best practices, developing common naming and implementations, and creating standard security protocols.
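
To make the common-naming idea concrete, a standardized telemetry record might look like the sketch below; the field names and units are hypothetical stand-ins for whatever the OCP working groups actually agree on:

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class RackTelemetry:
    """A hypothetical common telemetry record. Shared field names
    and units are what let third-party facilities and Google tooling
    interpret the same data identically."""
    rack_id: str
    inlet_temp_c: float        # coolant/air inlet temperature
    coolant_flow_lpm: float    # liquid cooling flow, liters/minute
    power_kw: float


def to_wire(sample: RackTelemetry) -> str:
    """Serialize to a vendor-neutral JSON payload so any facility's
    monitoring stack can consume it."""
    return json.dumps(asdict(sample), sort_keys=True)
```

A standard security protocol would then wrap payloads like this in authenticated transport, per the best practices the text mentions.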

    Open Standards for Scalable and Secure Systems

    Beyond physical infrastructure, Google is collaborating with partners to deliver open standards for more scalable and secure systems. Key highlights include:

    • Resilience: Expanding efforts on manageability, reliability, and serviceability from GPUs to include CPU firmware updates and debuggability.
    • Security: Caliptra 2.0, the open-source hardware root of trust, now defends against future threats with post-quantum cryptography, while OCP S.A.F.E. makes security audits routine and cost-effective.
    • Storage: OCP L.O.C.K. builds on Caliptra’s foundation to provide a robust, open-source key management solution for any storage device.
    • Networking: Congestion Signaling (CSIG) has been standardized and is delivering measured improvements in load balancing. Alongside continued advancements in SONiC, a new effort is underway to standardize Optical Circuit Switching.
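
As a toy illustration of the congestion-aware load balancing that CSIG-style per-path signals enable, the sketch below simply routes the next flow to the least-congested path; the signal format here is invented for illustration, not the CSIG wire format:

```python
def pick_path(congestion: dict[str, float]) -> str:
    """Given per-path congestion signals (0.0 = idle, 1.0 = saturated),
    send the next flow down the least-congested path."""
    return min(congestion, key=congestion.get)
```

Real load balancers weigh traffic across paths rather than picking a single winner, but the principle is the same: fresher congestion signals produce better placement decisions.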

    Sustainability

    Sustainability is embedded in Google’s work. The company developed a new methodology for measuring the energy, emissions, and water impact of emerging AI workloads. This data-driven approach is being applied to other collaborations across the OCP community, focusing on an embodied-carbon disclosure specification, green concrete, clean backup power, and reduced manufacturing emissions.
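
In the spirit of that methodology (though with placeholder conversion factors, not Google's published numbers), a minimal estimator converts a workload's measured energy into emissions and water:

```python
def workload_footprint(energy_kwh: float,
                       grid_co2_kg_per_kwh: float = 0.4,
                       wue_l_per_kwh: float = 1.1) -> dict:
    """Toy footprint estimator. Both factors are illustrative
    placeholders: the carbon intensity depends on the local grid mix,
    and the water usage effectiveness (WUE) on the cooling design."""
    return {
        "energy_kwh": energy_kwh,
        "co2e_kg": energy_kwh * grid_co2_kg_per_kwh,
        "water_l": energy_kwh * wue_l_per_kwh,
    }
```

The value of a shared methodology is that these factors get measured and reported consistently across operators, making AI workload footprints comparable.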

    AI-for-AI

    Looking ahead, Google plans to leverage AI advances in its own work to amplify productivity and innovation. DeepMind’s AlphaChip, which uses AI to accelerate and optimize chip design, is an early example. Google sees more promising uses of AI for systems across hardware, firmware, software, and testing; for performance, agility, reliability, and sustainability; and across design, deployment, maintenance, and security. These AI-enhanced optimizations and workflows will bring the next order-of-magnitude improvements to the data center.

    Conclusion

    Google’s vision for agile and fungible data centers is crucial for meeting the dynamic demands of AI. By focusing on modular architectures, standardized interfaces, power management, liquid cooling, and open compute standards, Google aims to create data centers that can adapt to rapid changes and support the next wave of AI innovation. Collaboration within the OCP community is essential to driving these advancements forward.

    Source: Cloud Blog