Meta's Backend Aggregation (BAG) is a super spine network layer designed to interconnect thousands of GPUs across multiple data centers and regions, enabling gigawatt-scale AI clusters like Prometheus. This architecture facilitates high-capacity, resilient networking, allowing for the creation of massive, distributed compute resources for AI workloads. The design emphasizes modularity, advanced routing, and robust topologies to achieve unprecedented scale and reliability.
The original article, published on Meta Engineering, details Backend Aggregation (BAG), a critical networking component in Meta's strategy to build and operate immense AI clusters. BAG functions as a centralized, Ethernet-based super spine layer whose main job is to interconnect multiple spine-layer fabrics across data centers and regions. This design pools thousands of GPUs into a single, logical, gigawatt-scale AI cluster, such as Prometheus.
BAG acts as the aggregation point between regional networks and Meta’s backbone, which makes it essential for creating 'mega' AI clusters. It is engineered for very high bandwidth, with inter-BAG capacities reaching the petabit range (e.g., 16-48 Pbps per region pair). Because BAG layers are deployed in each region, tens of thousands of GPUs can be interconnected, which addresses the challenge of scaling compute resources across geographical boundaries.
BAG layers are distributed strategically across regions and connect to different L2 fabrics, such as Disaggregated Scheduled Fabric (DSF) and Non-Scheduled Fabric (NSF). Inter-BAG connectivity uses either a planar or a spread connection topology. Planar simplifies management but concentrates failure domains, while spread improves path diversity and resilience by distributing links across multiple BAG switches and planes. Oversubscription is managed carefully (e.g., 4.5:1 from L2 to BAG) to balance scale against performance.
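To make the 4.5:1 figure concrete, the following is a minimal sketch of how an oversubscription ratio is usually computed: capacity facing the lower layer divided by capacity facing the upper layer. The port counts and the 800G port speed are assumptions chosen only because they reproduce 4.5:1; the source does not state the actual port layout.

```python
def oversubscription_ratio(downlink_gbps: float, uplink_gbps: float) -> float:
    """Capacity facing the L2 fabric divided by capacity facing the BAG layer."""
    if downlink_gbps < 0 or uplink_gbps <= 0:
        raise ValueError("downlink must be >= 0 and uplink must be > 0")
    return downlink_gbps / uplink_gbps


# Hypothetical numbers, not taken from the source: 36 ports toward L2 and
# 8 ports toward BAG, all at an assumed 800 Gbps.
PORT_SPEED_GBPS = 800
downlink_gbps = 36 * PORT_SPEED_GBPS
uplink_gbps = 8 * PORT_SPEED_GBPS

ratio = oversubscription_ratio(downlink_gbps, uplink_gbps)

# If every downlink sends toward the BAG at once, each one can get at most
# this fraction of its line rate through the uplinks.
worst_case_share = 1 / ratio

print(f"oversubscription: {ratio:.1f}:1")
print(f"worst-case share of line rate: {worst_case_share:.1%}")
```

A higher ratio lets more L2 capacity hang off the same BAG uplinks, at the price of a smaller guaranteed share when cross-region traffic peaks.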
Design Consideration: Topology Choice
The choice between planar and spread topologies highlights a common system design trade-off: simplicity vs. resilience. Planar is easier to manage but less fault-tolerant, whereas spread offers greater resilience at the cost of increased complexity in setup and management. Architects must weigh these factors based on criticality and operational capabilities.
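The trade-off can be illustrated with a toy model. The sketch below assumes two regions with the same number of BAG planes and one BAG switch per plane; in planar mode a switch sends all of its inter-region links to the same-numbered remote plane, and in spread mode it divides them evenly across all remote planes. The function names, plane count, and link count are hypothetical and are not Meta's actual design parameters.

```python
def build_links(planes: int, links_per_switch: int, mode: str) -> dict:
    """Return {(local_plane, remote_plane): link_count} for a toy two-region model."""
    if links_per_switch % planes != 0:
        raise ValueError("links_per_switch must be divisible by planes")
    links = {}
    for local in range(planes):
        if mode == "planar":
            # All links from plane i land on remote plane i.
            links[(local, local)] = links_per_switch
        elif mode == "spread":
            # Links from plane i are split evenly across every remote plane.
            for remote in range(planes):
                links[(local, remote)] = links_per_switch // planes
        else:
            raise ValueError(f"unknown mode: {mode}")
    return links


def surviving_fraction(links: dict, planes: int, failed_remote: int) -> list:
    """Per local switch: fraction of inter-region links still up after one remote switch fails."""
    result = []
    for local in range(planes):
        total = sum(n for (l, _), n in links.items() if l == local)
        alive = sum(
            n for (l, r), n in links.items() if l == local and r != failed_remote
        )
        result.append(alive / total)
    return result


planar = surviving_fraction(build_links(4, 8, "planar"), 4, failed_remote=2)
spread = surviving_fraction(build_links(4, 8, "spread"), 4, failed_remote=2)
print("planar:", planar)
print("spread:", spread)
```

In this model both topologies lose the same total number of links when one remote switch fails. The difference is where the loss lands: planar takes one whole plane out of inter-region service, while spread costs every plane a small share of its capacity.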
The network design for BAG meticulously addresses resilience through port striping, IP addressing schemes, and comprehensive failure domain analysis at various levels (BAG, data hall, power distribution). Strategies like draining affected BAG planes and conditional route aggregation are employed to mitigate risks such as blackholing, ensuring high availability even at extreme scales.
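The idea behind conditional route aggregation can be sketched as follows: a device advertises a summary prefix only while enough of the more-specific prefixes behind it are reachable, and otherwise falls back to advertising only the specifics that are still up, so that traffic for an unreachable destination is not attracted and then dropped. This is a generic illustration, not Meta's implementation; the function name, the threshold parameter, and the prefixes are all assumptions.

```python
import ipaddress


def routes_to_advertise(aggregate, components, reachable, min_fraction=1.0):
    """Pick the aggregate or the surviving specifics (hypothetical policy).

    aggregate:    summary prefix, e.g. "10.0.0.0/22"
    components:   more-specific prefixes covered by the aggregate
    reachable:    the subset of components currently reachable
    min_fraction: share of components that must be up to keep the aggregate
    """
    agg = ipaddress.ip_network(aggregate)
    comps = [ipaddress.ip_network(c) for c in components]
    up = {ipaddress.ip_network(r) for r in reachable}

    for comp in comps:
        if not comp.subnet_of(agg):
            raise ValueError(f"{comp} is not covered by {agg}")

    live = [c for c in comps if c in up]
    if comps and len(live) / len(comps) >= min_fraction:
        return [str(agg)]
    # Condition not met: withdraw the aggregate and advertise only what is
    # still reachable, so traffic to the failed prefixes is not blackholed here.
    return [str(c) for c in live]


# Example prefixes are placeholders from private address space.
components = ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]

healthy = routes_to_advertise("10.0.0.0/22", components, components)
degraded = routes_to_advertise("10.0.0.0/22", components, components[:3])
print(healthy)
print(degraded)
```

With the default threshold of 1.0 the aggregate is announced only when every component is reachable. A lower threshold keeps routing tables smaller during partial failures but reintroduces some blackholing risk for the missing prefixes.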