AI infra
An overview of building an NDR InfiniBand network for a GPU cluster:
Network Architecture:
Use a two-tier, multi-rail networking topology (a rail-mapping sketch follows this section)
Leaf-spine architecture, with the leaf switches deployed as ToR (Top of Rack) switches
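A common way to realize the multi-rail design is rail-optimized cabling, where NIC i of every server attaches to a leaf switch dedicated to rail i, so same-index GPUs across servers are always one leaf hop apart. The sketch below illustrates the mapping; the rail-optimized assumption, naming scheme, pod size, and port counts are illustrative rather than taken from a specific reference design.

```python
# Sketch of rail-optimized cabling: NIC i of every server attaches to a leaf
# switch belonging to rail i. All names and constants are illustrative.

RAILS = 8                 # one rail per GPU/NIC in an HGX H100 server
SERVERS_PER_POD = 64      # pod size assumed later in this overview
DOWNLINKS_PER_LEAF = 32   # half of a 64-port NDR switch, keeping 1:1 uplinks

def leaf_port_for(server_id: int, nic_id: int) -> tuple[str, int]:
    """Map (server, NIC) to a (leaf switch, downlink port) in the NIC's rail."""
    pod = server_id // SERVERS_PER_POD
    pos = server_id % SERVERS_PER_POD
    leaf_idx = pos // DOWNLINKS_PER_LEAF   # which leaf inside the rail
    port = pos % DOWNLINKS_PER_LEAF        # downlink port on that leaf
    return f"pod{pod}-rail{nic_id}-leaf{leaf_idx}", port

# Example: server 70, NIC 3 lands on 'pod1-rail3-leaf0', port 6
print(leaf_port_for(70, 3))
```

The payoff is that rail-aware collectives can keep most traffic within a single rail, touching the spine layer only when crossing rails or pods.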
Key Components:
GPU Servers: NVIDIA HGX H100 servers, each with 8 x H100 GPUs and 8 x 400G NDR NICs (e.g., NVIDIA ConnectX-7)
Leaf Switches: 400G NDR InfiniBand switches deployed as ToR (e.g., 64-port NVIDIA Quantum-2)
Spine Switches: 400G NDR InfiniBand switches forming the second tier (fixed or higher-port-count modular chassis)
Connectivity:
Each GPU server connects to the leaf (ToR) switches with 8 x 400G links, one per rail
Leaf switches connect to spine switches with 400G uplinks
Non-blocking fabric: 1:1 subscription ratio between server-facing ports and uplink ports (a quick check is sketched after this list)
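A quick way to confirm the 1:1 claim is to compare the aggregate bandwidth of a leaf's server-facing downlinks against its uplinks to the spines. A minimal sketch, with port counts assumed for a 64-port 400G switch:

```python
# Downlink:uplink bandwidth ratio for a leaf switch; 1.0 means non-blocking.
# Port counts below are assumptions for a 64-port 400G NDR switch.

LINK_GBPS = 400

def subscription_ratio(downlinks: int, uplinks: int, link_gbps: int = LINK_GBPS) -> float:
    """Return server-facing bandwidth divided by uplink bandwidth."""
    return (downlinks * link_gbps) / (uplinks * link_gbps)

print(subscription_ratio(32, 32))  # 1.0 -> non-blocking (1:1)
print(subscription_ratio(48, 16))  # 3.0 -> 3:1 oversubscribed
```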
Scale:
Can scale up to 1024 servers / 8192 GPUs in a two-tier network
Organized into "server pods" of 64 servers each (switch-count math is sketched below)
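For the figures above, a rough switch-count estimate can be derived directly, assuming 64-port 400G NDR switches, 8 rails, and a non-blocking two-tier fat tree per rail; all constants are illustrative:

```python
# Rough switch counts for 1024 servers at 1:1, with 64-port NDR switches and
# one fat-tree plane per rail. All constants are assumptions for illustration.

RADIX = 64       # ports per NDR switch
RAILS = 8        # NICs per server, one rail each
SERVERS = 1024

down_per_leaf = RADIX // 2                          # 32 server ports, 32 uplinks
leaves_per_rail = SERVERS // down_per_leaf          # 32 leaf switches per rail
uplinks_per_rail = leaves_per_rail * (RADIX // 2)   # 1024 uplinks per rail
spines_per_rail = uplinks_per_rail // RADIX         # 16 spine switches per rail

print(leaves_per_rail * RAILS, spines_per_rail * RAILS)  # 256 leaves, 128 spines
```

For reference, a two-tier fat tree of radix-64 switches terminates at most 64*64/2 = 2048 end ports per rail, so the 1024-server figure leaves headroom.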
Cabling:
Use MPO fiber optic cables between switches
Use 400G NDR optical transceivers in the OSFP form factor (twin-port OSFP on the switch side; OSFP or QSFP112 on the adapters); link counts are estimated below
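Cable and transceiver quantities follow directly from the design: one 400G link per server NIC down to the leaves plus the leaf-to-spine uplinks. The sketch below reuses the 1024-server, 1:1 figures from earlier and ignores spares and the management network:

```python
# Link count for the fabric: server-to-leaf links plus leaf-to-spine uplinks.
# Counts reuse the assumed 1024-server, non-blocking design sketched earlier.

SERVERS, NICS_PER_SERVER = 1024, 8
LEAVES, UPLINKS_PER_LEAF = 256, 32

server_links = SERVERS * NICS_PER_SERVER   # 8192 server-to-leaf links
uplink_links = LEAVES * UPLINKS_PER_LEAF   # 8192 leaf-to-spine links
print(server_links, uplink_links, server_links + uplink_links)  # 8192 8192 16384
# Each optical link needs fiber plus a transceiver at both ends; short in-rack
# runs can substitute DAC/ACC copper.
```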
Software:
Run a Subnet Manager (OpenSM or NVIDIA UFM) to discover the fabric and program switch routing tables
RDMA is native to InfiniBand, so no RoCE/PFC/ECN tuning is required as it would be on an Ethernet fabric; link-level flow control is credit-based
Enable adaptive routing and, where supported, SHARP in-network aggregation for collective operations (a post-configuration sanity check is sketched below)
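Once the subnet manager is running, a few standard diagnostics confirm the fabric is healthy. The sketch below assumes the usual InfiniBand tools (sminfo, ibstat, iblinkinfo) from MLNX_OFED / infiniband-diags are installed on an admin host; the output parsing is purely illustrative:

```python
# Post-configuration sanity checks using standard InfiniBand diagnostics.
# Assumes sminfo, ibstat and iblinkinfo are on PATH; parsing is illustrative.

import subprocess

def run(cmd: list[str]) -> str:
    """Run a diagnostic command and return its stdout (raises on failure)."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

if __name__ == "__main__":
    print(run(["sminfo"]))       # confirm a subnet manager is active
    print(run(["ibstat"]))       # local HCA ports should be Active / LinkUp
    links = run(["iblinkinfo"])  # dump every link in the fabric
    # Flag any link that is up but not running at full 4X width.
    for line in links.splitlines():
        if "LinkUp" in line and "4X" not in line:
            print("possibly degraded link:", line.strip())
```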
Management:
Implement a separate management network
Use fabric management software (e.g., NVIDIA UFM) for monitoring and analytics
Testing:
Validate performance with tools such as the perftest suite (ib_write_bw, ib_read_bw) and nccl-tests (e.g., all_reduce_perf); example invocations are sketched below
Test congestion control and overall cluster throughput
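For concreteness, one common way to drive these tests is shown below: ib_write_bw from the perftest suite for raw point-to-point bandwidth and all_reduce_perf from nccl-tests for collective throughput. The peer address, the mlx5_0 device name, and the single-node launch are placeholders to adapt; multi-node NCCL runs are usually wrapped in mpirun or srun:

```python
# Illustrative wrappers around common validation tools; adjust device names,
# peer addresses, and launchers to the actual cluster.

import subprocess

def link_bandwidth(server: str, device: str = "mlx5_0") -> None:
    """Client side of a point-to-point RDMA write test against `server`,
    which must already be running `ib_write_bw -d <device> --report_gbits`."""
    subprocess.run(["ib_write_bw", "-d", device, "--report_gbits", server],
                   check=True)

def allreduce_sweep(gpus_per_node: int = 8) -> None:
    """Single-node NCCL all-reduce sweep from 8 B to 8 GB message sizes."""
    subprocess.run(["all_reduce_perf", "-b", "8", "-e", "8G", "-f", "2",
                    "-g", str(gpus_per_node)], check=True)

if __name__ == "__main__":
    link_bandwidth("10.0.0.2")   # hypothetical peer address
    allreduce_sweep()
```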
Power and Cooling:
Plan for high power density racks (often 40+ kW per rack with HGX H100 servers; a rough budget is sketched below)
Ensure adequate cooling for GPU servers and switches
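A rough power budget makes the density point concrete. The per-server and per-switch figures below are assumptions (an 8-GPU HGX H100 server peaks around 10 kW, and the switch allowance varies by model), with cooling expressed via 1 kW ≈ 3412 BTU/hr:

```python
# Rack power and cooling estimate; all per-device figures are assumptions.

SERVER_KW = 10.0          # assumed peak draw per 8-GPU HGX H100 server
SWITCH_KW = 1.5           # assumed allowance per NDR switch in the rack
SERVERS_PER_RACK = 4
SWITCHES_PER_RACK = 2

rack_kw = SERVERS_PER_RACK * SERVER_KW + SWITCHES_PER_RACK * SWITCH_KW
print(f"{rack_kw:.1f} kW per rack, ~{rack_kw * 3412:,.0f} BTU/hr of heat to remove")
```

At roughly 43 kW per rack, this is well beyond typical air-cooled densities, which is why such deployments often rely on rear-door heat exchangers or liquid cooling.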
The key is to create a high-bandwidth, low-latency fabric optimized for GPU-to-GPU communication using NDR InfiniBand technology. Careful planning of the network topology, cabling, and configuration is essential for maximum performance.
Would you like me to elaborate on any specific aspect of building this network?