Conquering 3200 Gbps Network: A Journey with RDMA, EFA, and libfabric
2025-01-03
At Perplexity AI, the author used RDMA, EFA, and libfabric on AWS p5 instances (each with 8 NVIDIA H100 GPUs interconnected via NVSwitch) to reach 97% utilization of the 3200 Gbps network bandwidth. The article walks through the process and shares the optimization techniques involved (multi-threading, CPU core pinning, state sharding, and others) for high-performance network programming. It also highlights the advantages of asynchronous communication models over collective communication methods.
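To put the 97% figure in concrete terms, the short sketch below converts it into absolute throughput. The function name `achieved_bandwidth` is hypothetical, and two assumptions are made that the summary does not state: 1 Gbps is taken as 10^9 bits per second, and the bandwidth is assumed to be split evenly across the 8 GPUs.

```python
def achieved_bandwidth(link_gbps: int, utilization_pct: int, num_gpus: int) -> dict:
    """Convert a utilization percentage into absolute throughput figures.

    Assumes decimal units (1 Gbps = 10^9 bits/s) and an even split of
    bandwidth across GPUs; both are illustrative assumptions.
    """
    achieved_gbps = link_gbps * utilization_pct / 100
    return {
        "achieved_gbps": achieved_gbps,              # gigabits per second
        "achieved_gbytes_per_s": achieved_gbps / 8,  # 8 bits per byte
        "per_gpu_gbps": achieved_gbps / num_gpus,    # even split (assumption)
    }


if __name__ == "__main__":
    stats = achieved_bandwidth(link_gbps=3200, utilization_pct=97, num_gpus=8)
    for name, value in stats.items():
        print(f"{name}: {value}")
```

Under these assumptions, 97% of 3200 Gbps is 3104 Gbps, or about 388 GB/s for the whole instance.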
Development
High-Performance Networking