Conquering 3200 Gbps Network: A Journey with RDMA, EFA, and libfabric

2025-01-03

At Perplexity AI, the author used RDMA, EFA, and libfabric on AWS p5 instances (each with 8 NVIDIA H100 GPUs interconnected via NVSwitch) to reach 97% utilization of the 3200 Gbps network bandwidth. This article walks through the process and shares optimization techniques for high-performance network programming, including multi-threading, CPU core pinning, and state sharding. It also highlights the advantages of asynchronous communication models over collective communication methods.
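To put the headline figures in concrete terms, the short sketch below converts the numbers quoted above (3200 Gbps line rate, 97% utilization, 8 GPUs per instance) into aggregate and per-GPU throughput. The function name `effective_bandwidth` and its parameters are hypothetical and exist only for this illustration. The even split across GPUs is also an assumption, not something the article states.

```python
def effective_bandwidth(line_rate_gbps: int, utilization_pct: int, num_gpus: int) -> dict:
    """Convert a line rate and utilization percentage into achieved throughput.

    Hypothetical helper for illustration; assumes decimal units
    (1 GB/s = 8 Gbps) and traffic spread evenly across GPUs.
    """
    # Achieved rate in gigabits per second.
    achieved_gbps = line_rate_gbps * utilization_pct / 100
    # Same rate in gigabytes per second (8 bits per byte).
    achieved_gbytes_per_s = achieved_gbps / 8
    # Share of that rate per GPU, assuming an even split.
    per_gpu_gbytes_per_s = achieved_gbytes_per_s / num_gpus
    return {
        "achieved_gbps": achieved_gbps,
        "achieved_gbytes_per_s": achieved_gbytes_per_s,
        "per_gpu_gbytes_per_s": per_gpu_gbytes_per_s,
    }


if __name__ == "__main__":
    # Figures taken from the summary: 3200 Gbps, 97% utilization, 8 GPUs.
    result = effective_bandwidth(3200, 97, 8)
    print(result["achieved_gbps"])          # 3104.0
    print(result["achieved_gbytes_per_s"])  # 388.0
    print(result["per_gpu_gbytes_per_s"])   # 48.5
```

Under these assumptions, 97% utilization corresponds to about 3104 Gbps, or roughly 388 GB/s for the instance and 48.5 GB/s per GPU.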