Hunting a Higgs-Bugson: Debugging a Kernel-Level NFS/Kerberos Issue

2025-07-03
Hunting a Higgs-Bugson: Debugging a Kernel-Level NFS/Kerberos Issue

Engineers encountered a difficult-to-reproduce bug causing file copy failures (-EACCES) in Gord, a critical trading data system. Disabling Kerberos resolved the issue, pointing to authentication problems. Investigation revealed the kernel obtains Kerberos credentials via the rpc_gssd daemon, but logs showed no anomalies. Extensive testing, including creating an in-memory fake filesystem and using bpftrace for kernel tracing, finally pinpointed the issue: high NFS server load caused request retransmissions. The kernel mishandled requests/responses with identical XIDs but different GSS sequence numbers, leading to checksum mismatches and errors. The engineer fixed the kernel to prevent immediate retransmission due to sequence number mismatches.

Development kernel bug