Hunting a Higgs-Bugson: Debugging a Kernel-Level NFS/Kerberos Issue
Engineers encountered a difficult-to-reproduce bug causing file copy failures (-EACCES) in Gord, a critical trading data system. Disabling Kerberos resolved the issue, pointing to authentication problems. Investigation revealed the kernel obtains Kerberos credentials via the rpc_gssd daemon, but logs showed no anomalies. Extensive testing, including creating an in-memory fake filesystem and using bpftrace for kernel tracing, finally pinpointed the issue: high NFS server load caused request retransmissions. The kernel mishandled requests/responses with identical XIDs but different GSS sequence numbers, leading to checksum mismatches and errors. The engineer fixed the kernel to prevent immediate retransmission due to sequence number mismatches.