ClickHouse Lock Contention: A Year-Long Performance Bottleneck
2025-03-21
Tinybird spent a year puzzling over extremely low CPU utilization in one of its ClickHouse clusters during peak load. The root cause turned out to be contention on the Context lock. The team added a `ContextLockWaitMicroseconds` metric to monitor lock wait times and redesigned the Context locking mechanism, replacing a single global mutex with read-write mutexes, which significantly improved performance. The article details how Clang's thread safety analysis was used to debug and resolve concurrency issues, and presents benchmark results showing a 3x increase in QPS and substantial CPU utilization gains.
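The core idea, a read-write mutex in place of a single exclusive one plus a counter for time spent waiting on the lock, can be sketched roughly as follows. This is a minimal illustration, not ClickHouse's actual code: `SharedContext`, `getSetting`, `setSetting`, `lockWaitMicroseconds`, and `runReaders` are hypothetical names, and the counter only loosely mirrors what a metric like `ContextLockWaitMicroseconds` might record.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a shared query context (not ClickHouse's class).
class SharedContext {
public:
    // Readers take a shared lock, so concurrent reads do not serialize.
    std::string getSetting() const {
        const auto start = Clock::now();
        std::shared_lock<std::shared_mutex> lock(mutex_);
        recordWait(start);
        return setting_;
    }

    // Writers take an exclusive lock and block readers only while updating.
    void setSetting(std::string value) {
        const auto start = Clock::now();
        std::unique_lock<std::shared_mutex> lock(mutex_);
        recordWait(start);
        setting_ = std::move(value);
    }

    // Total time all threads have spent waiting to acquire the lock.
    std::uint64_t lockWaitMicroseconds() const {
        return wait_us_.load(std::memory_order_relaxed);
    }

private:
    using Clock = std::chrono::steady_clock;

    // Adds the elapsed time since `start` to the wait counter.
    void recordWait(Clock::time_point start) const {
        const auto waited = std::chrono::duration_cast<std::chrono::microseconds>(
            Clock::now() - start);
        wait_us_.fetch_add(static_cast<std::uint64_t>(waited.count()),
                           std::memory_order_relaxed);
    }

    mutable std::shared_mutex mutex_;
    mutable std::atomic<std::uint64_t> wait_us_{0};
    std::string setting_ = "default";
};

// Starts `readers` threads that each read the setting `iterations` times
// and returns the number of non-empty values observed.
std::uint64_t runReaders(const SharedContext& ctx, int readers, int iterations) {
    std::atomic<std::uint64_t> reads{0};
    std::vector<std::thread> threads;
    for (int i = 0; i < readers; ++i) {
        threads.emplace_back([&ctx, &reads, iterations] {
            for (int j = 0; j < iterations; ++j) {
                if (!ctx.getSetting().empty()) {
                    reads.fetch_add(1, std::memory_order_relaxed);
                }
            }
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    return reads.load();
}
```

With an exclusive `std::mutex`, every call to `getSetting` would queue behind every other call. With `std::shared_mutex`, read-heavy workloads contend only when a writer is active, and a growing wait counter gives a direct signal that the lock, rather than the CPU, is the bottleneck.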
Development