ClickHouse Lock Contention: A Year-Long Performance Bottleneck

2025-03-21

Tinybird spent a year puzzling over extremely low CPU utilization in one of its ClickHouse clusters during peak load. The root cause turned out to be contention on the Context lock. A new `ContextLockWaitMicroseconds` metric made lock wait times observable, and a redesign of Context locking, which replaced a single global mutex with read-write mutexes, significantly improved performance. The article describes how Clang's thread safety analysis was used to debug and resolve the concurrency issues, and reports benchmark results showing a 3x increase in QPS and substantial gains in CPU utilization.
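To make the locking change concrete, here is a minimal C++17 sketch of the general idea: readers share a lock, writers take it exclusively, and time spent waiting is accumulated in a counter. It is an illustration under stated assumptions, not ClickHouse's actual code. The class `SharedContext`, its members, and the `lockWaitMicroseconds()` accessor are hypothetical names; only the metric name `ContextLockWaitMicroseconds` comes from the article.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for a shared query context; not ClickHouse's Context class.
class SharedContext
{
public:
    // Readers take a shared lock, so concurrent reads do not serialize.
    std::string getCurrentDatabase() const
    {
        auto lock = timedLock<std::shared_lock<std::shared_mutex>>();
        return current_database;
    }

    // Writers take an exclusive lock; they are assumed to be rare.
    void setCurrentDatabase(std::string name)
    {
        auto lock = timedLock<std::unique_lock<std::shared_mutex>>();
        current_database = std::move(name);
    }

    // Total time spent waiting to acquire the lock, in microseconds
    // (loosely analogous to a lock-wait metric).
    uint64_t lockWaitMicroseconds() const
    {
        return lock_wait_us.load(std::memory_order_relaxed);
    }

private:
    // Acquire the requested lock type and record how long acquisition took.
    template <typename Lock>
    Lock timedLock() const
    {
        const auto start = std::chrono::steady_clock::now();
        Lock lock(mutex);
        const auto waited = std::chrono::duration_cast<std::chrono::microseconds>(
            std::chrono::steady_clock::now() - start);
        lock_wait_us.fetch_add(static_cast<uint64_t>(waited.count()),
                               std::memory_order_relaxed);
        return lock;
    }

    mutable std::shared_mutex mutex;
    mutable std::atomic<uint64_t> lock_wait_us{0};
    std::string current_database = "default";
};
```

With a single exclusive mutex, every call to a getter such as `getCurrentDatabase()` would queue behind all other callers. With a read-write mutex, only `setCurrentDatabase()` blocks the rest, which is the kind of workload (many reads, few writes) where this design tends to help.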

Development