PostgreSQL LISTEN/NOTIFY Bottleneck: Lessons from Processing Millions of Meeting Hours

2025-07-11

Recall.ai processes millions of hours of meeting data each month. Its Postgres database suffered downtime under highly concurrent writes. Investigation revealed that a transaction that has issued NOTIFY acquires a global lock on the database when it commits, which serializes those commits and creates a bottleneck under load. Moving this logic to the application layer resolved the issue.
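The summary does not say how the application-layer replacement was built, so the following is only a minimal sketch of the general idea: commit the write first, with no NOTIFY inside the transaction, and then fan the event out in application code. The names `AppNotifier`, `save_and_notify`, and the `bot_updates` channel are hypothetical and are not taken from the article.

```python
import queue
import threading
from collections import defaultdict


class AppNotifier:
    """Hypothetical in-process stand-in for LISTEN/NOTIFY (an assumption, not Recall.ai's code)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._subscribers = defaultdict(list)

    def listen(self, channel):
        # Each listener gets its own queue, loosely mirroring LISTEN <channel>.
        q = queue.Queue()
        with self._lock:
            self._subscribers[channel].append(q)
        return q

    def notify(self, channel, payload):
        # Copy the subscriber list under the lock, deliver outside it.
        with self._lock:
            targets = list(self._subscribers[channel])
        for q in targets:
            q.put(payload)
        return len(targets)


def save_and_notify(commit_write, notifier, channel, payload):
    # The database transaction commits on its own; no NOTIFY runs inside it,
    # so the commit never touches the lock used by the notification queue.
    commit_write()
    # Only after a successful commit is the event published in the application.
    return notifier.notify(channel, payload)
```

Here `commit_write` stands for whatever callable performs and commits the database write. A real deployment with several application servers would need a cross-process transport in place of the in-memory queues; that part is outside the scope of this sketch.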

Development

A Linux Kernel Thread Lifecycle Gotcha: The Case of the Randomly Dying Chromium Process

2025-04-10

While optimizing the startup latency of Recall.ai's Output Media, an engineer hit a perplexing bug: the Chromium process would randomly terminate after launch. The root cause was traced to Bubblewrap's `--die-with-parent` flag, which relies on the Linux kernel's `PR_SET_PDEATHSIG`. With this option set, the child process receives SIGKILL when the parent thread that spawned it terminates, not when the parent process does. Tokio's thread management interacted with this behavior: when the parent thread was reaped, Chromium was killed unexpectedly. Removing the flag solved the issue, and the investigation exposed a little-known kernel quirk that calls for caution wherever thread lifecycles interact with process isolation.
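The thread-versus-process behavior can be reproduced outside Bubblewrap and Tokio. The sketch below is Linux-only and is not the article's code: it calls `prctl(PR_SET_PDEATHSIG, SIGKILL)` directly through `ctypes`, and the helper name `spawn_from_short_lived_thread` and the `sleep 30` child are illustrative assumptions. A worker thread starts the child and then returns while the parent process keeps running.

```python
import ctypes
import signal
import subprocess
import threading

PR_SET_PDEATHSIG = 1  # option number from <linux/prctl.h>

_libc = ctypes.CDLL(None, use_errno=True)


def _set_pdeathsig():
    # Runs in the child between fork() and exec(): ask the kernel to send
    # SIGKILL when the thread that created this process terminates.
    _libc.prctl(
        PR_SET_PDEATHSIG,
        ctypes.c_ulong(int(signal.SIGKILL)),
        ctypes.c_ulong(0),
        ctypes.c_ulong(0),
        ctypes.c_ulong(0),
    )


def spawn_from_short_lived_thread():
    """Start `sleep 30` from a thread that exits at once; return the child's exit status."""
    holder = {}

    def worker():
        holder["proc"] = subprocess.Popen(["sleep", "30"], preexec_fn=_set_pdeathsig)
        # Returning here ends the thread that forked the child.

    t = threading.Thread(target=worker)
    t.start()
    t.join()

    proc = holder["proc"]
    try:
        # The parent process is still alive, yet the child should already be dead.
        return proc.wait(timeout=10)
    except subprocess.TimeoutExpired:
        # Behavior not reproduced; clean up and report that with None.
        proc.kill()
        proc.wait()
        return None
```

On Linux the function should return `-signal.SIGKILL`, meaning the child was killed by SIGKILL long before its 30 seconds elapsed, even though the Python process that launched it never exited. Spawning the child from a thread that lives as long as the process avoids the problem without giving up the death signal.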

Development

83% Latency Reduction with Obscure Linux Process Flags

2025-03-06

An engineer optimizing Recall.ai's Output Media encountered a perplexing issue: the Chromium process terminated at random inside a sandboxed environment. Debugging traced the cause to the Linux kernel's `prctl(PR_SET_PDEATHSIG, SIGKILL)`, which tracks the parent thread, not the parent process. Tokio's thread management interacted with this unexpectedly: when the parent thread was reaped, the kernel sent SIGKILL and terminated the child process. Removing Bubblewrap's `--die-with-parent` flag resolved the issue, resulting in an 83% latency reduction.
