FizzBee: Modeling Mutual Exclusion and the Pitfalls of Redlock

2025-03-22

This article details the author's experience using FizzBee, a new formal specification language built on Starlark, to model mutual exclusion algorithms and investigate problems with the Redlock algorithm. By modeling critical sections, locks, leases, and fencing tokens, the author exposes limits in Redlock's fault tolerance and shows that fencing tokens do not completely solve the mutual exclusion problem. The exercise also revealed subtle flaws in the author's own understanding of fencing tokens, underscoring the value of formal specification in algorithm design. The author closes by discussing FizzBee's ease of use and its shortcomings.
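For readers unfamiliar with fencing tokens, here is a minimal Python sketch of the basic mechanism only. It is not the author's FizzBee model, and the names `LockService` and `FencedStore` are hypothetical, introduced purely for illustration. It assumes the lock service hands out a strictly increasing token with each lease and that the storage service refuses any write carrying a token older than one it has already accepted.

```python
class LockService:
    """Hypothetical lock service: issues a strictly increasing token per lease."""

    def __init__(self):
        self.next_token = 0

    def acquire(self):
        self.next_token += 1
        return self.next_token


class FencedStore:
    """Hypothetical storage service: rejects writes that carry a stale token."""

    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        # Refuse any write whose token is older than one already accepted.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.value = value
        return True


lock = LockService()
store = FencedStore()

token_a = lock.acquire()  # client A holds the lease, then stalls past its expiry
token_b = lock.acquire()  # the lease has expired, so client B is granted a newer token

accepted_b = store.write(token_b, "from B")  # accepted: newest token seen so far
accepted_a = store.write(token_a, "from A")  # rejected: token is stale
```

In this sketch the store can only reject A's write because B's write arrived first. Had A's delayed write reached the store before B's, both writes would have been accepted in token order, so the check orders writes rather than guaranteeing that only one client ever believes it holds the lock.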


Ignoring Near Misses: A Hidden Risk for Tech Companies

2025-02-08

FAA data reveals 30 near misses at Reagan Airport. This article argues that tech companies focus on preventing major incidents while overlooking the many near misses that could have escalated into one. Near misses are precursors to significant incidents, yet they are frequently ignored because they had no impact. The author advocates treating near misses as seriously as major incidents and building mechanisms to identify and analyze them proactively. Doing so requires a cultural shift that encourages reporting and analysis to improve reliability.


Canva Outage: A Case Study in Saturation and Resilience

2025-01-12

Canva recently experienced a major outage caused by system saturation. A new editor page deploy was not the culprit; instead, a stale Cloudflare CDN rule caused massive latency for Asian users loading JavaScript files. This triggered 270,000+ concurrent requests, which then hit the API gateway at 1.5 million requests per second, three times its typical peak. A known, unfixed performance bug in the API gateway made matters worse. The Linux OOM killer terminated all API gateway tasks, taking Canva.com down completely. Canva engineers resolved the issue by manually increasing task counts, temporarily blocking traffic with Cloudflare firewall rules, and then gradually restoring traffic. The incident highlights the importance of system resilience and the potential downsides of automated systems under heavy load.


The Future of Dashboard Design?

2024-12-23

This article explores the shortcomings of current dashboard design. The author points out that existing dashboards are often poorly designed and fail to use the human visual system effectively to process large amounts of information. The article reviews cognitive systems engineering research on dashboard design from the 1980s and 1990s, such as ecological interface design and visual momentum, and notes that the industry currently pays little attention to improving dashboards. The author calls for greater attention to dashboard design, with better integration of query functions and more efficient information processing.


OpenAI Outage: Unexpected Load from New Telemetry Service Causes Major Disruption

2024-12-16

OpenAI experienced a major service disruption on December 11th, stemming from a newly deployed telemetry service. Intended to improve reliability, this service unexpectedly generated massive Kubernetes API server load, saturating the servers and causing the Kubernetes control plane to fail in most large clusters. This led to the breakdown of DNS-based service discovery. The incident highlights the unpredictable interactions within complex systems and the challenges of testing for failure modes that only appear under full load. OpenAI restored service by scaling down clusters, blocking network access to Kubernetes admin APIs, and scaling up API servers.
