OceanGate Disaster: When Accountability Fails

2025-08-24

The OceanGate submersible implosion investigation report repeatedly mentions 'accountability,' but this article argues it's not a panacea. It categorizes problems into two types: coordination challenges and miscalibrated risk models. In coordination challenges, accountability can lead to blaming individuals while ignoring systemic issues. With miscalibrated risk models, even with the CEO piloting the submersible and having 'skin in the game,' incorrect risk assessment led to disaster. The article argues that solutions require cross-team collaboration and independent safety oversight, not just accountability. Accountability can exacerbate 'double binds,' where individuals face conflicting pressures, leading to safety risks being overlooked.

Formal Specifications: Beyond Instructions, Defining Software Behaviors

2025-07-28

This post examines the distinction between formal specifications and traditional programs. A program is a list of instructions; a formal specification is a set of behaviors. Using a counter as an example, the author shows how a specification defines all correct behaviors in set-theoretic terms, using two generators (Init and Next) to describe an infinite set of behaviors. The post also contrasts two meanings of nondeterminism: in a formal specification, it means a behavior can be extended in more than one way, while in a program it means the code path taken is uncertain. The article emphasizes that treating a formal specification as a set of behaviors is crucial for debugging it and for interpreting model checker errors.
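
The Init/Next idea can be sketched in ordinary code. The following is a minimal illustration, not the author's specification: the counter rules (increment or reset), the bound MAX_VALUE, and the function names are all assumptions made for the example. Init gives the allowed starting states, Next gives every state that may follow a given state, and together they generate the set of behaviors.

```python
from typing import List, Tuple

MAX_VALUE = 2  # hypothetical bound, chosen only to keep the example small


def init_states() -> List[int]:
    """Init: the set of allowed starting states."""
    return [0]


def next_states(state: int) -> List[int]:
    """Next: every state that may follow `state`.

    Assumed rules: the counter may increment (below the bound) or reset to 0.
    Returning more than one state is the specification sense of
    nondeterminism: a behavior can be extended in several ways.
    """
    if state < MAX_VALUE:
        return [state + 1, 0]
    return [0]


def behaviors(length: int) -> List[Tuple[int, ...]]:
    """Return every behavior prefix containing exactly `length` states."""
    frontier = [(s,) for s in init_states()]
    for _ in range(length - 1):
        frontier = [b + (n,) for b in frontier for n in next_states(b[-1])]
    return frontier


for behavior in behaviors(3):
    print(behavior)
```

Each printed tuple is one allowed prefix of a behavior. A model checker explores this same set, which is why its error output takes the form of a single behavior that violates a property.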

Amazon Alexa's AI Failure: A Case Study in Brittleness

2025-06-11

This article analyzes why Amazon's Alexa lagged behind competitors in the large language model space, framing it as a 'brittleness' failure within resilience engineering. The author highlights three key contributing factors: inefficient resource allocation hindering timely access to crucial compute resources; a highly decentralized organizational structure fostering misaligned team goals and internal conflict; and an outdated customer-centric approach ill-suited to the experimental and long-term nature of AI research. These combined factors led to Amazon's AI setback, offering valuable lessons for organizational structure and resource management.

AI

Beyond Root Cause Analysis: Resilience Engineering for Complex System Failures

2025-05-24

This article critiques the limitations of Root Cause Analysis (RCA) in analyzing complex system failures, arguing that its flawed causal chain model fails to effectively address failures caused by the interaction of multiple factors in complex systems. The author proposes Resilience Engineering (RE) as an alternative. RE focuses on interactions between system components rather than single causes. RE acknowledges that systems always contain numerous latent failures; success lies in the system's adaptive capacity and fault tolerance. By understanding how the system adapts and copes with failures, rather than simply eliminating root causes, continuous improvement and increased system resilience are achieved.

FizzBee: Modeling Mutual Exclusion and the Pitfalls of Redlock

2025-03-22

This article details the author's experience using FizzBee, a new formal specification language built on Starlark, to model mutual exclusion algorithms and investigate issues with the Redlock algorithm. By modeling critical sections, locks, leases, and fencing tokens, the author reveals limitations in Redlock's fault tolerance, ultimately showing that fencing tokens don't completely solve mutual exclusion problems. The author concludes by discussing FizzBee's ease of use and shortcomings while highlighting the importance of formal specification in algorithm design. The practical exercise unexpectedly revealed subtle flaws in the author's understanding of fencing tokens, underscoring the value of formal methods.
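
For readers unfamiliar with fencing tokens, the basic mechanism can be sketched as follows. This is a minimal Python illustration rather than the author's FizzBee model; the FencedStore class, its fields, and the two-client scenario are hypothetical names introduced for the example. The lock service is assumed to hand out increasing token numbers, and the storage service rejects any write whose token is older than one it has already seen.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FencedStore:
    """Hypothetical storage service that checks a fencing token on each write."""

    highest_token: int = 0
    value: Optional[str] = None

    def write(self, token: int, value: str) -> bool:
        # Reject any write carrying a token older than one already seen.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.value = value
        return True


store = FencedStore()
# Assumed scenario: client A acquires the lock with token 1, then stalls
# past its lease. Client B acquires the lock with token 2 and writes first.
b_ok = store.write(2, "from B")
a_ok = store.write(1, "from A")  # the stale holder's write is rejected
print(b_ok, a_ok, store.value)
```

In this sketch the check only rejects a stale write once a newer token has reached the store. If A's delayed write arrives before B's, both writes are accepted, which is one way a fencing token can fall short of full mutual exclusion; whether this matches the exact flaw the article found is an assumption.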

Development mutual exclusion

Ignoring Near Misses: A Hidden Risk for Tech Companies

2025-02-08

FAA data reveals 30 near misses at Reagan Airport. This article argues that tech companies often prioritize preventing major incidents while overlooking the many near misses that could have escalated. Near misses are precursors to significant incidents, yet they are frequently ignored because they had no impact. The author advocates treating near misses as seriously as major incidents and creating mechanisms to identify and analyze them proactively. This requires a cultural shift that encourages reporting and analysis in order to improve reliability.

Canva Outage: A Case Study in Saturation and Resilience

2025-01-12

Canva recently experienced a major outage stemming from system saturation. A new editor page deploy wasn't the culprit; instead, a stale Cloudflare CDN rule caused massive latency for Asian users loading JavaScript files. This triggered 270,000+ concurrent requests, which then overwhelmed the API gateway with 1.5 million requests per second, three times its typical peak. A known, unfixed performance bug in the API gateway exacerbated the issue. The Linux OOM killer terminated all API gateway tasks, resulting in complete failure of Canva.com. Canva engineers resolved the issue by manually increasing task counts, temporarily blocking traffic via Cloudflare firewall rules, and gradually restoring traffic. This incident highlights the importance of system resilience and the potential downsides of automated systems under heavy load.

The Future of Dashboard Design?

2024-12-23

This article explores the shortcomings of current dashboard design. The author points out that existing dashboards are often poorly designed and fail to make effective use of the human visual system to process large amounts of information. The article reviews cognitive systems engineering research on dashboard design from the 1980s and 1990s, such as ecological interface design and visual momentum, and notes the industry's current lack of focus on improving dashboards. The author calls for greater attention to dashboard design, including better integration of query functions, to improve information processing efficiency.

OpenAI Outage: Unexpected Load from New Telemetry Service Causes Major Disruption

2024-12-16

OpenAI experienced a major service disruption on December 11th, stemming from a newly deployed telemetry service. Intended to improve reliability, this service unexpectedly generated massive Kubernetes API server load, saturating the servers and causing the Kubernetes control plane to fail in most large clusters. This led to the breakdown of DNS-based service discovery. The incident highlights the unpredictable interactions within complex systems and the challenges of testing for failure modes that only appear under full load. OpenAI restored service by scaling down clusters, blocking network access to Kubernetes admin APIs, and scaling up API servers.
