Canva Outage: A Case Study in Saturation and Resilience
![Canva Outage: A Case Study in Saturation and Resilience](https://surfingcomplexity.blog/wp-content/uploads/2024/12/competence-envelope-1.png)
Canva recently experienced a major outage caused by system saturation. The deploy of a new editor page was not itself the culprit. Instead, a stale Cloudflare CDN rule caused severe latency for users in Asia loading the page's JavaScript files. The slow download let more than 270,000 requests accumulate, and when they completed at roughly the same time, the resulting burst hit the API gateway with 1.5 million requests per second, about three times its typical peak. A known but unfixed performance bug in the API gateway made things worse. The Linux OOM killer then terminated all of the API gateway tasks, taking Canva.com down completely.

Canva engineers resolved the incident by manually increasing the task count, temporarily blocking traffic with Cloudflare firewall rules, and then restoring traffic gradually. The incident highlights the importance of system resilience and the potential downsides of automated systems under heavy load.
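The "restore traffic gradually" step can be illustrated with a percentage-based admission ramp. This is a hypothetical sketch, not Canva's mechanism: the summary says only that they used Cloudflare firewall rules. The client IDs, the SHA-256 bucketing, and the ramp schedule below are all assumptions chosen for illustration.

```python
import hashlib

# Hypothetical ramp schedule (percent of clients admitted at each step).
# In practice an operator would advance only after the previous step looks healthy.
RAMP_STEPS = [0, 10, 25, 50, 100]


def admitted(client_id: str, percent: int) -> bool:
    """Return True if this (hypothetical) client ID falls inside the admitted percentage.

    The client ID is hashed into a stable bucket in 0..99, so the same client
    always lands in the same bucket regardless of the current percentage.
    """
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < percent


def admitted_count(client_ids, percent: int) -> int:
    """Count how many of the given clients would be let through at this step."""
    return sum(admitted(c, percent) for c in client_ids)


if __name__ == "__main__":
    clients = [f"client-{i}" for i in range(10_000)]
    for pct in RAMP_STEPS:
        print(f"{pct:>3}% step admits {admitted_count(clients, pct)} of {len(clients)} clients")
```

Hashing rather than random sampling means a client admitted at one step stays admitted at every later step, so sessions that have already recovered are not dropped again as the ramp advances.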