Adversarial Policies Defeat Superhuman Go AIs
2024-12-24
Researchers trained adversarial policies that achieved a win rate above 97% against KataGo, a state-of-the-art Go AI. The adversaries did not win by playing Go well; they won by tricking KataGo into critical blunders. The attack transferred zero-shot to other superhuman Go AIs and was simple enough for human experts to replicate without algorithmic assistance. The vulnerability persisted even after KataGo was adversarially trained to defend against it, which points to surprising failure modes in even superhuman AI systems.