Cheap and Effective Language Translation Quality Benchmark
2025-05-20

A developer set out to build a more scientifically rigorous language translation quality benchmark using pairwise evaluations and a Bradley-Terry model. The first attempts proved too expensive, with each experiment costing hundreds or even thousands of dollars. The developer then devised a compromise that combines the old scoring system with pairwise evaluations. It processes sentences iteratively, scores them with multiple translation evaluation systems, and applies statistical analysis to the results. This cut costs drastically while still yielding reliable results with good p-values. The new system gives up some rigor in blinding, but it is far more efficient: a German test was completed for about $6.
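For readers unfamiliar with the Bradley-Terry model mentioned above, here is a minimal sketch of how pairwise "which translation is better" judgments can be turned into per-system strength scores. It is not the developer's actual code: the summary does not say how the model was fitted, so the iterative minorization-maximization update used here is an assumption, and the system names and win counts are made up purely for illustration.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, n_iter=500, tol=1e-10):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses the classic iterative update
        s_i <- wins_i / sum_j( n_ij / (s_i + s_j) )
    and rescales so the strengths average 1.0.
    Assumption: every system wins at least once; otherwise its
    strength collapses to zero and the estimate is degenerate.
    """
    wins = defaultdict(float)         # total wins per system
    pair_counts = defaultdict(float)  # comparisons per unordered pair
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        items.update((winner, loser))
    items = sorted(items)
    if not items:
        return {}

    strength = {i: 1.0 for i in items}
    for _ in range(n_iter):
        new = {}
        for i in items:
            denom = 0.0
            for j in items:
                if i == j:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0.0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            new[i] = wins[i] / denom if denom else strength[i]
        # Rescale: strengths are only defined up to a constant factor.
        total = sum(new.values())
        new = {i: v * len(items) / total for i, v in new.items()}
        delta = max(abs(new[i] - strength[i]) for i in items)
        strength = new
        if delta < tol:
            break
    return strength


def win_probability(strength, a, b):
    """Model probability that system a beats system b."""
    return strength[a] / (strength[a] + strength[b])


# Hypothetical judgments: each tuple is (preferred system, other system).
comparisons = (
    [("sys_a", "sys_b")] * 7 + [("sys_b", "sys_a")] * 3
    + [("sys_b", "sys_c")] * 6 + [("sys_c", "sys_b")] * 4
    + [("sys_a", "sys_c")] * 8 + [("sys_c", "sys_a")] * 2
)
strengths = fit_bradley_terry(comparisons)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```

Every judgment in a setup like this is a paid evaluator call, and the number of pairs grows quickly with the number of systems and sentences, which is consistent with the cost problem the summary describes.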