Benchmarking Code Retrieval: Challenges and Voyage AI's Approach
2025-02-03

Modern coding assistants rely heavily on code retrieval, but existing evaluation methods fall short. Voyage AI's research highlights problems with current datasets, including noisy relevance labels, little assessment of deep algorithmic reasoning, and data contamination, all of which make model evaluations unreliable. To address this, Voyage AI proposes two methods for building high-quality code retrieval datasets: repurposing existing question-answer datasets, and constructing query-document pairs from GitHub repositories and their issues/tickets. Voyage AI also built an internal benchmarking suite spanning multiple programming languages, several QA datasets, and domain-specific benchmarks, and used it to evaluate a range of code embedding models; voyage-code-3 emerged as the top performer.
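
To make the first dataset-construction method concrete, below is a minimal sketch of repurposing a QA dataset for retrieval evaluation: each question becomes a query, its answer's code snippet becomes the sole labeled-relevant document, and a model is scored by ranking all documents with cosine similarity. The `embed` function is a hypothetical placeholder standing in for any code embedding model, and NDCG@10 is used as an illustrative metric; this is not Voyage AI's published pipeline.

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Hypothetical placeholder: swap in a real code embedding model.
    Returns one L2-normalized vector per input text."""
    rng = np.random.default_rng(0)  # random stand-in vectors only
    vecs = rng.normal(size=(len(texts), 8))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def ndcg_at_10(queries, docs, relevant_idx):
    """NDCG@10 with exactly one relevant document per query.
    Documents are ranked by cosine similarity, which for unit
    vectors is just the dot product."""
    q_vecs, d_vecs = embed(queries), embed(docs)
    scores = q_vecs @ d_vecs.T  # (num_queries, num_docs) similarities
    total = 0.0
    for qi, rel in enumerate(relevant_idx):
        top10 = np.argsort(-scores[qi])[:10]  # indices of top-10 docs
        hits = np.where(top10 == rel)[0]
        if hits.size:  # DCG of a single relevant hit at rank r (0-based)
            total += 1.0 / np.log2(hits[0] + 2)
    # With one relevant doc, the ideal DCG is 1.0, so NDCG == mean DCG.
    return total / len(queries)

# Repurposed QA data: the question is the query, the accepted
# answer's code snippet is the labeled-relevant document.
queries = ["How do I reverse a linked list in Python?"]
docs = ["def reverse(head): ...", "def quicksort(arr): ..."]
print(ndcg_at_10(queries, docs, relevant_idx=[0]))
```

The same harness extends to the second method: pair each GitHub issue (as the query) with the files or functions touched by the commit that closed it (as the relevant documents), then score models identically.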