Building a Robust Evaluation Framework for RAG Systems

2025-02-14
Qodo built a Retrieval-Augmented Generation (RAG) based AI coding assistant and, alongside it, a robust evaluation framework to ensure the assistant's answers are accurate and comprehensive. The core challenge was verifying the correctness of RAG outputs derived from large, private datasets. The framework evaluates both stages of the pipeline: the documents the retriever returns ('retrieval accuracy') and the final generated output ('answer correctness'). Because free-form natural language resists exact-match checks, the team adopted an 'LLM-as-judge' approach and built a ground-truth dataset of real questions, reference answers, and supporting context. To keep this tractable, they used LLMs to assist in constructing the dataset and evaluated answer correctness with both LLM judges and RAGAS. Ultimately, they built their own LLM judge and combined it with RAGAS for improved reliability, then integrated the framework into their workflow as a regression test suite, dramatically reducing the effort required to verify how code changes affect output quality.
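As a rough illustration of the RAGAS side, the sketch below scores 'answer correctness' on a single hand-written example using the classic `ragas.evaluate` API with the built-in `answer_correctness` metric. The question, answer, and context values are invented placeholders rather than Qodo's data, and RAGAS itself needs an LLM backend (by default an OpenAI key) to run.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness

# Placeholder entries standing in for the curated ground-truth dataset of
# real questions, reference answers, and retrieved context.
eval_data = {
    "question": ["How does the retry helper handle timeouts?"],
    "answer": ["It retries the call up to three times with exponential backoff."],
    "contexts": [["def retry(fn, attempts=3): ...  # backoff doubles per attempt"]],
    "ground_truth": ["Calls are retried up to 3 times, doubling the backoff each time."],
}

dataset = Dataset.from_dict(eval_data)

# answer_correctness compares the generated answer against the ground truth,
# blending semantic similarity with LLM-judged factual overlap.
result = evaluate(dataset, metrics=[answer_correctness])
print(result)  # e.g. {'answer_correctness': 0.87}
```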

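A custom LLM judge wired into a regression gate might look roughly like the following; the prompt, model name, helper names, and the 0.9 threshold are all illustrative assumptions, not details from Qodo's implementation.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; model choice is illustrative

JUDGE_PROMPT = """You are grading a RAG answer for a coding assistant.
Question: {question}
Reference answer: {ground_truth}
Candidate answer: {answer}
Reply with JSON: {{"verdict": "correct" or "incorrect", "reason": "..."}}"""

def judge_answer(question: str, ground_truth: str, answer: str) -> dict:
    """Ask an LLM to compare a candidate answer against the reference answer."""
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, ground_truth=ground_truth, answer=answer),
        }],
    )
    return json.loads(response.choices[0].message.content)

def check_answer_correctness_regression(dataset, rag_pipeline, threshold=0.9):
    """Fail if the share of LLM-judged-correct answers drops below threshold."""
    verdicts = [
        judge_answer(row["question"], row["ground_truth"],
                     rag_pipeline(row["question"]))["verdict"]
        for row in dataset
    ]
    accuracy = verdicts.count("correct") / len(verdicts)
    assert accuracy >= threshold, f"Answer correctness regressed: {accuracy:.2%}"
```

Run in CI, a check like this turns the evaluation framework into the regression test described above: any change to retrieval or prompting that degrades judged correctness fails the build.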
Development LLM Evaluation