Open-Source Benchmark for LLM OCR and Data Extraction
Omni is an open-source benchmarking tool that compares the OCR and data extraction capabilities of large multimodal models such as gpt-4o against traditional OCR providers, evaluating both text and JSON extraction accuracy. The dataset and methodology are open-source, and contributions that expand them are encouraged. The benchmark focuses on JSON extraction and measures the accuracy of the entire pipeline: Document ⇒ OCR ⇒ Extraction. JSON accuracy is scored with a modified json-diff, and text similarity with Levenshtein distance. The tool supports models from several providers, including OpenAI, Google Gemini, and Anthropic, and offers a simple command-line interface with JSON output.
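To make the two metrics concrete, here is a minimal Python sketch of how they could be computed. It is an illustration under stated assumptions, not the benchmark's own code: the function names (levenshtein, text_similarity, flatten, json_accuracy) are hypothetical, the text score is assumed to be edit distance normalized by the longer string, and the JSON score is a simple leaf-by-leaf comparison that stands in for the modified json-diff, whose exact rules are not described here.

```python
from typing import Any


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def text_similarity(expected: str, actual: str) -> float:
    """Assumed normalization: 1.0 is identical, 0.0 is entirely different."""
    longest = max(len(expected), len(actual))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(expected, actual) / longest


def flatten(value: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts and lists into {dotted.path: leaf_value}."""
    if isinstance(value, dict) and value:
        items = list(value.items())
    elif isinstance(value, list) and value:
        items = list(enumerate(value))
    else:
        return {prefix: value}  # scalar or empty container counts as a leaf
    flat: dict[str, Any] = {}
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(child, path))
    return flat


def json_accuracy(expected: Any, actual: Any) -> float:
    """Share of leaf paths that appear in both documents with equal values.

    Missing, extra, and changed fields all count against the score.
    This is a stand-in for a json-diff based metric, not a port of it.
    """
    exp, act = flatten(expected), flatten(actual)
    all_paths = set(exp) | set(act)
    matches = sum(
        1 for path in all_paths
        if path in exp and path in act and exp[path] == act[path]
    )
    return matches / len(all_paths)


if __name__ == "__main__":
    truth = {"invoice": {"id": "A1", "total": 10}, "items": ["x", "y"]}
    predicted = {"invoice": {"id": "A1", "total": 12}, "items": ["x"]}
    print(json_accuracy(truth, predicted))       # 0.5
    print(text_similarity("kitten", "sitting"))  # about 0.571
```

In the example, two of the four leaf paths match (invoice.id and items.0), so the JSON score is 0.5. Because the score is taken over the final extracted JSON, an OCR error and an extraction error both lower it, which is what measuring the whole Document ⇒ OCR ⇒ Extraction pipeline implies.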