OmniParser V2: Screen Parsing Tool for Pure Vision-Based GUI Agents
2025-02-15
OmniParser is a comprehensive method for parsing UI screenshots into structured, understandable elements. It significantly improves GPT-4V's ability to generate actions that are accurately grounded in specific regions of the interface. The recently released OmniParser V2 achieves state-of-the-art results (39.5% on the ScreenSpot-Pro grounding benchmark). It also introduces OmniTool, which lets you control a Windows 11 VM with your vision model of choice. Detailed installation instructions and demos are provided, and the model weights are available on Hugging Face.
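The sketch below shows how a list of parsed elements can ground a model's action. It makes several assumptions. The element schema is hypothetical: a normalized bounding box, an interactable flag, and a text or icon description. The names `ParsedElement`, `describe_elements`, `click_point` and `ground_action` are illustrative only and are not OmniParser's actual API. The idea is that the parser's elements are listed by ID, the vision model picks an ID, and the program converts that element's box into a pixel coordinate to click.

```python
from dataclasses import dataclass


@dataclass
class ParsedElement:
    """One UI element from a screen parser (hypothetical schema, not OmniParser's)."""
    element_id: int
    kind: str                                # e.g. "text" or "icon"
    bbox: tuple[float, float, float, float]  # normalized (x1, y1, x2, y2) in [0, 1]
    interactable: bool
    content: str                             # OCR text or a functional caption


def describe_elements(elements: list[ParsedElement]) -> str:
    """Render elements as ID-tagged lines that a model can refer to in its answer."""
    lines = []
    for e in elements:
        flag = "interactable" if e.interactable else "static"
        lines.append(f"ID {e.element_id}: {e.kind} '{e.content}' ({flag})")
    return "\n".join(lines)


def click_point(element: ParsedElement, width: int, height: int) -> tuple[int, int]:
    """Convert an element's normalized box to the pixel center of that box."""
    x1, y1, x2, y2 = element.bbox
    return round((x1 + x2) / 2 * width), round((y1 + y2) / 2 * height)


def ground_action(elements: list[ParsedElement], chosen_id: int,
                  width: int, height: int) -> tuple[int, int]:
    """Map the element ID chosen by the model to a concrete click coordinate."""
    by_id = {e.element_id: e for e in elements}
    if chosen_id not in by_id:
        raise KeyError(f"model referenced unknown element {chosen_id}")
    target = by_id[chosen_id]
    if not target.interactable:
        raise ValueError(f"element {chosen_id} is not interactable")
    return click_point(target, width, height)


if __name__ == "__main__":
    elements = [
        ParsedElement(0, "text", (0.10, 0.05, 0.30, 0.09), False, "Inbox"),
        ParsedElement(1, "icon", (0.90, 0.02, 0.95, 0.08), True, "Settings gear"),
    ]
    print(describe_elements(elements))               # text given to the model
    print(ground_action(elements, 1, 1920, 1080))    # (1776, 54) on a 1920x1080 screen
```

The model only ever outputs an element ID. The final pixel coordinate therefore comes from the parser's detected box rather than from the model's own coordinate estimate, which is the kind of grounding the structured output is meant to enable.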