NVIDIA Ingest: Microservices for Efficiently Parsing Massive Documents

2025-01-10

NVIDIA Ingest is an early-access set of microservices designed to efficiently parse hundreds of thousands of complex, messy, unstructured PDFs and other enterprise documents, extracting metadata and text for embedding into retrieval systems. Built on NVIDIA NIM microservices, it supports PDF, Word, and PowerPoint files as well as images. It extracts text, tables, charts, and images, contextualizes them, and outputs structured JSON. Embeddings can optionally be computed and stored in a Milvus vector database. A Python client and a command-line interface are provided for ease of use.
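To make the "structured JSON" output more concrete, here is a minimal sketch of how a downstream consumer might sort extracted elements by type and pick out the text worth embedding. It does not call NVIDIA Ingest. The sample records and the field names (`document_type`, `metadata`, `content`, `page_number`) are assumptions made for illustration and may not match the actual output schema.

```python
import json
from collections import defaultdict

# Hypothetical extraction output. The field names below are illustrative
# assumptions, not the documented NVIDIA Ingest schema.
SAMPLE_OUTPUT = """
[
  {"document_type": "text",  "metadata": {"content": "Quarterly revenue grew.", "page_number": 1}},
  {"document_type": "table", "metadata": {"content": "| Q1 | Q2 |", "page_number": 2}},
  {"document_type": "image", "metadata": {"content": "", "page_number": 2}}
]
"""


def group_by_type(records):
    """Group each record's metadata under its element type."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["document_type"]].append(record["metadata"])
    return dict(grouped)


def embeddable_chunks(records):
    """Return (page_number, text) pairs for elements with non-empty text."""
    return [
        (r["metadata"]["page_number"], r["metadata"]["content"])
        for r in records
        if r["metadata"]["content"].strip()
    ]


records = json.loads(SAMPLE_OUTPUT)
grouped = group_by_type(records)
chunks = embeddable_chunks(records)

for page, text in chunks:
    print(f"page {page}: {text}")
```

In a real pipeline, the `(page_number, text)` pairs would be the input to an embedding model, with the resulting vectors written to a vector database such as Milvus. That step is omitted here because it depends on the deployed services.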
