Multimodal Document Processing uses AI to analyze documents that contain text, images, tables, and diagrams. NEC Labs America advances this field by integrating natural language processing with computer vision and structured data models. Applications include scientific paper understanding, legal document review, and medical record analysis. By drawing on every format in a document rather than text alone, Multimodal Document Processing supports research, compliance, and decision-making.

Posts

EcoDoc: A Cost-Efficient Multimodal Document Processing System for Enterprises Using LLMs

Enterprises are increasingly adopting Generative AI applications to extract insights from large volumes of multimodal documents in domains such as finance, law, healthcare, and industry. These documents mix structured and unstructured data (images, charts, handwritten text, etc.), which requires robust AI systems for effective retrieval and comprehension. Recent advancements in Retrieval-Augmented Generation (RAG) frameworks and Vision-Language Models (VLMs) have improved retrieval performance on multimodal documents by processing pages as images. However, large-scale deployment remains challenging due to the high cost of LLM API usage and the slower inference of image-based page processing compared with text-based processing. To address these challenges, we propose EcoDoc, a cost-effective multimodal document processing system that dynamically selects the processing modality for each page, image or text, based on page characteristics and query intent. Our experimental evaluation on the TAT-DQA and DocVQA benchmarks shows that EcoDoc reduces average query processing latency by up to 2.29× and cost by up to 10×, without compromising accuracy.
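
The abstract does not say how EcoDoc makes its per-page choice, so the following is only an illustrative sketch of the general idea of routing each page to image or text processing from page characteristics and query intent. The `Page` fields, the keyword list, the `min_text_chars` threshold, and the function names are all hypothetical and are not taken from EcoDoc.

```python
import re
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical page summary; these fields are assumptions, not EcoDoc's."""
    text: str            # text extracted from the page (may be empty for scans)
    num_images: int = 0  # count of embedded images or charts
    num_tables: int = 0  # count of detected tables


# Assumed stand-in for "query intent": words hinting at visual content.
VISUAL_QUERY_TERMS = {"chart", "figure", "graph", "diagram", "image", "table", "plot"}


def query_needs_visuals(query: str) -> bool:
    """Return True if the query mentions a visual element (crude keyword check)."""
    words = {w.rstrip("s") for w in re.findall(r"[a-z]+", query.lower())}
    return bool(words & VISUAL_QUERY_TERMS)


def choose_modality(page: Page, query: str, min_text_chars: int = 200) -> str:
    """Pick "image" or "text" for one page; a heuristic sketch only."""
    has_visuals = page.num_images > 0 or page.num_tables > 0
    text_is_sparse = len(page.text.strip()) < min_text_chars

    # Pages with little extractable text (e.g. scans) go to the costlier image path.
    if text_is_sparse:
        return "image"
    # Visual pages use the image path only when the query seems to need visuals.
    if has_visuals and query_needs_visuals(query):
        return "image"
    # Otherwise take the cheaper, faster text path.
    return "text"
```

Under these assumptions, a text-rich page that contains a chart is sent down the image path only when the query refers to a visual element, so the more expensive image processing is reserved for the pages and queries that seem to need it.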