DDA: Deep Document Analysis

Unstructured data is growing at an unprecedented rate, and valuable knowledge, including findings, observations, business demands, and opportunities, is widely recorded as text in documents. We are developing advanced analysis engines for mining text data in documents, aiming to discover valuable knowledge from large-scale document collections and to support informed decision-making for users.

This project focuses on document knowledge discovery using advanced natural language processing, deep learning, and machine learning techniques. It builds innovative analytic engines to model the large amounts of document data generated in various scenarios. The engines provide interpretable knowledge with low resource requirements across different languages and domains, helping customers understand and optimize their decision-making processes.

Deep Document Analysis and Large Language Models

In addition, this project focuses on advancing the state of the art in NLP for document understanding. Toward this goal, we work on tasks such as information extraction, language modeling, domain adaptation, intention detection, and contrastive augmentation. We also provide solutions for different industries (e.g., finance, business, security, and systems) that help customers with operations management and decision-making optimization. Examples include document-based business matching, business process optimization, threat intelligence discovery, and log-based system management.

Team Members: Yanchi Liu, Xujiang Zhao, Wei Cheng, Haifeng Chen

Keyword Tags: dda, deep document, language models, large language model

Large Language Model Publications

MixLLM: Dynamic Routing in Mixed Large Language Models

April 29, 2025/2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL 2025)

Large Language Models (LLMs) have recently shown potential for artificial general intelligence; however, they are costly to use and have high response latency. Given mixed LLMs with their own strengths and weaknesses, LLM routing aims to identify the most suitable model for each query in the stream to maximize response

TSLA: Unified Time Series and Language Model

April 10, 2025/2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)

Real-world time series data often require analysis or interpretation from domain experts. Some tasks, like time series question answering, involve both time series and natural language questions, posing challenges for single-modality language models to understand their interaction. To this end, we present

G-Litter Marine Litter Dataset Augmentation with Diffusion Models and Large Language Models on GPU Acceleration

March 12, 2025/Applications, Libraries, and Tools for Computational Science and Machine Learning on Heterogeneous HPC Environments Workshop at PDP 2025

Marine litter detection is crucial for environmental monitoring, yet the imbalance in existing datasets limits model performance in identifying various types of waste accurately. This paper presents an efficient data augmentation pipeline that combines generative diffusion models (e.g., Stable Diffusion)

DiCE-M: Distributed Code Generation and Execution for Marine Applications – An Edge-Cloud Approach

December 7, 2024/International Workshop on Edge Intelligence in conjunction with ACM SEC 2024

Edge computing has emerged as a transformative technology that reduces application latency, improves cost efficiency, enhances security, and enables large-scale deployment of applications across various domains. In environmental monitoring, systems such as MegaSense [49] use low-cost sensors to gather

DiCE: Distributed Code generation and Execution

November 5, 2024/The 22nd IEEE International Conference on Pervasive Intelligence and Computing (PICom 2024)

Generative artificial intelligence (GenAI), specifically, Large Language Models (LLMs), have shown tremendous potential in automating several tasks and improving human productivity. Recent works have shown them to be quite useful in writing and summarizing text (articles, blogs, poems, stories, songs,

TrafficLens: Multi-Camera Traffic Video Analysis Using LLMs

September 24, 2024/27th IEEE International Conference on Intelligent Transportation Systems (ITSC 2024)

Traffic cameras are essential in urban areas, playing a crucial role in intelligent transportation systems. Multiple cameras at intersections enhance law enforcement capabilities, traffic management, and pedestrian safety. However, efficiently managing and analyzing multi-camera feeds poses challenges

Introducing Our New Project: Time Series Language Model for Explainable AI

August 13, 2024

Our new project, Time Series Language Model for Explainable AI, represents a significant leap forward in the field of forecasting and explainable AI. By combining advanced forecasting techniques with explainable AI, we are paving the way for a future where data-driven insights are not only accurate but

Agentic LLMs for AI Orchestration Project: Revolutionizing Complex Workflows

August 8, 2024

The development of Agentic LLMs for AI Orchestration represents a significant advancement in artificial intelligence. By seamlessly integrating computer vision, logic, and compute modules, our LLM is poised to revolutionize the way complex workflows are managed and executed. Supported by robust research

DFA-RAG: Conversational Semantic Router for Large Language Model with Definite Finite Automaton

July 27, 2024/The Forty-first International Conference on Machine Learning (ICML 2024), Vienna, Austria

This paper introduces the retrieval-augmented large language model with Definite Finite Automaton (DFA-RAG), a novel framework designed to enhance the capabilities of conversational agents using large language models (LLMs). Traditional LLMs face challenges in generating regulated and compliant responses

AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving

June 17, 2024/CVPR 2024

Autonomous vehicle (AV) systems rely on robust perception models as a cornerstone of safety assurance. However, objects encountered on the road exhibit a long-tailed distribution, with rare or unseen categories posing challenges to a deployed perception model. This necessitates an expensive process of

ECO-LLM: LLM-based Edge Cloud Optimization

June 3, 2024/AI4Sys '24 at HPDC 2024

AI/ML techniques have been used to solve systems problems, but their applicability to customize solutions on-the-fly has been limited. Traditionally, any customization required manually changing the AI/ML model or modifying the code, configuration parameters, application settings, etc. This incurs too

Self-Consistent Decoding for More Factual Open Responses

February 29, 2024/https://arxiv.org

Self-consistency has emerged as a powerful method for improving the accuracy of short answers generated by large language models. As previously defined, it only concerns the accuracy of a final answer parsed from generated text. In this work, we extend the idea to open response generation, by integrating

Improving Language-Based Object Detection by Explicit Generation of Negative Examples

December 21, 2023/https://arxiv.org

The recent progress in language-based object detection with an open vocabulary can be largely attributed to finding better ways of leveraging large-scale data with free-form text annotations. Training from image captions with grounded bounding boxes (ground truth or pseudo-labeled) enables the models

LLM-ASSIST: Enhancing Closed-Loop Planning with Language-Based Reasoning

December 8, 2023/https://arxiv.org

Although planning is a crucial component of the autonomous driving stack, researchers have yet to develop robust planning algorithms that are capable of safely handling the diverse range of possible driving scenarios. Learning-based planners suffer from overfitting and poor long-tail performance. On

Beyond One Model Fits All: A Survey of Domain Specialization for Large Language Models

June 9, 2023/arXiv

Large language models (LLMs) have significantly advanced the field of natural language processing (NLP), providing a highly useful, task-agnostic foundation for a wide range of applications. The great promise of LLMs as general task solvers motivated people to extend their functionality largely beyond

Dynamic Prompting: A Unified Framework for Prompt Tuning

March 6, 2023/arXiv

It has been demonstrated that prompt tuning is highly effective in efficiently eliciting knowledge from language models (LMs). However, prompt tuning still lags behind fine-tuning, especially when the LMs are small. P-tuning v2 (Liu et al., 2021b) makes it comparable with fine-tuning by adding continuous