Posts

Towards Counterfactual Fairness-aware Domain Generalization in Changing Environments

Domain generalization is a common challenge in machine learning: in practical scenarios, the data distribution may progressively evolve across a continuum of sequential domains. While current methodologies primarily concentrate on bolstering model effectiveness within these new domains, they tend to neglect fairness throughout the learning process. In response, we propose an innovative framework, Disentanglement for Counterfactual Fairness-aware Domain Generalization (DCFDG), which removes domain-specific information and sensitive information from the embedded representation of classification features. To scrutinize the intricate interplay between semantic information, domain-specific information, and sensitive attributes, we systematically partition the exogenous factors into four latent variables. By incorporating fairness regularization, we utilize semantic information exclusively for classification. Empirical validation on synthetic and real-world datasets substantiates the efficacy of our approach, demonstrating high accuracy while preserving fairness across the evolving landscape of continuous domains.
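
For intuition, here is a minimal PyTorch sketch of the disentanglement idea, not the DCFDG implementation itself: an encoder splits its output into four latent blocks, only the semantic block feeds the classifier, and a covariance penalty (our stand-in for the paper's fairness regularization) discourages the semantic latent from carrying the sensitive attribute. All dimensions and data below are hypothetical.

```python
# Minimal sketch of the disentanglement idea, assuming hypothetical
# dimensions and data; this is not the authors' DCFDG implementation.
import torch
import torch.nn as nn

class DisentangledEncoder(nn.Module):
    """Encodes x into four latent blocks: semantic, domain, sensitive, noise."""
    def __init__(self, x_dim=32, z_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 4 * z_dim))
        self.z_dim = z_dim

    def forward(self, x):
        return torch.split(self.net(x), self.z_dim, dim=-1)

encoder, classifier = DisentangledEncoder(), nn.Linear(8, 2)
x = torch.randn(16, 32)                    # features (hypothetical)
s = torch.randint(0, 2, (16, 1)).float()   # sensitive attribute (hypothetical)
y = torch.randint(0, 2, (16,))             # class labels (hypothetical)

z_sem, z_dom, z_sens, z_eps = encoder(x)
logits = classifier(z_sem)                 # classification uses semantics only

# Stand-in fairness regularizer: penalize covariance between the semantic
# latent and the sensitive attribute so predictions cannot exploit it.
cov = ((z_sem - z_sem.mean(0)) * (s - s.mean())).mean(0)
loss = nn.functional.cross_entropy(logits, y) + 1.0 * (cov ** 2).sum()
loss.backward()
```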

DFA-RAG: Conversational Semantic Router for Large Language Model with Definite Finite Automaton

This paper introduces the retrieval-augmented large language model with Definite Finite Automaton (DFA-RAG), a novel framework designed to enhance the capabilities of conversational agents built on large language models (LLMs). Traditional LLMs struggle to generate regulated and compliant responses in scenarios with predetermined response guidelines, such as emotional support and customer service. Our framework addresses these challenges by embedding a Definite Finite Automaton (DFA), learned from training dialogues, within the LLM. This structured approach acts as a semantic router that enables the LLM to adhere to a deterministic response pathway. The routing is achieved through a retrieval-augmented generation (RAG) strategy, which carefully selects dialogue examples aligned with the current conversational context. The advantages of DFA-RAG include an interpretable structure through a human-readable DFA, context-aware retrieval for responses in conversations, and plug-and-play compatibility with existing LLMs. Extensive benchmarks validate DFA-RAG's effectiveness, indicating its potential as a valuable contribution to conversational agent research.
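
The routing idea can be illustrated with a small sketch. Everything below (states, tags, example dialogues, the keyword tagger) is hypothetical; in DFA-RAG the automaton and its tags are learned from training dialogues, and the retrieved examples are passed to an actual LLM.

```python
# Toy semantic-router sketch: states, tags, examples, and the keyword
# tagger are all hypothetical; DFA-RAG learns the automaton from dialogues.
DFA = {
    "start":    {"greet": "greeting", "complain": "triage"},
    "greeting": {"complain": "triage", "thank": "close"},
    "triage":   {"refund": "refund_flow", "thank": "close"},
}
EXAMPLES = {
    "greeting":    ["User: hi\nAgent: Hello! How can I help you today?"],
    "triage":      ["User: my order is late\nAgent: Sorry about that; let me check."],
    "refund_flow": ["User: I want a refund\nAgent: I can start that for you."],
    "close":       ["User: thanks\nAgent: Happy to help!"],
}

def detect_tag(utterance):
    """Stand-in for the learned tagger: crude keyword matching."""
    keywords = {"hello": "greet", "late": "complain",
                "refund": "refund", "thank": "thank"}
    return next((t for k, t in keywords.items() if k in utterance.lower()), "greet")

def route(state, utterance):
    """Advance the DFA, then build a RAG-style prompt from the state's examples."""
    nxt = DFA.get(state, {}).get(detect_tag(utterance), state)
    demos = "\n---\n".join(EXAMPLES.get(nxt, []))
    return nxt, f"Follow these example dialogues:\n{demos}\nUser: {utterance}\nAgent:"

state, prompt = route("start", "My order is late and I want answers")
print(state)   # triage; `prompt` would be sent to the underlying LLM
```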

RIO-CPD: A Riemannian Geometric Method for Correlation-aware Online Change Point Detection

The objective of change point detection is to identify abrupt changes at potentially multiple points within a data sequence. This task is particularly challenging in the online setting, where various types of changes can occur, including shifts in both the marginal and joint distributions of the data. This paper tackles these challenges by sequentially tracking correlation matrices through their Riemannian geometry, where geodesic distances accurately capture the development of correlations. We propose Rio-CPD, a non-parametric, correlation-aware online change point detection framework that combines the Riemannian geometry of the manifold of symmetric positive definite matrices with the cumulative sum (CUSUM) statistic for detecting change points. Rio-CPD enhances CUSUM by computing the geodesic distance from present observations to the Fréchet mean of previous observations. With a careful choice of the metric with which the manifold is equipped, Rio-CPD is simple and computationally efficient. Experimental results on both synthetic and real-world datasets demonstrate that Rio-CPD outperforms existing methods in detection accuracy and efficiency.
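
A minimal sketch of the recipe, using the Log-Euclidean metric as one concrete choice (the choice of metric is something the paper treats carefully): estimate a correlation matrix per window, measure the geodesic distance from the newest matrix to the Fréchet mean of its predecessors, and accumulate those distances in a CUSUM statistic. The window size and drift term below are illustrative, not the paper's settings.

```python
# Sketch of correlation-aware online CPD with the Log-Euclidean metric;
# window size and drift term are illustrative choices.
import numpy as np
from scipy.linalg import logm, expm

def corr_spd(window, eps=1e-6):
    """Correlation matrix of a (time, dims) window, jittered to be SPD."""
    return np.corrcoef(window, rowvar=False) + eps * np.eye(window.shape[1])

def geodesic(a, b):
    """Log-Euclidean geodesic distance between two SPD matrices."""
    return np.linalg.norm(logm(a) - logm(b), "fro")

def frechet_mean(mats):
    """Frechet mean under the Log-Euclidean metric: expm of averaged logms."""
    return expm(np.mean([logm(m) for m in mats], axis=0))

rng = np.random.default_rng(0)
cov = np.full((3, 3), 0.9) + 0.1 * np.eye(3)
series = np.vstack([rng.normal(size=(200, 3)),                        # independent
                    rng.multivariate_normal(np.zeros(3), cov, 200)])  # correlated

win, drift, cusum, history = 50, 0.5, 0.0, []
for t in range(win, len(series) + 1, win):
    c = corr_spd(series[t - win:t])
    if history:
        d = geodesic(c, frechet_mean(history))
        cusum = max(0.0, cusum + d - drift)    # CUSUM on geodesic distances
        print(f"t={t}: cusum={cusum:.2f}")     # jumps after the change at t=200
    history.append(c)
```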

Knowledge-enhanced Prompt Learning for Open-domain Commonsense Reasoning

Neural language models for commonsense reasoning often formulate the problem as a QA task and make predictions based on learned representations of language after fine-tuning. However, without any fine-tuning data or pre-defined answer candidates, can neural language models still answer commonsense reasoning questions relying only on external knowledge? In this work, we investigate a unique yet challenging problem: open-domain commonsense reasoning, which aims to answer questions without any answer candidates or fine-tuning examples. Our method, proposed by a team from NECLA (NEC Laboratories America) and the NEC Digital Business Platform Unit, leverages neural language models to iteratively retrieve reasoning chains from an external knowledge base and does not require task-specific supervision. The reasoning chains help identify the most precise answer to a commonsense question, along with the corresponding knowledge statements that justify the answer choice. This technology has proven its effectiveness in a diverse array of business domains.
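
A toy sketch of iterative reasoning-chain retrieval follows; the three-entry knowledge base and the keyword-overlap scorer are hypothetical stand-ins for the real knowledge base and the neural language model scorer.

```python
# Toy iterative chain retrieval; KB triples and the keyword scorer are
# hypothetical stand-ins for the knowledge base and neural LM scorer.
KB = {
    "umbrella": [("used_for", "staying dry"), ("located_at", "closet")],
    "staying dry": [("requires", "cover from rain")],
    "rain": [("causes", "getting wet")],
}

def score(question, relation, tail):
    """Stand-in for the neural scorer: crude keyword overlap."""
    words = set(f"{relation} {tail}".replace("_", " ").split())
    return len(set(question.lower().split()) & words)

def retrieve_chain(question, seed, hops=2):
    """Greedily follow the best-scoring KB edge for a few hops."""
    chain, node = [], seed
    for _ in range(hops):
        edges = KB.get(node, [])
        if not edges:
            break
        rel, tail = max(edges, key=lambda e: score(question, *e))
        chain.append((node, rel, tail))
        node = tail
    return chain, node  # final node doubles as the answer; chain justifies it

chain, answer = retrieve_chain("what is an umbrella used for in the rain", "umbrella")
print(answer, chain)
```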

Pruning as a Domain-specific LLM Extractor

Large Language Models (LLMs) have exhibited remarkable proficiency across a wide array of NLP tasks. However, the escalation in model size also engenders substantial deployment costs. While a few efforts have explored model pruning techniques to reduce the size of LLMs, they mainly center on general or task-specific weights. This leads to suboptimal performance on domain-specific challenges, owing to a lack of either specificity for the target domain or generality across different tasks. This work introduces an innovative unstructured dual-pruning methodology, D-PRUNER, for domain-specific compression of LLMs. It extracts a compressed, domain-specific, and task-agnostic LLM by identifying LLM weights that are pivotal for general capabilities, such as linguistic capability and multi-task solving, as well as domain-specific knowledge. More specifically, we first assess general weight importance by quantifying the error incurred upon their removal with the help of an open-domain calibration dataset. Then, we utilize this general weight importance to refine the training loss so that it preserves generality when fitting to a specific domain. Moreover, by efficiently approximating weight importance with the refined training loss on a domain-specific calibration dataset, we obtain a pruned model emphasizing both generality and specificity. Our comprehensive experiments across various tasks in the healthcare and legal domains show the effectiveness of D-PRUNER in domain-specific compression. Our code is available at https://github.com/psunlpgroup/D-Pruner.
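
The dual-importance idea can be sketched on a single weight matrix. The snippet below uses a first-order |w * dL/dw| saliency on random stand-in calibration batches; it is an illustration of the general recipe, not the released D-PRUNER code.

```python
# First-order saliency sketch of dual importance on one stand-in weight
# matrix; calibration batches are random tensors, not real data.
import torch
import torch.nn as nn

model, loss_fn = nn.Linear(16, 16), nn.MSELoss()

def saliency(x, y):
    """Per-weight importance |w * dL/dw| on a calibration batch."""
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return (model.weight * model.weight.grad).abs().detach()

imp = (saliency(torch.randn(32, 16), torch.randn(32, 16))     # open-domain proxy
       + saliency(torch.randn(32, 16), torch.randn(32, 16)))  # domain proxy

sparsity = 0.5
threshold = imp.flatten().kthvalue(int(sparsity * imp.numel())).values
with torch.no_grad():
    model.weight.mul_((imp > threshold).float())  # unstructured pruning
print(f"kept {int((model.weight != 0).sum())}/{model.weight.numel()} weights")
```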

Uncertainty Quantification for In-Context Learning of Large Language Models

In-context learning has emerged as a groundbreaking ability of Large Language Models (LLMs) and has revolutionized various fields by providing a few task-relevant demonstrations in the prompt. However, trustworthiness issues with LLM responses, such as hallucination, have also been actively discussed. Existing works have been devoted to quantifying the uncertainty in LLM responses, but they often overlook the complex nature of LLMs and the uniqueness of in-context learning. In this work, we delve into the predictive uncertainty of LLMs associated with in-context learning, highlighting that such uncertainties may stem from both the provided demonstrations (aleatoric uncertainty) and ambiguities tied to the model's configurations (epistemic uncertainty). We propose a novel formulation and a corresponding estimation method to quantify both types of uncertainty. The proposed method offers an unsupervised, plug-and-play way to understand the predictions of in-context learning. Extensive experiments demonstrate the effectiveness of the decomposition. The code and data are available at: https://github.com/lingchen0331/UQ_ICL.
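
As a simplified illustration, here is the generic entropy decomposition such formulations build on, not the paper's exact estimator: average the predictive distributions obtained under different demonstration sets, then split the total entropy into an expected-entropy term and a disagreement term. The probability values are hypothetical.

```python
# Generic entropy decomposition sketch (a simplification, not the paper's
# estimator); `preds` stands in for class probabilities obtained from the
# LLM under five different demonstration sets.
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12), axis=-1)

preds = np.array([[0.70, 0.20, 0.10],
                  [0.60, 0.30, 0.10],
                  [0.80, 0.10, 0.10],
                  [0.50, 0.40, 0.10],
                  [0.65, 0.25, 0.10]])

total = entropy(preds.mean(axis=0))   # entropy of the averaged prediction
expected = entropy(preds).mean()      # average per-sample entropy
disagreement = total - expected       # mutual-information-style term
print(f"total={total:.3f} expected={expected:.3f} disagreement={disagreement:.3f}")
```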

Advancing Sustainability in Global Supply Chains through Agent-based Simulation

Today's complex global supply chains present difficulties and uncertainties that offer both challenges and opportunities for improvement, especially in terms of efficiency and sustainability. These challenges grow with unpredictable events, such as natural disasters, unexpected incidents, and unusual business practices, pushing us towards more advanced modeling methods that focus on reducing risks and enhancing sustainability. In this paper, we present a new agent-based simulation approach that goes beyond the usual limits of supply chain simulations by incorporating sustainability directly into supply chain operations using reinforcement learning (RL) algorithms. We introduce MOGI, a sustainable supply chain simulation system that takes carbon emissions into account in its main operations. Additionally, we examine how effective a multi-agent RL strategy is at dealing with the complex and uncertain nature of supply chains that span multiple levels. By comparing this strategy with traditional heuristic methods, our study looks at how well single and multiple RL agents can manage risks and improve sustainability in both the upstream and downstream parts of the supply chain. The results of our experiments show that RL-based strategies are much better than traditional methods at managing risks, making profits, and achieving sustainability goals.
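
A toy sketch of the sustainability-aware reward design, with hypothetical prices, costs, and emission factors: each period, profit is penalized by a carbon price on transport emissions, which is the kind of reward signal an RL agent would optimize. The fixed-order policy shown is just a heuristic baseline.

```python
# Toy sustainability-aware reward sketch; all numbers are hypothetical.
import random

TRANSPORT = {"air": {"cost": 5.0, "co2": 8.0}, "sea": {"cost": 1.0, "co2": 1.5}}
PRICE, CARBON_PRICE, HOLDING = 12.0, 0.8, 0.3

def step(inventory, order_qty, mode):
    """One period: receive an order, meet random demand, score the outcome."""
    demand = random.randint(20, 60)
    inventory += order_qty
    sold = min(inventory, demand)
    inventory -= sold
    t = TRANSPORT[mode]
    profit = PRICE * sold - t["cost"] * order_qty - HOLDING * inventory
    emissions = t["co2"] * order_qty
    return inventory, profit - CARBON_PRICE * emissions  # emissions-penalized reward

random.seed(1)
inv, ret = 0, 0.0
for _ in range(10):
    inv, r = step(inv, order_qty=40, mode="sea")  # heuristic baseline policy
    ret += r
print(f"10-period return: {ret:.1f}")  # an RL agent would tune qty and mode
```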

MULAN: Multi-modal Causal Structure Learning and Root Cause Analysis for Microservice Systems

Effective root cause analysis (RCA) is vital for swiftly restoring services, minimizing losses, and ensuring the smooth operation and management of complex systems. Previous data-driven RCA methods, particularly those employing causal discovery techniques, have primarily focused on constructing dependency or causal graphs for backtracking the root causes. However, these methods often fall short as they rely solely on data from a single modality, thereby resulting in suboptimal solutions. In this work, we propose Mulan, a unified multi-modal causal structure learning method designed to identify root causes in microservice systems. We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data. To explore intricate relationships across different modalities, we propose a contrastive learning-based approach to extract modality-invariant and modality-specific representations within a shared latent space. Additionally, we introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph. Finally, we employ random walk with restart to simulate system fault propagation and identify potential root causes. Extensive experiments on three real-world datasets validate the effectiveness of our proposed method.
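
The final step can be sketched compactly. Given a learned causal adjacency matrix (hard-coded and hypothetical below), a random walk with restart from the anomalous KPI node, run on the reversed graph, concentrates probability mass on likely root causes.

```python
# RWR sketch on a hypothetical 3-node causal graph; MULAN learns the
# adjacency from multi-modal data, which we hard-code here.
import numpy as np

A = np.array([[0.0, 0.9, 0.0],    # A[i, j]: causal edge strength i -> j
              [0.0, 0.0, 0.8],    # the anomalous KPI sits on node 2
              [0.1, 0.0, 0.0]])

def rwr_scores(adj, anomalous, restart=0.3, iters=100):
    """Random walk with restart on the reversed graph (effect -> cause)."""
    W = adj.T / np.maximum(adj.T.sum(axis=1, keepdims=True), 1e-12)
    r = np.zeros(adj.shape[0]); r[anomalous] = 1.0   # restart at the KPI node
    s = r.copy()
    for _ in range(iters):
        s = (1 - restart) * (W.T @ s) + restart * r
    return s

scores = rwr_scores(A, anomalous=2)
# Excluding the anomalous node itself, node 1 (its direct upstream cause)
# receives the most probability mass and is ranked the top root cause.
print(scores)
```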

DNA-GPT: Divergent N-Gram Analysis for Training-Free Detection of GPT-Generated Text

Large language models (LLMs) have notably enhanced the fluency and diversity of machine-generated text. However, this progress also presents a significant challenge in detecting the origin of a given text, and current research on detection methods lags behind the rapid evolution of LLMs. Conventional training-based methods have limited flexibility, particularly when adapting to new domains, and they often lack explanatory power. To address this gap, we propose a novel training-free detection strategy called Divergent N-Gram Analysis (DNA-GPT). Given a text, we first truncate it in the middle and then use only the preceding portion as input to the LLM to regenerate the remaining part. By comparing the original and regenerated remaining parts through N-gram analysis in the black-box setting or probability divergence in the white-box setting, we can clearly illustrate significant discrepancies between machine-generated and human-written text. We conducted extensive experiments on the most advanced LLMs from OpenAI, including text-davinci-003, GPT-3.5-turbo, and GPT-4, as well as open-source models such as GPT-NeoX-20B and LLaMa-13B. Results show that our zero-shot approach exhibits state-of-the-art performance in distinguishing between human-written and GPT-generated text on four English and one German dataset, outperforming OpenAI's own classifier, which is trained on millions of texts. Additionally, our method provides reasonable explanations and evidence to support its claims, a unique feature of explainable detection. Our method is also robust to revised-text attacks and can additionally perform model sourcing.
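
A black-box scoring sketch follows, with the LLM regeneration step stubbed out: split the text in the middle and measure how many n-grams of the true continuation reappear in regenerated continuations. Machine-generated text tends to overlap its own regenerations far more than human-written text does.

```python
# Black-box DNA-GPT scoring sketch; `regens` stands in for continuations
# the LLM would regenerate from the first half of the text.
def ngrams(text, n):
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def dna_score(original, regens, ns=(3, 4, 5)):
    """Average fraction of true-continuation n-grams found in regenerations."""
    toks = original.split()
    true_tail = " ".join(toks[len(toks) // 2:])   # truncate in the middle
    score = 0.0
    for n in ns:
        ref = ngrams(true_tail, n)
        if ref:
            score += sum(len(ngrams(r, n) & ref) / len(ref) for r in regens)
    return score / (len(ns) * len(regens))

original = ("the model was trained on a large corpus "
            "and evaluated on five benchmarks")
regens = ["a large corpus and evaluated on five benchmarks",
          "a web-scale corpus and tested on five benchmarks"]
print(f"overlap score: {dna_score(original, regens):.3f}")  # higher => machine
```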

Improving Open Information Extraction with Large Language Models: A Study on Demonstration Uncertainty

The Open Information Extraction (OIE) task aims to extract structured facts from unstructured text, typically in the form of (subject, relation, object) triples. Despite the potential of large language models (LLMs) like ChatGPT as general task solvers, they lag behind state-of-the-art (supervised) methods on OIE tasks due to two key issues. First, LLMs struggle to distinguish irrelevant context from relevant relations and to generate structured output, owing to restrictions on fine-tuning the model. Second, LLMs generate responses autoregressively based on probability, which leaves the predicted relations without reliable confidence estimates. In this paper, we assess the capabilities of LLMs for improving the OIE task. In particular, we propose various in-context learning strategies to enhance LLMs' instruction-following ability, along with a demonstration uncertainty quantification module to improve the confidence of the generated relations. Our experiments on three OIE benchmark datasets show that our approach holds its own against established supervised methods, both quantitatively and qualitatively.
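
A sketch of the uncertainty idea as we read it, with the LLM calls stubbed by hypothetical outputs: extract triples several times under perturbed demonstration sets, then score each triple by how often it recurs across the samples.

```python
# Hypothetical sketch of demonstration-based confidence; `samples` stands
# in for triple sets extracted by the LLM under four perturbed
# demonstration sets.
from collections import Counter

samples = [
    [("Curie", "won", "Nobel Prize"), ("Curie", "born_in", "Warsaw")],
    [("Curie", "won", "Nobel Prize")],
    [("Curie", "won", "Nobel Prize"), ("Curie", "born_in", "Warsaw")],
    [("Curie", "won", "Nobel Prize"), ("Curie", "worked_at", "Sorbonne")],
]

counts = Counter(t for sample in samples for t in set(sample))
confidence = {t: c / len(samples) for t, c in counts.items()}
for triple, conf in sorted(confidence.items(), key=lambda kv: -kv[1]):
    print(f"{conf:.2f}  {triple}")   # stable triples earn high confidence
```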