Zhengzhang Chen NEC Labs America

Zhengzhang Chen is a Senior Researcher in the Data Science and System Security Department at NEC Laboratories America in Princeton, NJ. He received his PhD in Computer Science from North Carolina State University.

Dr. Chen’s research focuses on machine learning for dynamic and complex systems, with expertise spanning anomaly detection, causal discovery, multimodal data analysis, and trustworthy AI. He develops algorithms that integrate time-series, log, graph, and textual data to uncover hidden dependencies, identify root causes, and detect out-of-distribution behaviors in evolving networks. His contributions address critical challenges in monitoring microservices, IoT, and enterprise IT systems, ensuring the reliable and interpretable deployment of AI in real-world settings. Dr. Chen has published over 80 papers in premier venues, including NeurIPS, ICML, KDD, ICLR, WWW, and AAAI, and holds more than 40 patents that advance anomaly detection and causal modeling.

His recent projects at NEC Labs focus on AI for IT operations (AIOps), robust graph learning, and safe AI design, contributing both theoretical advances and practical tools that strengthen the resilience and trustworthiness of modern digital infrastructure.

Posts

Online Multi-modal Root Cause Identification in Microservice Systems

Root Cause Analysis (RCA) is essential for pinpointing the root causes of failures in microservice systems. Traditional data-driven RCA methods are typically limited to offline applications due to high computational demands, and existing online RCA methods handle only single-modal data, overlooking complex interactions in multi-modal systems. In this paper, we introduce OCEAN, a novel online multi-modal causal structure learning method for root cause localization. OCEAN introduces a long-term temporal causal learning module with two encoders: one captures stable causal dependencies from historical data, while the other models short-term variations in the current batch data. We further design a multi-factor attention mechanism to analyze and reassess the relationships among different metrics and log indicators/attributes for enhanced online causal graph learning. Additionally, a contrastive mutual information maximization-based graph fusion module is developed to effectively model the relationships across various modalities. Extensive experiments on three real-world datasets demonstrate the effectiveness and efficiency of our proposed method.

SolverLLM: Leveraging Test-Time Scaling for Optimization Problem via LLM-Guided Search

Large Language Models (LLMs) offer promising capabilities for tackling complex reasoning tasks, including optimization problems. However, existing methods either rely on prompt engineering, which leads to poor generalization across problem types, or require costly supervised training. We introduce SolverLLM, a training-free framework that leverages test-time scaling to solve diverse optimization problems. Rather than solving directly, SolverLLM generates mathematical formulations and translates them into solver-ready code, guided by a novel Monte Carlo Tree Search (MCTS) strategy. To enhance the search process, we modify classical MCTS with (1) dynamic expansion for adaptive formulation generation, (2) prompt backpropagation to guide exploration via outcome-driven feedback, and (3) uncertainty backpropagation to incorporate reward reliability into decision-making. Experiments on six standard benchmark datasets demonstrate that SolverLLM outperforms both prompt-based and learning-based baselines, achieving strong generalization without additional training.

Correlation-aware Online Change Point Detection

Change point detection aims to identify abrupt shifts occurring at multiple points within a data sequence. This task becomes particularly challenging in the online setting, where different types of change can occur, including shifts in both the marginal and joint distributions of the data. In this paper, we address these challenges by tracking the Riemannian geometry of correlation matrices, allowing Riemannian metrics to compute the geodesic distance as an accurate measure of correlation dynamics. We introduce Rio-CPD, a correlation-aware online change point detection framework that integrates the Riemannian geometry of the manifold of symmetric positive definite matrices with the cumulative sum (CUSUM) statistic for detecting change points. Rio-CPD employs a novel CUSUM design by computing the geodesic distance between current observations and the Fréchet mean of prior observations. With appropriate choices of Riemannian metrics, Rio-CPD offers a simple yet effective and computationally efficient algorithm. We also provide a theoretical analysis on standard metrics for change point detection within Rio-CPD. Experimental results on both synthetic and real-world datasets demonstrate that Rio-CPD outperforms existing methods on detection accuracy, average detection delay, and efficiency.
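To make the idea of a geodesic-distance CUSUM concrete, here is a minimal sketch in Python. It is not the Rio-CPD implementation: it assumes the Log-Euclidean metric (under which the Fréchet mean is the matrix exponential of the average matrix logarithm), non-overlapping windows, and hand-picked `drift` and `threshold` values. All function names and parameters are hypothetical.

```python
import numpy as np


def spd_log(mat):
    """Matrix logarithm of a symmetric positive definite (SPD) matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * np.log(vals)) @ vecs.T


def window_log_correlation(window, eps=1e-3):
    """Log of the slightly regularized correlation matrix of one window."""
    corr = np.corrcoef(window, rowvar=False)
    corr = (1.0 - eps) * corr + eps * np.eye(corr.shape[0])  # keep it strictly SPD
    return spd_log(corr)


def detect_change_points(series, window=50, drift=1.5, threshold=1.0):
    """Return start indices of windows where the CUSUM statistic crosses the threshold.

    drift and threshold are illustrative values chosen for the toy data below.
    """
    change_points = []
    mean_log, count, cusum = None, 0, 0.0
    for start in range(0, len(series) - window + 1, window):
        log_corr = window_log_correlation(series[start:start + window])
        if mean_log is None:
            mean_log, count = log_corr, 1
            continue
        # Under the Log-Euclidean metric, the geodesic distance to the Frechet
        # mean is a Frobenius norm between matrix logarithms.
        dist = np.linalg.norm(log_corr - mean_log, ord="fro")
        cusum = max(0.0, cusum + dist - drift)
        if cusum > threshold:
            change_points.append(start)
            mean_log, count, cusum = log_corr, 1, 0.0  # restart after a detection
        else:
            count += 1
            mean_log = mean_log + (log_corr - mean_log) / count  # running mean of logs
    return change_points


# Toy data: the marginals stay standard normal, only the correlation changes at t = 500.
rng = np.random.default_rng(0)
independent = rng.standard_normal((500, 3))
shared = rng.standard_normal((500, 1))
correlated = np.sqrt(0.9) * shared + np.sqrt(0.1) * rng.standard_normal((500, 3))
series = np.vstack([independent, correlated])

change_points = detect_change_points(series)
print(change_points)
```

Because each variable keeps zero mean and unit variance throughout, a detector that looks only at marginal distributions would have nothing to react to here; the shift is visible only in the correlation matrices.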

Domain Specialization as the Key to Make Large Language Models Disruptive: A Comprehensive Survey

Large language models (LLMs) have significantly advanced the field of natural language processing (NLP), providing a highly useful, task-agnostic foundation for a wide range of applications. However, directly applying LLMs to solve sophisticated problems in specific domains faces many hurdles, caused by the heterogeneity of domain data, the sophistication of domain knowledge, the uniqueness of domain objectives, and the diversity of the constraints (e.g., various social norms, cultural conformity, religious beliefs, and ethical standards in the domain applications). Domain specification techniques are key to making large language models disruptive in many applications. Specifically, to overcome these hurdles, there has been a notable increase in research and practices conducted in recent years on the domain specialization of LLMs. This emerging field of study, with its substantial potential for impact, necessitates a comprehensive and systematic review to better summarize and guide ongoing work in this area. In this article, we present a comprehensive survey on domain specification techniques for large language models, an emerging direction critical for large language model applications. First, we propose a systematic taxonomy that categorizes the LLM domain-specialization techniques based on the accessibility to LLMs and summarizes the framework for all the subcategories as well as their relations and differences to each other. Second, we present an extensive taxonomy of critical application domains that can benefit dramatically from specialized LLMs, discussing their practical significance and open challenges. Last, we offer our insights into the current research status and future trends in this area.

ICeTEA: Mixture of Detectors for Metric-Log Anomaly Detection

Anomaly detection is essential for identifying unusual system behaviors and has wide-ranging applications, from fraud detection to system monitoring. In web servers, anomalies are typically detected using two types of data: metrics (numerical indicators of performance) and logs (records of system events). While correlations between metrics and logs in real-world scenarios highlight the need for joint analysis, which is termed the “metric-log anomaly detection” problem, it has not been fully explored yet due to inherent differences between metrics and logs. In this paper, we propose ICeTEA, a novel system for metric-log anomaly detection that integrates three detectors: a metric-log detector based on a multimodal Variational Autoencoder (VAE), and two individual metric and log detectors. By leveraging the ensemble technique to combine outputs of these detectors, ICeTEA enhances the effectiveness and robustness of metric-log anomaly detection. Case studies demonstrate two key functionalities of ICeTEA: data visualization and rankings of contributions to anomaly scores. Experiments demonstrate that our proposed ICeTEA accurately detects true anomalies while significantly reducing false positives.

Exploring Multi-Modal Data with Tool-Augmented LLM Agents for Precise Causal Discovery

Causal discovery is an imperative foundation for decision-making across domains, such as smart health, AI for drug discovery, and AIOps. Traditional statistical causal discovery methods, while well-established, predominantly rely on observational data and often overlook the semantic cues inherent in cause-and-effect relationships. The advent of Large Language Models (LLMs) has ushered in an affordable way of leveraging the semantic cues for knowledge-driven causal discovery, but the development of LLMs for causal discovery lags behind other areas, particularly in the exploration of multimodal data. To bridge the gap, we introduce MATMCD, a multi-agent system powered by tool-augmented LLMs. MATMCD has two key agents: a Data Augmentation agent that retrieves and processes modality-augmented data, and a Causal Constraint agent that integrates multi-modal data for knowledge-driven reasoning. The proposed design of the inner workings ensures successful cooperation of the agents. Our empirical study across seven datasets suggests the significant potential of multi-modality enhanced causal discovery.

Beyond the Permutation Symmetry of Transformers: The Role of Rotation for Model Fusion

Symmetry in the parameter space of deep neural networks (DNNs) has proven beneficial for various deep learning applications. A well-known example is the permutation symmetry in Multi-Layer Perceptrons (MLPs), where permuting the rows of weight matrices in one layer and applying the inverse permutation to adjacent layers yields a functionally equivalent model. While permutation symmetry fully characterizes the equivalence set for MLPs, its discrete nature limits its utility for transformers. In this paper, we introduce rotation symmetry, a novel form of parameter space symmetry for transformers that generalizes permutation symmetry by rotating parameter matrices in self-attention layers. Unlike permutation symmetry, rotation symmetry operates in a continuous domain, thereby significantly expanding the equivalence set for transformers. Based on this property, we propose a theoretically optimal parameter matching algorithm as a plug-and-play module to enhance model fusion. We evaluate our approach using pre-trained transformers across diverse natural language and vision tasks. Experimental results demonstrate that our rotation symmetry based matching algorithm substantially improves model fusion, highlighting the potential of parameter space symmetry to facilitate model fusion. Our code is available on https://github.com/zhengzaiyi/RotationSymmetry.
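The two symmetries described above can be checked numerically. The sketch below is an illustration of the general idea only, not the paper's matching algorithm: it verifies that permuting the hidden units of a small ReLU MLP leaves its output unchanged, and that applying the same rotation to hypothetical query and key projection matrices leaves the attention scores unchanged. All shapes and variable names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Permutation symmetry in a two-layer MLP: y = W2 @ relu(W1 @ x + b1)
d_in, d_hidden, d_out = 4, 6, 3
W1 = rng.standard_normal((d_hidden, d_in))
b1 = rng.standard_normal(d_hidden)
W2 = rng.standard_normal((d_out, d_hidden))
x = rng.standard_normal(d_in)


def mlp(W1, b1, W2, x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0)


P = np.eye(d_hidden)[rng.permutation(d_hidden)]  # permutation matrix
# Permute the rows of layer 1 and apply the inverse permutation (P.T) to layer 2.
mlp_same = np.allclose(mlp(W1, b1, W2, x), mlp(P @ W1, P @ b1, W2 @ P.T, x))

# Rotation symmetry in attention scores: (X Wq R)(X Wk R)^T = (X Wq)(X Wk)^T
n_tokens, d_model, d_head = 5, 8, 4
X = rng.standard_normal((n_tokens, d_model))
Wq = rng.standard_normal((d_model, d_head))
Wk = rng.standard_normal((d_model, d_head))

R, _ = np.linalg.qr(rng.standard_normal((d_head, d_head)))  # orthogonal matrix
if np.linalg.det(R) < 0:
    R[:, 0] *= -1.0  # flip one column so R is a proper rotation (det = +1)

scores = (X @ Wq) @ (X @ Wk).T
rotated_scores = (X @ Wq @ R) @ (X @ Wk @ R).T
attn_same = np.allclose(scores, rotated_scores)

print(mlp_same, attn_same)
```

The permutation works for the MLP because ReLU acts elementwise and so commutes with reordering the hidden units. The rotation works for the query-key product because `R @ R.T` is the identity, which holds for any orthogonal `R` and therefore over a continuous family of transformations.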

Evidence-Based Out-of-Distribution Detection on Multi-Label Graphs

The Out-of-Distribution (OOD) problem in graph-structured data is becoming increasingly important in various areas of research and applications, including social network recommendation [36], protein function detection [9, 21], etc. Furthermore, owing to the inherent multi-label properties of nodes, multi-label OOD detection remains more challenging than in multi-class scenarios. A lack of uncertainty modeling in multi-label classification methods prevents the separation of OOD nodes from in-distribution (ID) nodes. Existing uncertainty-based OOD detection methods on graphs are not applicable for multi-label scenarios because they are designed for multi-class settings. Therefore, node-level OOD detection on multi-label graphs becomes desirable but rarely touched. In this paper, we propose a novel Evidence-Based Out-of-Distribution Detection method on multi-label graphs. The evidence for multiple labels, which indicates the amount of support to suggest that a sample should be classified into a specific class, is predicted by Multi-Label Evidential Graph Neural Networks (ML-EGNNs). The joint belief is designed for multi-label opinions fusion by a comultiplication operator. Additionally, we introduce a Kernel-based Node Positive Evidence Estimation (KNPE) method to reduce errors in quantifying positive evidence. Experimental results prove both the effectiveness and efficiency of our model for multi-label OOD detection on 7 multi-label benchmarks.

MixLLM: Dynamic Routing in Mixed Large Language Models

Large Language Models (LLMs) have recently shown potential for artificial general intelligence; however, their usage is costly and incurs high response latency. Given mixed LLMs with their own strengths and weaknesses, LLM routing aims to identify the most suitable model for each query in the stream to maximize response quality and minimize cost and latency. However, the challenges involve: (1) dynamic trade-offs among quality, cost, and latency; (2) enabling continual learning in deployed systems; and (3) navigating a varying (e.g., new LLM addition or old LLM removal) set of LLM candidates over time. To bridge these gaps, we develop MixLLM, a dynamic contextual-bandit-based routing system for query-LLM assignment. Specifically, we first leverage query tags to enhance query embeddings for the routing task. Next, we design lightweight prediction models to estimate the response qualities and costs of queries over LLMs. We then devise a meta-decision maker to choose the query-LLM assignments that best trade off response quality, cost, and latency. Finally, the system benefits from continual training, allowing it to adapt to evolving queries and user feedback over time. Our extensive experiments show that MixLLM achieves the best trade-offs in response quality, cost, and latency (97.25% of GPT-4’s quality at 24.18% of the cost under the time constraint).
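As a rough illustration of what a quality, cost, and latency trade-off can look like in a router, the following toy sketch scores each candidate model from predicted values and adds a UCB-style exploration bonus. It is not MixLLM's meta-decision maker: the scoring rule, the `Estimate` fields, the weights, and the model names are all hypothetical, and the predictions are hard-coded rather than learned.

```python
import math
from dataclasses import dataclass


@dataclass
class Estimate:
    quality: float  # predicted response quality in [0, 1]
    cost: float     # predicted cost per query (arbitrary units)
    latency: float  # predicted latency in seconds
    pulls: int = 1  # how often this model has been selected so far


def route(estimates, cost_weight=0.5, max_latency=2.0, explore=0.1):
    """Pick the candidate with the best quality-minus-cost score within the latency budget."""
    total = sum(e.pulls for e in estimates.values())
    best_name, best_score = None, -math.inf
    for name, e in estimates.items():
        if e.latency > max_latency:
            continue  # violates the latency constraint
        bonus = explore * math.sqrt(math.log(total + 1) / e.pulls)  # favor rarely tried models
        score = e.quality - cost_weight * e.cost + bonus
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical candidates with made-up predictions for one query.
candidates = {
    "large": Estimate(quality=0.95, cost=1.0, latency=1.5, pulls=10),
    "small": Estimate(quality=0.80, cost=0.1, latency=0.4, pulls=10),
    "slow": Estimate(quality=0.99, cost=0.2, latency=5.0, pulls=10),
}

cost_sensitive_choice = route(candidates, cost_weight=0.5)
quality_only_choice = route(candidates, cost_weight=0.0)

# The candidate set can change over time: remove a model and route again.
del candidates["small"]
choice_after_removal = route(candidates, cost_weight=0.5)

print(cost_sensitive_choice, quality_only_choice, choice_after_removal)
```

Keeping the candidates in a plain dictionary is one simple way to let models be added or removed between queries without changing the routing function.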

Graph Neural Networks, Explained: Our Role in the Future of AI

NEC Laboratories America (NECLA) is advancing the frontier of Graph Neural Networks (GNNs), a transformative AI technology that processes complex, interconnected data. Through innovations like PTDNet for robust learning, novel frameworks for explainability, StrGNN for anomaly detection in dynamic graphs, and GERDQ for calibration with out-of-distribution nodes, NECLA is addressing critical challenges in GNN development. These breakthroughs have real-world implications in fields such as cybersecurity, bioinformatics, and recommendation systems, positioning NECLA as a leader in the evolution of graph-based AI.