Martin Min, NEC Labs America

Martin Renqiang Min is the Head of the Machine Learning Department at NEC Laboratories America. He holds a Ph.D. in Computer Science from the University of Toronto and completed postdoctoral research at Yale University, where he also taught courses on deep learning. His research has been published in top venues, including Nature, NeurIPS, ICML, ICLR, CVPR, and ACL, and his innovations have been recognized internationally, with features in Science and MIT Technology Review.

At NEC, Dr. Min directs a multidisciplinary research team at the forefront of foundational and applied artificial intelligence. His portfolio spans deep learning, natural language understanding, multimodal learning, visual reasoning, and the application of machine learning to biomedical and healthcare data. He has contributed to the design of scalable learning systems that power real-world applications, bridging cutting-edge theory with industry-scale deployment. He also co-chaired the NeurIPS Workshop on Machine Learning in Computational Biology, advancing the dialogue between AI and life sciences.

Under his leadership, NEC’s machine learning group drives innovation across multiple domains, including AI for precision medicine, next-generation language modeling, and interpretable multimodal systems. He is known for fostering interdisciplinary collaboration—both within NEC and with academic and industry partners—encouraging research that connects scientific breakthroughs with societal impact. His team’s contributions extend to core technologies used across telecommunications, enterprise solutions, and healthcare, positioning NEC at the leading edge of applied AI. Dr. Min is recognized for his ability to identify emerging trends in machine learning and translate them into long-term research roadmaps. His work continues to influence the global AI community, and his leadership ensures NEC remains a hub for transformative research that combines fundamental discovery with practical applications that improve people’s lives.

Posts

Reducing Hallucinations of Medical Multimodal Large Language Models with Visual Retrieval-Augmented Generation

Multimodal Large Language Models (MLLMs) have shown impressive performance in vision and text tasks. However, hallucination remains a major challenge, especially in fields like healthcare where details are critical. In this work, we show how MLLMs can be enhanced to support Visual RAG (V-RAG), a retrieval-augmented generation framework that incorporates both text and visual data from retrieved images. On the MIMIC-CXR chest X-ray report generation and Multicare medical image caption generation datasets, we show that V-RAG improves the accuracy of entity probing, which asks whether a medical entity is grounded by an image. We show that the improvements extend to both frequent and rare entities, the latter of which may have less positive training data. Downstream, we apply V-RAG with entity probing to correct hallucinations and generate more clinically accurate X-ray reports, obtaining a higher RadGraph-F1 score.
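To make the probing step concrete, here is a minimal sketch of V-RAG-style entity probing and report correction. The retrieval index, the MLLM interface (`mllm.generate`, `index.search`), and the prompts are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of V-RAG-style entity probing; `mllm` and `index`
# stand in for an unspecified multimodal LLM and image retrieval index.
from typing import List

def retrieve_similar_images(query_image, index, k: int = 5) -> List[dict]:
    """Return the k most similar reference images with their reports (assumed API)."""
    return index.search(query_image, k=k)

def probe_entity(mllm, query_image, entity: str, retrieved: List[dict]) -> bool:
    """Ask the MLLM whether `entity` is grounded in the query image,
    conditioning on retrieved images and their report text."""
    context_text = "\n".join(r["report"] for r in retrieved)
    prompt = (
        f"Reference reports:\n{context_text}\n"
        f"Question: Is the finding '{entity}' present in the query image? Answer yes or no."
    )
    images = [query_image] + [r["image"] for r in retrieved]
    answer = mllm.generate(images=images, prompt=prompt)
    return answer.strip().lower().startswith("yes")

def correct_report(mllm, index, query_image, draft_report: str, entities: List[str]) -> str:
    """Keep only draft findings that survive the probe (simplified correction step)."""
    retrieved = retrieve_similar_images(query_image, index)
    kept = [e for e in entities if probe_entity(mllm, query_image, e, retrieved)]
    prompt = f"Rewrite the report, keeping only these findings: {kept}\n\n{draft_report}"
    return mllm.generate(images=[query_image], prompt=prompt)
```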

Discrete-Continuous Variational Optimization with Local Gradients

Variational optimization (VO) offers a general approach for handling objectives that may involve discontinuities, or whose gradients are difficult to calculate. By introducing a variational distribution over the parameter space, such objectives are smoothed and rendered amenable to VO methods. In certain problems, however, local gradient information may be available, and it is neglected by such an approach. We therefore consider a general method for incorporating local information via an augmented VO objective function to accelerate convergence and improve accuracy. We show how our augmented objective can be viewed as an instance of multilevel optimization. Finally, we show that our method can train a genetic algorithm simulator, using a recursive Wasserstein distance objective.
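As a toy illustration of the idea, the sketch below mixes a score-function (REINFORCE) gradient, which tolerates discontinuities, with a local gradient of the smooth part under a Gaussian variational distribution. The mixing weight `alpha` and this particular combination are simplifying assumptions, not the paper's exact augmented objective.

```python
# Toy variational optimization with an augmented local-gradient term,
# using a Gaussian q_theta = N(mu, sigma^2 I).
import numpy as np

def f(x):
    # Example objective: smooth quadratic plus a discontinuous step.
    return np.sum(x**2) + (x[0] > 1.0)

def grad_f_smooth(x):
    # Local gradient of the differentiable part only.
    return 2 * x

def vo_step(mu, sigma, lr=0.05, n_samples=64, alpha=0.5, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    eps = rng.standard_normal((n_samples, mu.size))
    xs = mu + sigma * eps
    fs = np.array([f(x) for x in xs])
    # Score-function (REINFORCE) estimator: handles the discontinuity.
    score_grad = ((fs - fs.mean())[:, None] * eps).mean(axis=0) / sigma
    # Local-gradient estimator for the smooth part (Bonnet's identity).
    local_grad = np.mean([grad_f_smooth(x) for x in xs], axis=0)
    return mu - lr * ((1 - alpha) * score_grad + alpha * local_grad)

mu = np.array([3.0, -2.0])
for _ in range(200):
    mu = vo_step(mu, sigma=0.3)
print(mu)  # should approach the minimizer near the origin
```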

Understanding Transcriptional Regulatory Redundancy by Learnable Global Subset Perturbations

Transcriptional regulation through cis-regulatory elements (CREs) is crucial for numerous biological functions, and its disruption can lead to various diseases. These CREs often exhibit redundancy, allowing them to compensate for each other in response to external disturbances, which highlights the need for methods that identify sets of CREs that collaboratively regulate gene expression. To address this, we introduce GRIDS, an in silico computational method that approaches the task as a global feature explanation challenge, dissecting combinatorial CRE effects in two phases. First, GRIDS constructs a differentiable surrogate function to mirror the complex gene regulatory process, facilitating cross-translations between single-cell modalities. It then employs learnable perturbations within a state transition framework to offer global explanations, efficiently navigating the combinatorial feature landscape. Through comprehensive benchmarks, GRIDS demonstrates superior explanatory capabilities compared to other leading methods. Moreover, GRIDS's global explanations reveal intricate regulatory redundancy across cell types and states, underscoring its potential to advance our understanding of cellular regulation in biological research.
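A rough PyTorch sketch of the learnable-perturbation phase, under the assumption that a trained differentiable surrogate maps CRE accessibility to gene expression; the zeroing perturbation and the sparsity penalty are illustrative choices, not GRIDS's exact formulation.

```python
# Hypothetical learnable global subset perturbation over CREs.
import torch

def learn_global_mask(surrogate, X, gene_idx, sparsity=0.01, steps=500, lr=0.05):
    """X: (cells, n_cres) accessibility matrix; returns per-CRE selection scores."""
    logits = torch.zeros(X.shape[1], requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    base = surrogate(X)[:, gene_idx].detach()   # unperturbed expression
    for _ in range(steps):
        mask = torch.sigmoid(logits)            # soft membership in the subset
        perturbed = X * (1 - mask)              # zero out selected CREs
        pred = surrogate(perturbed)[:, gene_idx]
        # Maximize the expression change while penalizing large subsets, so
        # redundant CREs must be perturbed together to reveal an effect.
        loss = -(base - pred).abs().mean() + sparsity * mask.sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(logits).detach()
```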

Variational Methods for Learning Multilevel Genetic Algorithms Using the Kantorovich Monad

Levels of selection and multilevel evolutionary processes are essential concepts in evolutionary theory, yet there is a lack of common mathematical models for these core ideas. Here, we propose a unified mathematical framework for formulating and optimizing multilevel evolutionary processes and genetic algorithms over arbitrarily many levels, based on concepts from category theory and population genetics. We formulate a multilevel version of the Wright-Fisher process using this approach, and we show that this model can be analyzed to clarify key features of multilevel selection. In particular, we derive an extended multilevel probabilistic version of Price's Equation via the Kantorovich Monad, and we use this to characterize regimes of parameter space within which selection acts antagonistically or cooperatively across levels. Finally, we show how our framework provides a unified setting for learning genetic algorithms (GAs), and how Variational Optimization and a multilevel analogue of coalescent analysis can be used to fit multilevel GAs to simulated data.
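For orientation, the classical two-level Price decomposition, which the paper's Kantorovich-monad construction generalizes to a probabilistic, arbitrarily-many-level setting, can be written as follows; antagonistic versus cooperative selection across levels corresponds to the between-group and within-group terms taking opposite or matching signs.

```latex
% Classical two-level Price equation (deterministic special case).
% w_k, \bar{z}_k: mean fitness and mean trait of group k;
% w_{ik}, z_{ik}: fitness and trait of individual i in group k.
\bar{w}\,\Delta\bar{z}
  = \underbrace{\operatorname{Cov}_k\!\left(w_k,\,\bar{z}_k\right)}_{\text{between-group selection}}
  + \underbrace{\operatorname{E}_k\!\left[\operatorname{Cov}_{i\in k}\!\left(w_{ik},\,z_{ik}\right)\right]}_{\text{within-group selection}}
  + \operatorname{E}\!\left[w\,\Delta z\right]
```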

Exploring the Role of Reasoning Structures for Constructing Proofs in Multi-Step Natural Language Reasoning with Large Language Models

When performing complex multi-step reasoning tasks, the ability of Large Language Models (LLMs) to derive structured intermediate proof steps is important both for ensuring that the models truly perform the desired reasoning and for improving the models' explainability. This paper centers on a focused study: whether current state-of-the-art generalist LLMs can leverage the structures in a few examples to better construct proof structures with in-context learning. Our study specifically focuses on structure-aware demonstration and structure-aware pruning, and we demonstrate that both help improve performance. A detailed analysis is provided to help understand the results.
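As a sketch of what a structure-aware demonstration might look like (the paper's exact format is not reproduced here), each in-context example can expose its proof structure explicitly, with every conclusion referencing the premises and steps it depends on:

```python
# Illustrative structure-aware demonstration builder; the labeling scheme
# (P1/S1 identifiers, "from" references) is an assumption for illustration.
def format_demo(premises, steps):
    """steps: list of (conclusion, [ids of supporting premises/steps])."""
    lines = [f"P{i + 1}: {p}" for i, p in enumerate(premises)]
    for j, (conclusion, supports) in enumerate(steps):
        lines.append(f"S{j + 1} (from {' & '.join(supports)}): {conclusion}")
    return "\n".join(lines)

demo = format_demo(
    premises=["All birds have wings.", "A sparrow is a bird."],
    steps=[("A sparrow has wings.", ["P1", "P2"])],
)
prompt = demo + "\n\nNow construct the proof for the new problem:\n..."
```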

Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment

Learning to localize temporal boundaries of procedure steps in instructional videos is challenging due to the limited availability of annotated large-scale training videos. Recent works focus on learning the cross-modal alignment between video segments and ASR-transcribed narration texts through contrastive learning. However, these methods fail to account for alignment noise, i.e., narrations irrelevant to the instructional task in videos and unreliable timestamps in narrations. To address these challenges, this work proposes a novel training framework. Motivated by the strong capabilities of Large Language Models (LLMs) in procedure understanding and text summarization, we first apply an LLM to filter out task-irrelevant information and summarize task-related procedure steps (LLM-steps) from narrations. To further generate reliable pseudo-matching between the LLM-steps and the video for training, we propose the Multi-Pathway Text-Video Alignment (MPTVA) strategy. The key idea is to measure the alignment between LLM-steps and videos via multiple pathways: (1) step-narration-video alignment using narration timestamps, (2) direct step-to-video alignment based on long-term semantic similarity, and (3) direct step-to-video alignment focusing on short-term fine-grained semantic similarity learned from general video domains. The results from the different pathways are fused to generate reliable pseudo step-video matching. We conducted extensive experiments across various tasks and problem settings to evaluate the proposed method. Our approach surpasses state-of-the-art methods on three downstream tasks, improving procedure step grounding, step localization, and narration grounding by 5.9%, 3.1%, and 2.8%, respectively.
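The fusion step can be pictured with a small sketch: given the three pathway score matrices described above, average them and threshold to keep only reliable pseudo-matches. The mean fusion rule and the threshold value are simplifying assumptions, not MPTVA's exact mechanism.

```python
# Illustrative fusion of multi-pathway alignment scores into pseudo-labels.
import numpy as np

def fuse_pathways(align_narr, align_longterm, align_shortterm, thresh=0.6):
    """Each input: (n_steps, n_segments) alignment score matrix in [0, 1]."""
    fused = (align_narr + align_longterm + align_shortterm) / 3.0
    pseudo_labels = (fused >= thresh).astype(np.float32)  # keep reliable matches only
    return fused, pseudo_labels
```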

Predicting Spatially Resolved Gene Expression via Tissue Morphology using Adaptive Spatial GNNs (ECCB)

Spatial transcriptomics technologies, which generate a spatial map of gene activity, can deepen the understanding of tissue architecture and its molecular underpinnings in health and disease. However, the high cost makes these technologies difficult to use in practice. Histological images co-registered with targeted tissues are more affordable and routinely generated in many research and clinical studies. Hence, predicting spatial gene expression from the morphological clues embedded in tissue histological images provides a scalable alternative approach to decoding tissue complexity.
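As a minimal sketch of the general recipe (not the paper's adaptive architecture), a patch encoder produces per-spot features and a graph convolution over the spot adjacency predicts expression; the encoder, the adjacency construction, and all layer sizes are assumptions.

```python
# Toy spatial GNN: histology patch features -> per-spot gene expression.
import torch
import torch.nn as nn

class SpotGNN(nn.Module):
    def __init__(self, in_dim=512, hidden=256, n_genes=250):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hidden)
        self.lin2 = nn.Linear(hidden, n_genes)

    def forward(self, x, adj):
        # x: (n_spots, in_dim) patch features; adj: (n_spots, n_spots), row-normalized.
        h = torch.relu(self.lin1(adj @ x))   # aggregate neighboring spots
        return self.lin2(adj @ h)            # predicted expression per spot

# Usage: features from a patch encoder, adjacency from spot coordinates (e.g. k-NN).
x = torch.randn(100, 512)
adj = torch.eye(100)  # placeholder; build from spatial k-NN in practice
pred = SpotGNN()(x, adj)  # (100, 250) predicted gene expression
```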

Introducing the Trustworthy Generative AI Project: Pioneering the Future of Compositional Generation and Reasoning

We are thrilled to announce the launch of our latest research initiative, the Trustworthy Generative AI Project. This ambitious project is set to revolutionize how we interact with multimodal content by developing cutting-edge generative models capable of compositional generation and reasoning across text, images, reports, and even 3D videos.

LLMs and MI Bring Innovation to Material Development Platforms

In this paper, we introduce efforts to apply large language models (LLMs) to the field of material development. NEC is advancing the development of a material development platform. By applying core technologies corresponding to two material development steps, namely investigation activities (Read paper/patent) and experimental planning (Design Experiment Plan), the platform organizes documents such as papers and reports, as well as data such as experimental results, and then presents them to users in an interactive way. In addition, with techniques that reflect physical and chemical principles in machine learning models, AI can learn even with limited data and accurately predict material properties. Through this platform, we aim to achieve the seamless integration of materials informatics (MI) with a vast body of industry literature and knowledge, thereby bringing innovation to the material development process.
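As one illustration of folding physical principles into a model (a generic physics-informed pattern, not NEC's specific technique), a property predictor can be regularized toward a known constraint; the monotonicity-in-temperature prior below is purely hypothetical.

```python
# Generic physics-informed regularization sketch for property prediction.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))

def loss_fn(features, targets, lam=0.1):
    data_loss = nn.functional.mse_loss(model(features), targets)
    # Hypothetical physics prior: the property should not decrease as
    # feature 0 (say, temperature) increases; penalize negative slopes.
    feats = features.detach().requires_grad_(True)
    grad = torch.autograd.grad(model(feats).sum(), feats, create_graph=True)[0]
    physics_loss = torch.relu(-grad[:, 0]).mean()
    return data_loss + lam * physics_loss
```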

Weakly-Supervised Temporal Action Localization with Multi-Modal Plateau Transformers

Weakly Supervised Temporal Action Localization (WSTAL) aims to jointly localize and classify action segments in untrimmed videos with only video-level annotations. To leverage video-level annotations, most existing methods adopt the multiple-instance learning paradigm, where frame/snippet-level action predictions are first produced and then aggregated to form a video-level prediction. Although there have been attempts to improve snippet-level predictions by modeling temporal relationships, we argue that those implementations have not sufficiently exploited such information. In this paper, we propose Multi-Modal Plateau Transformers (M2PT) for WSTAL by simultaneously exploiting temporal relationships among snippets, complementary information across data modalities, and temporal coherence among consecutive snippets. Specifically, M2PT explores a dual-Transformer architecture for the RGB and optical-flow modalities, which models intra-modality temporal relationships with a self-attention mechanism and inter-modality temporal relationships with a cross-attention mechanism. To capture the temporal coherence whereby consecutive snippets are supposed to be assigned the same action, M2PT deploys a Plateau model to refine the temporal localization of action segments. Experimental results on popular benchmarks demonstrate that our proposed M2PT achieves state-of-the-art performance.
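A simplified sketch of the dual-branch idea, with self-attention within each modality and cross-attention between the RGB and optical-flow branches; the dimensions and single-layer depth are assumptions, and the Plateau refinement is omitted.

```python
# Toy dual-modality Transformer block in the spirit of the description above.
import torch
import torch.nn as nn

class DualModalityBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.self_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_flow = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_flow = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb, flow):
        # Intra-modality temporal modeling (self-attention per branch).
        rgb = rgb + self.self_rgb(rgb, rgb, rgb)[0]
        flow = flow + self.self_flow(flow, flow, flow)[0]
        # Inter-modality exchange: each branch attends to the other.
        rgb = rgb + self.cross_rgb(rgb, flow, flow)[0]
        flow = flow + self.cross_flow(flow, rgb, rgb)[0]
        return rgb, flow

rgb, flow = torch.randn(2, 100, 256), torch.randn(2, 100, 256)
out_rgb, out_flow = DualModalityBlock()(rgb, flow)  # per-snippet features
```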