Machine Learning

Read the latest publications from our Machine Learning team, whose world-class researchers have been at the forefront of machine learning developments, including deep learning, support vector machines, and semantic analysis, for over a decade. We develop innovative technologies integrated into NEC’s products and services. Machine learning is the critical technology for data analytics and artificial intelligence. Recent progress in this field opens opportunities for various new applications.

Posts

Interpretability and Implicit Model Semantics in Biomedicine and Deep Learning

We introduce a framework to analyse interpretability in deep learning, by drawing on a formal notion of model semantics from the philosophy of science. We argue that interpretability is only one aspect of a model’s semantics and illustrate our framework with examples from biomedicine.

Offline to Online Streaming Distillation of Action Detection Models

Vision Transformers (ViTs) have achieved state-of-the-art performance in offline video action detection, but their reliance on processing fixed-size clips with full spatio-temporal attention makes them computationally expensive and ill-suited for real-time streaming applications due to massive computational redundancy. This paper introduces a novel framework to adapt these powerful offline models into efficient, online student models through knowledge distillation. We propose two causal, streaming-friendly attention architectures that replace the full self-attention mechanism: (1) a lightweight Temporal Shift Attention that integrates past context with minimal overhead, and (2) a Decomposed Spatial-Temporal Attention that combines intra-frame spatial attention with an LSTM for temporal modeling. Both architectures utilize caching to eliminate redundant operations on a frame-by-frame basis. To maximize knowledge transfer, we introduce an uncertainty-guided distillation process, which formulates the training as a multi-task learning problem. Our resulting models demonstrate significant efficiency gains, achieving up to a 4x improvement in latency and throughput compared to the original offline teacher while ensuring state-of-the-art performance for online methods. Our work provides a practical and effective methodology for deploying high-accuracy transformer models in latency-sensitive, real-world video analysis systems.
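As a rough illustration of how caching lets a streaming model reuse past context one frame at a time, here is a minimal sketch of an online temporal channel shift, in the style of the well-known Temporal Shift Module. The class name, the `shift_channels` parameter, and the flat list-of-floats feature layout are all hypothetical; the abstract does not specify how the paper's Temporal Shift Attention is implemented, so this should be read as an analogy under those assumptions, not as the authors' layer.

```python
from typing import List, Optional


class StreamingTemporalShift:
    """Toy online temporal shift (hypothetical helper, not the paper's layer).

    For each incoming frame feature, the first `shift_channels` channels are
    replaced by the same channels cached from the previous frame, so past
    context is mixed in without re-processing earlier frames.
    """

    def __init__(self, shift_channels: int) -> None:
        self.shift_channels = shift_channels
        self.cache: Optional[List[float]] = None  # channels saved from the last frame

    def step(self, frame_feat: List[float]) -> List[float]:
        k = self.shift_channels
        # No history on the first frame: fill the shifted channels with zeros.
        past = [0.0] * k if self.cache is None else self.cache
        out = past + frame_feat[k:]
        # Cache the current frame's first k channels for the next call.
        self.cache = frame_feat[:k]
        return out


shift = StreamingTemporalShift(shift_channels=2)
first = shift.step([1.0, 2.0, 3.0, 4.0])
second = shift.step([5.0, 6.0, 7.0, 8.0])
```

Each call touches only the current frame and a small cache, which is the property that makes this kind of design causal and suitable for frame-by-frame inference.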

Logical Guidance for the Exact Composition of Diffusion Models

We propose LOGDIFF (Logical Guidance for the Exact Composition of Diffusion Models), a guidance framework for diffusion models that enables principled constrained generation with complex logical expressions at inference time. We study when exact score-based guidance for complex logical formulas can be obtained from guidance signals associated with atomic properties. First, we derive an exact Boolean calculus that provides a sufficient condition for exact logical guidance. Specifically, if a formula admits a circuit representation in which conjunctions combine conditionally independent subformulas and disjunctions combine subformulas that are either conditionally independent or mutually exclusive, exact logical guidance is achievable. In this case, the guidance signal can be computed exactly from atomic scores and posterior probabilities using an efficient recursive algorithm. Moreover, we show that, for commonly encountered classes of distributions, any desired Boolean formula is compilable into such a circuit representation. Second, by combining atomic guidance scores with posterior probability estimates, we introduce a hybrid guidance approach that bridges classifier guidance and classifier-free guidance, applicable to both compositional logical guidance and standard conditional generation. We demonstrate the effectiveness of our framework on multiple image and protein structure generation tasks.
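The composition rules described above can be made concrete with a small sketch. It assumes each subformula A is summarized by a posterior probability p_A = p(A | x_t) and a score s_A = grad log p(A | x_t), and it applies the ordinary product and sum rules for gradients of log-probabilities. The function names and the (probability, score) tuple representation are hypothetical, and the negation rule is included only as a standard identity; the paper's recursive algorithm may be organized differently.

```python
from typing import List, Tuple

Vec = List[float]
Node = Tuple[float, Vec]  # (posterior probability p, score = grad log p)


def conj_independent(a: Node, b: Node) -> Node:
    """A AND B, conditionally independent: p = pa*pb, score = sa + sb."""
    (pa, sa), (pb, sb) = a, b
    return pa * pb, [x + y for x, y in zip(sa, sb)]


def disj_exclusive(a: Node, b: Node) -> Node:
    """A OR B, mutually exclusive: p = pa + pb, score is the p-weighted mean."""
    (pa, sa), (pb, sb) = a, b
    p = pa + pb
    return p, [(pa * x + pb * y) / p for x, y in zip(sa, sb)]


def disj_independent(a: Node, b: Node) -> Node:
    """A OR B, conditionally independent: p = pa + pb - pa*pb."""
    (pa, sa), (pb, sb) = a, b
    p = pa + pb - pa * pb
    # grad p = pa*(1 - pb)*sa + pb*(1 - pa)*sb, then divide by p.
    return p, [(pa * (1 - pb) * x + pb * (1 - pa) * y) / p for x, y in zip(sa, sb)]


def negate(a: Node) -> Node:
    """NOT A: p = 1 - pa, score = -pa*sa / (1 - pa)."""
    pa, sa = a
    p = 1.0 - pa
    return p, [-pa * x / p for x in sa]


# Two hypothetical atomic properties with made-up probabilities and scores.
A: Node = (0.5, [1.0, 0.0])
B: Node = (0.25, [0.0, 2.0])
both = conj_independent(A, B)
either = disj_independent(A, B)
```

A useful sanity check is De Morgan's law: `disj_independent(A, B)` should agree with `negate(conj_independent(negate(A), negate(B)))` in both probability and score.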

Object-Aware 4D Human Motion Generation

Recent advances in video diffusion models have enabled the generation of high-quality videos. However, these videos still suffer from unrealistic deformations, semantic violations, and physical inconsistencies that are largely rooted in the absence of 3D physical priors. To address these challenges, we propose an object-aware 4D human motion generation framework grounded in 3D Gaussian representations and motion diffusion priors. With pre-generated 3D humans and objects, our method, Motion Score Distilled Interaction (MSDI), employs the spatial and prompt semantic information in large language models (LLMs) and motion priors through the proposed Motion Diffusion Score Distillation Sampling (MSDS). The combination of MSDS and LLMs enables our spatial-aware motion optimization, which distills score gradients from pre-trained motion diffusion models, to refine human motion while respecting object and semantic constraints. Unlike prior methods requiring joint training on limited interaction datasets, our zero-shot approach avoids retraining and generalizes to out-of-distribution object-aware human motions. Experiments demonstrate that our framework produces natural and physically plausible human motions that respect 3D spatial context, offering a scalable solution for realistic 4D generation.

EditGRPO: Reinforcement Learning with Post-Rollout Edits for Clinically Accurate Chest X-Ray Report Generation

Radiology report generation requires advanced medical image analysis, effective temporal reasoning, and accurate text generation. Although recent innovations, particularly multimodal large language models, have shown improved performance, their supervised fine-tuning (SFT) objective is not explicitly aligned with clinical efficacy. In this work, we introduce EditGRPO, a mixed-policy reinforcement learning algorithm designed specifically to optimize the generation through clinically motivated rewards. EditGRPO integrates on-policy exploration with off-policy guidance by injecting sentence-level detailed corrections during training rollouts. This mixed-policy approach addresses the exploration dilemma and sampling efficiency issues typically encountered in RL. Applied to Qwen2.5-VL-3B, EditGRPO outperforms both SFT and vanilla GRPO baselines, achieving an average improvement of 3.4% in clinical metrics across four major datasets. Notably, EditGRPO also demonstrates superior out-of-domain generalization, with an average performance gain of 5.9% on unseen datasets.
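To give a feel for how an edited rollout can steer a GRPO-style update, the sketch below computes group-relative advantages (each reward minus the group mean, divided by the group standard deviation) over a group that mixes sampled reports with one corrected report. The reward values are invented, the use of the population standard deviation is an assumption, and the abstract does not say how EditGRPO weights or scores its off-policy samples, so this is only an illustration of the general idea.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical clinical rewards for three sampled (on-policy) reports.
on_policy_rewards = [0.40, 0.55, 0.35]
# Hypothetical reward for one report after sentence-level corrections.
edited_rewards = [0.70]

advantages = group_relative_advantages(on_policy_rewards + edited_rewards)
```

In this toy group the corrected report receives the largest advantage, so the policy gradient would push the model toward it, while the advantages across the group still sum to approximately zero.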

To Err Is Human: Systematic Quantification of Errors in Published AI Papers via LLM Analysis

How many mistakes do published AI papers contain? Peer-reviewed publications form the foundation upon which new research and knowledge are built. Errors that persist in the literature can propagate unnoticed, creating confusion in follow-up studies and complicating reproducibility. The accelerating pace of research and the increasing demands on the peer-review system make such mistakes harder to detect and avoid. To address this, we developed a Paper Correctness Checker based on GPT-5 to systematically identify mistakes in papers previously published at top AI conferences and journals. Our analysis focuses on objective mistakes (e.g., errors in formulas, derivations, calculations, figures, and tables) that have a clearly verifiable ground truth. We intentionally exclude subjective considerations such as novelty, importance, or writing quality. We find that published papers contain a non-negligible number of objective mistakes and that the average number of mistakes per paper has increased over time: from 3.8 in NeurIPS 2021 to 5.9 in NeurIPS 2025 (55.3% increase); from 4.1 in ICLR 2018 to 5.2 in ICLR 2025; and from 5.0 in TMLR 2022/23 to 5.5 in TMLR 2025. Human experts reviewed 316 potential mistakes identified by the AI Checker and confirmed that 263 were actual mistakes, corresponding to a precision of 83.2%. While most identified issues are relatively minor, correcting them would reduce confusion in the literature and strengthen reproducibility. The AI Checker also surfaced potentially more substantive mistakes that could affect the interpretation of results. Moreover, we show that the AI Checker can propose correct fixes for 75.8% of the identified mistakes. Overall, this study highlights the potential of frontier LLMs to detect and correct objective mistakes in published papers, helping to establish a firmer foundation of knowledge.
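The headline figures in this abstract can be reproduced with a few lines of arithmetic. The sketch below recomputes the stated 55.3% NeurIPS increase and the 83.2% precision from the numbers given; the ICLR and TMLR percentages are not stated in the abstract and are simply derived here from the reported averages. The helper name `pct_increase` is hypothetical.

```python
def pct_increase(old: float, new: float) -> float:
    """Relative increase from old to new, in percent."""
    return 100.0 * (new - old) / old


# Average mistakes per paper, as reported in the abstract.
neurips_increase = pct_increase(3.8, 5.9)   # NeurIPS 2021 to NeurIPS 2025
iclr_increase = pct_increase(4.1, 5.2)      # ICLR 2018 to ICLR 2025 (derived)
tmlr_increase = pct_increase(5.0, 5.5)      # TMLR 2022/23 to TMLR 2025 (derived)

# Precision of the checker: confirmed mistakes over reviewed candidates.
precision = 100.0 * 263 / 316

print(f"NeurIPS: {neurips_increase:.1f}%")
print(f"ICLR:    {iclr_increase:.1f}%")
print(f"TMLR:    {tmlr_increase:.1f}%")
print(f"Precision: {precision:.1f}%")
```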

Quantitative Bounds for Length Generalization in Transformers

We study the problem of length generalization (LG) in transformers: the ability of a model trained on shorter sequences to maintain performance when evaluated on much longer, previously unseen inputs. Prior work by Huang et al. (2025) established that transformers eventually achieve length generalization once the training sequence length exceeds some finite threshold, but left open the question of how large it must be. In this work, we provide the first quantitative bounds on the required training length for length generalization to occur. Motivated by previous empirical and theoretical work, we analyze LG in several distinct problem settings: error control vs. average error control over an input distribution, infinite-precision softmax attention vs. finite-precision attention (which reduces to an argmax) in the transformer, and one- vs. two-layer transformers. In all scenarios, we prove that LG occurs when the internal behavior of the transformer on longer sequences can be “simulated” by its behavior on shorter sequences seen during training. Our bounds give qualitative estimates for the length of training data required for a transformer to generalize, and we verify these insights empirically. These results sharpen our theoretical understanding of the mechanisms underlying extrapolation in transformers, and formalize the intuition that richer training data is required for generalization on more complex tasks.

DiscussLLM: Teaching Large Language Models When to Speak

Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like text, yet they largely operate as reactive agents, responding only when directly prompted. This passivity creates an “awareness gap,” limiting their potential as truly collaborative partners in dynamic human discussions. We introduce DiscussLLM, a framework designed to bridge this gap by training models to proactively decide not just what to say, but critically, when to speak. Our primary contribution is a scalable two-stage data generation pipeline that synthesizes a large-scale dataset of realistic multi-turn human discussions. Each discussion is annotated with one of five intervention types (e.g., Factual Correction, Concept Definition) and contains an explicit conversational trigger where an AI intervention adds value. By training models to predict a special silent token when no intervention is needed, they learn to remain quiet until a helpful contribution can be made. We explore two architectural baselines: an integrated end-to-end model and a decoupled classifier-generator system optimized for low-latency inference. We evaluate these models on their ability to accurately time interventions and generate helpful responses, paving the way for more situationally aware and proactive conversational AI.
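The silent-token mechanism can be pictured with a short sketch of the inference-time gate: the model is called on every new turn, and its output is surfaced only when it is something other than the silent token. The token string `<silent>`, the function names, and the keyword-based `toy_model` stand-in are all hypothetical; a real system would call a trained model in its place.

```python
from typing import Callable, List, Optional

SILENT_TOKEN = "<silent>"  # hypothetical spelling of the special silent token


def maybe_intervene(
    history: List[str], model: Callable[[List[str]], str]
) -> Optional[str]:
    """Return the model's contribution, or None if it chose to stay silent."""
    output = model(history)
    if output.strip() == SILENT_TOKEN:
        return None
    return output


def toy_model(history: List[str]) -> str:
    """Stand-in for a trained model: speaks only on one known false claim."""
    last_turn = history[-1].lower()
    if "sun orbits the earth" in last_turn:
        return "Factual Correction: the Earth orbits the Sun."
    return SILENT_TOKEN


quiet = maybe_intervene(["Shall we meet at noon?"], toy_model)
spoken = maybe_intervene(["I read that the Sun orbits the Earth."], toy_model)
```

In the decoupled baseline mentioned above, the same gate could be implemented by a lightweight classifier that decides whether to speak before a separate generator is invoked, which avoids running the generator on turns where no intervention is needed.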

Identifying Combinatorial Regulatory Genes for Cell Fate Decision via Reparameterizable Subset Explanations

Cell fate decisions are highly coordinated processes governed by complex interactions among numerous regulatory genes, while disruptions in these mechanisms can lead to developmental abnormalities and disease. Traditional methods often fail to capture such combinatorial interactions, limiting their ability to fully model cell fate dynamics. Here, we introduce MetaVelo, a global feature explanation framework for identifying key regulatory gene sets influencing cell fate transitions. MetaVelo models these transitions as a black-box function and employs a differentiable neural ordinary differential equation (ODE) surrogate to enable efficient optimization. By reparameterizing the problem as a controllable data generation process, MetaVelo overcomes the challenges posed by the nondifferentiable nature of cell fate dynamics. Benchmarking across diverse stand-alone and longitudinal single-cell RNA-seq datasets and three black-box cell fate models demonstrates its superiority over 12 baseline methods in predicting developmental trajectories and identifying combinatorial regulatory gene sets. MetaVelo further distinguishes independent from synergistic regulatory genes, offering novel insights into the gene interactions governing cell fate. With the growing availability of high-resolution single-cell data, MetaVelo provides a scalable and effective framework fo

On Synthesizing Data for Context Attribution in Question Answering

Question Answering (QA) accounts for a significant portion of LLM usage “in the wild”. However, LLMs sometimes produce false or misleading responses, also known as hallucinations. Therefore, grounding the generated answers in contextually provided information (i.e., providing evidence for the generated text) is paramount for LLMs’ trustworthiness. Providing this information is the task of context attribution. In this paper, we systematically study LLM-based approaches for this task, namely, we investigate (i) zero-shot inference, (ii) LLM ensembling, and (iii) fine-tuning of small LMs on synthetic data generated by larger LLMs. Our key contribution is SYNQA: a novel generative strategy for synthesizing context attribution data. Given selected context sentences, an LLM generates QA pairs that are supported by these sentences. This leverages LLMs’ natural strengths in text generation while ensuring clear attribution paths in the synthetic training data. We show that the attribution data synthesized via SYNQA is highly effective for fine-tuning small LMs for context attribution in different QA tasks and domains. Finally, with a user study, we validate the usefulness of small, efficient LMs (fine-tuned on synthetic data from SYNQA) in context attribution for QA.