Machine LearningOur Machine Learning team has been at the forefront of machine learning developments, including deep learning, support vector machines, and semantic analysis, for over a decade. We develop innovative technologies integrated into NEC’s products and services. Machine learning is the critical technology for data analytics and artificial intelligence. Recent progress in this field opens opportunities for various new applications.

Deep learning will maintain prominence with more robust model architectures, training methods, and optimization techniques. Enhanced interpretability and explainability will be imperative, especially for AI systems in critical domains like healthcare and finance. Addressing bias and ensuring fairness in AI algorithms will be a top priority, leading to the development of tools and guidelines for ethical AI. Federated learning, quantum computing’s potential impact, and the growth of edge computing will diversify ML applications.

Natural language processing will continue to advance, driving progress in conversational AI, while healthcare, finance, education, and creative industries will witness profound AI integration. As quantum computing matures, it could revolutionize machine learning, while edge computing and federated learning will expand AI’s reach across various domains. Our machine learning research will produce innovation across industries, including more accurate medical diagnoses, safer autonomous systems, and efficient energy use while enabling personalized education and AI-generated creativity.

Read our news and publications from our world-class team of researchers from our Machine Learning department.

Posts

Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection

Action detection aims to detect (recognize and localize) human actions spatially and temporally in videos. Existing approaches focus on the closed-set setting where an action detector is trained and tested on videos from a fixed set of action categories. However, this constrained setting is not viable in an open world where test videos inevitably come beyond the trained action categories. In this paper, we address the practical yet challenging Open-Vocabulary Action Detection (OVAD) problem. It aims to detect any action in test videos while training a model on a fixed set of action categories. To achieve such an open-vocabulary capability, we propose a novel method OpenMixer that exploits the inherent semantics and localizability of large vision-language models (VLM) within the family of query-based detection transformers (DETR). Specifically, the OpenMixer is developed by spatial and temporal OpenMixer blocks (S-OMB and T-OMB), and a dynamically fused alignment (DFA) module. The three components collectively enjoy the merits of strong generalization from pre-trained VLMs and end-to-end learning from DETR design. Moreover, we established OVAD benchmarks under various settings, and the experimental results show that the OpenMixer performs the best over baselines for detecting seen and unseen actions.

Matching Confidences and Softened Target Occurrences for Calibration

The problem of calibrating deep neural networks (DNNs) is gaining attention, as these networks are becoming central to many real-world applications. Different attempts have been made to counter the poor calibration of DNNs. Amongst others, train-time calibration methods have unfolded as an effective class for improving model calibration. Motivated by this, we propose a novel train-time calibration method that is built on a new auxiliary loss formulation, namely multiclass alignment of confidences with the gradually softened ground truth occurrences (MACSO). It is developed on the intuition that, for a class, the gradually softened ground truth occurrences distribution is a suitable non-zero entropy signal whose better alignment withthe predicted confidences distribution is positively correlated with reducing the model calibration error. In our train-time approach, besides simply aligning the two distributions, e.g., via their means or KL divergence, we propose to quantify the linear correlation between the two distributions, which preserves the relations among them, thereby further improving the calibration performance. Finally, we also reveal that MACSO posses desirable theoretical properties. Extensive results on several challenging datasets, featuring in and out-of-domain scenarios, class imbalanced problem, and a medical image classification task, validate the efficacy of our method against state-of-the-art train-time calibration methods.

A Variational Graph Partitioning Approach to Modeling Protein Liquid-liquid Phase Separation

Graph neural networks (GNNs) have emerged as powerful tools for representation learning. Their efficacy depends on their having an optimal underlying graph. In many cases, the most relevant information comes from specific subgraphs. In this work, we introduce a GNN-based framework (graph-partitioned GNN [GP-GNN]) to partition the GNN graph to focus on the most relevant subgraphs. Our approach jointly learns task-dependent graph partitions and node representations, making it particularly effective when critical features reside within initially unidentified subgraphs. Protein liquid-liquid phase separation (LLPS) is a problem especially well-suited to GP-GNNs because intrinsically disordered regions (IDRs) are known to function as protein subdomains in it, playing a key role in the phase separation process. In this study, we demonstrate how GP-GNN accurately predicts LLPS by partitioning protein graphs into task-relevant subgraphs consistent with known IDRs. Our model achieves state-of-the-art accuracy in predicting LLPS and offers biological insights valuable for downstream investigation.

Multi-hop Evidence Pursuit Meets the Web: Team Papelo at FEVER 2024

Separating disinformation from fact on the web has long challenged both the search and the reasoning powers of humans. We show that the reasoning power of large language models (LLMs) and the retrieval power of modern search engines can be combined to automate this process and explainably verify claims. We integrate LLMs and search under a multi-hop evidence pursuit strategy. This strategy generates an initial question based on an input claim using a sequence to sequence model, searches and formulates an answer to the question, and iteratively generates follow-up questions to pursue the evidence that is missing using an LLM. We demonstrate our system on the FEVER 2024 (AVeriTeC) shared task. Compared to a strategy of generating all the questions at once, our method obtains .045 higher label accuracy and .155 higher AVeriTeC score (evaluating the adequacy of the evidence). Through ablations, we show the importance of various design choices, such as the question generation method, medium-sized context, reasoning with one document at a time, adding metadata, paraphrasing, reducing the problem to two classes, and reconsidering the final verdict. Our submitted system achieves .510 AVeriTeC score on the dev set and .477 AVeriTec score on the test set.

Variational methods for Learning Multilevel Genetic Algorithms using the Kantorovich Monad

Levels of selection and multilevel evolutionary processes are essential concepts in evolutionary theory, and yet there is a lack of common mathematical models for these core ideas. Here, we propose a unified mathematical framework for formulating and optimizing multilevel evolutionary processes and genetic algorithms over arbitrarily many levels based on concepts from category theory and population genetics. We formulate a multilevel version of the Wright-Fisher process using this approach, and we show that this model can be analyzed to clarify key features of multilevel selection. Particularly, we derive an extended multilevel probabilistic version of Price’s Equation via the Kantorovich Monad, and we use this to characterize regimes of parameter space within which selection acts antagonistically or cooperatively across levels. Finally, we show how our framework can provide a unified setting for learning genetic algorithms (GAs), and we show how we can use a Variational Optimization and a multi-level analogue of coalescent analysis to fit multilevel GAs to simulated data.

Exploring the Role of Reasoning Structures for Constructing Proofs in Multi-Step Natural Language Reasoning with Large Language Models

When performing complex multi-step reasoning tasks, the ability of Large Language Models (LLMs) to derive structured intermediate proof steps is important for ensuring that the models truly perform the desired reasoning and for improving models’ explainability. This paper is centered around a focused study: whether the current state-of-the-art generalist LLMs can leverage the structures in a few examples to better construct the proof structures with in-context learning. Our study specifically focuses on structure-aware demonstration and structure-aware pruning. We demonstrate that they both help improve performance. A detailed analysis is provided to help understand the results.

Enhancing Optical Multiplex Section QoT Estimation Using Scalable Gray-box DNN

In Optical Multiplex Section (OMS) control and optimization framework, end-to-end (Global) and span-by-span (Local) DNN gray-box strategies are compared in terms of scalability and accuracy of the output signal and noise power predictions. Experimental measurements are carried out in OMSs with increasing number of spans.

Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment

Learning to localize temporal boundaries of procedure steps in instructional videos is challenging due to the limited availability of annotated large-scale training videos. Recent works focus on learning the cross-modal alignment between video segments and ASR-transcripted narration texts through contrastive learning. However, these methods fail to account for the alignment noise, i.e., irrelevant narrations to the instructional task in videos and unreliable timestamps in narrations. To address these challenges, this work proposes a novel training framework. Motivated by the strong capabilities of Large Language Models (LLMs) in procedure understanding and text summarization, we first apply an LLM to filter out task-irrelevant information and summarize task-related procedure steps (LLM-steps) from narrations. To further generate reliable pseudo-matching between the LLM-steps and the video for training, we propose the Multi-Pathway Text-Video Alignment (MPTVA) strategy. The key idea is to measure alignment between LLM-steps and videos via multiple pathways, including: (1) step-narration-video alignment using narration timestamps, (2) direct step-to-video alignment based on their long-term semantic similarity, and (3) direct step-to-video alignment focusing on short-term fine-grained semantic similarity learned from general video domains. The results from different pathways are fused to generate reliable pseudo step-video matching. We conducted extensive experiments across various tasks and problem settings to evaluate our proposed method. Our approach surpasses state-of-the-art methods in three downstream tasks: procedure step grounding, step localization, and narration grounding by 5.9%, 3.1%, and 2.8%.

Predicting Spatially Resolved Gene Expression via Tissue Morphology using Adaptive Spatial GNNs (ECCB)

Spatial transcriptomics technologies, which generate a spatial map of gene activity, can deepen the understanding of tissue architecture and its molecular underpinnings in health and disease. However, the high cost makes these technologies difficult to use in practice. Histological images co-registered with targeted tissues are more affordable and routinely generated in many research and clinical studies. Hence, predicting spatial gene expression from the morphological clues embedded in tissue histological images provides a scalable alternative approach to decoding tissue complexity.

MCTR: Multi Camera Tracking Transformer

Multi-camera tracking plays a pivotal role in various real-world applications. While end-to-end methods have gained significant interest in single-camera tracking, multi-camera tracking remains predominantly reliant on heuristic techniques. In response to this gap, this paper introduces Multi-Camera Tracking tRansformer (MCTR), a novel end-to-end approach tailored for multi-object detection and tracking across multiple cameras with overlapping fields of view. MCTR leverages end-to-end detectors like DEtector TRansformer (DETR) to produce detections and detection embeddings independently for each camera view. The framework maintains set of track embeddings that encaplusate global information about the tracked objects, and updates them at every frame by integrating the local information from the view-specific detection embeddings. The track embeddings are probabilistically associated with detections in every camera view and frame to generate consistent object tracks. The soft probabilistic association facilitates the design of differentiable losses that enable end-to-end training of the entire system. To validate our approach, we conduct experiments on MMPTrack and AI City Challenge, two recently introduced large-scale multi-camera multi-object tracking datasets.