Martin Min NEC Labs America

Martin Renqiang Min

Department Head

Machine Learning

Posts

Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment

Learning to localize temporal boundaries of procedure steps in instructional videos is challenging due to the limited availability of annotated large-scale training videos. Recent works focus on learning the cross-modal alignment between video segments and ASR-transcribed narration texts through contrastive learning. However, these methods fail to account for alignment noise, i.e., narrations that are irrelevant to the instructional task in the video and unreliable timestamps in narrations. To address these challenges, this work proposes a novel training framework. Motivated by the strong capabilities of Large Language Models (LLMs) in procedure understanding and text summarization, we first apply an LLM to filter out task-irrelevant information and summarize task-related procedure steps (LLM-steps) from narrations. To further generate reliable pseudo-matching between the LLM-steps and the video for training, we propose the Multi-Pathway Text-Video Alignment (MPTVA) strategy. The key idea is to measure alignment between LLM-steps and videos via multiple pathways, including: (1) step-narration-video alignment using narration timestamps, (2) direct step-to-video alignment based on their long-term semantic similarity, and (3) direct step-to-video alignment focusing on short-term fine-grained semantic similarity learned from general video domains. The results from different pathways are fused to generate reliable pseudo step-video matching. We conducted extensive experiments across various tasks and problem settings to evaluate our proposed method. Our approach surpasses state-of-the-art methods in three downstream tasks: procedure step grounding, step localization, and narration grounding, by 5.9%, 3.1%, and 2.8%, respectively.
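
The fusion step described above can be illustrated with a small sketch. The abstract does not say how the pathway results are combined, so the element-wise mean and the confidence threshold used here are assumptions, and the names `fuse_pathways` and `pseudo_matches` are hypothetical. Each pathway is represented as a steps-by-segments matrix of alignment scores.

```python
def fuse_pathways(pathway_scores):
    """Average several step-by-segment score matrices element-wise.

    Assumption: simple mean fusion; the abstract does not specify the rule.
    """
    n_paths = len(pathway_scores)
    n_steps = len(pathway_scores[0])
    n_segments = len(pathway_scores[0][0])
    return [
        [sum(p[i][j] for p in pathway_scores) / n_paths for j in range(n_segments)]
        for i in range(n_steps)
    ]


def pseudo_matches(fused, threshold=0.5):
    """Map each step to its best-scoring segment, keeping confident matches only.

    Assumption: the threshold value is illustrative, not taken from the paper.
    """
    matches = {}
    for step, row in enumerate(fused):
        best = max(range(len(row)), key=row.__getitem__)
        if row[best] >= threshold:
            matches[step] = best
    return matches


# Three hypothetical pathways scoring 2 steps against 3 video segments.
narration_path = [[0.9, 0.1, 0.0], [0.2, 0.3, 0.4]]
long_term_path = [[0.8, 0.2, 0.1], [0.1, 0.2, 0.3]]
short_term_path = [[0.7, 0.0, 0.2], [0.3, 0.1, 0.5]]

fused = fuse_pathways([narration_path, long_term_path, short_term_path])
matches = pseudo_matches(fused, threshold=0.5)
```

In this toy input, step 0 is matched to segment 0 because all pathways agree, while step 1 is left unmatched because no segment reaches the threshold after fusion.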

Predicting Spatially Resolved Gene Expression via Tissue Morphology using Adaptive Spatial GNNs

Spatial transcriptomics technologies, which generate a spatial map of gene activity, can deepen the understanding of tissue architecture and its molecular underpinnings in health and disease. However, the high cost makes these technologies difficult to use in practice. Histological images co-registered with targeted tissues are more affordable and routinely generated in many research and clinical studies. Hence, predicting spatial gene expression from the morphological clues embedded in tissue histological images provides a scalable alternative approach to decoding tissue complexity.

Introducing the Trustworthy Generative AI Project: Pioneering the Future of Compositional Generation and Reasoning

We are thrilled to announce the launch of our latest research initiative, the Trustworthy Generative AI Project. This ambitious project is set to revolutionize how we interact with multimodal content by developing cutting-edge generative models capable of compositional generation and reasoning across text, images, reports, and even 3D videos.

LLMs and MI Bring Innovation to Material Development Platforms

In this paper, we introduce efforts to apply large language models (LLMs) to the field of material development. NEC is advancing the development of a material development platform. By applying core technologies corresponding to two material development steps, namely investigation activities (Read paper/patent) and experimental planning (Design Experiment Plan), the platform organizes documents such as papers and reports as well as data such as experimental results, and then presents them to users interactively. In addition, with techniques that incorporate physical and chemical principles into machine learning models, AI can learn even with limited data and accurately predict material properties. Through this platform, we aim to achieve the seamless integration of materials informatics (MI) with a vast body of industry literature and knowledge, thereby bringing innovation to the material development process.

Weakly-Supervised Temporal Action Localization with Multi-Modal Plateau Transformers

Weakly Supervised Temporal Action Localization (WSTAL) aims to jointly localize and classify action segments in untrimmed videos with only video-level annotations. To leverage video-level annotations, most existing methods adopt the multiple instance learning paradigm, where frame/snippet-level action predictions are first produced and then aggregated to form a video-level prediction. Although there have been attempts to improve snippet-level predictions by modeling temporal relationships, we argue that those implementations have not sufficiently exploited such information. In this paper, we propose Multi-Modal Plateau Transformers (M2PT) for WSTAL, which simultaneously exploit temporal relationships among snippets, complementary information across data modalities, and temporal coherence among consecutive snippets. Specifically, M2PT explores a dual-Transformer architecture for RGB and optical flow modalities, which models intra-modality temporal relationships with a self-attention mechanism and inter-modality temporal relationships with a cross-attention mechanism. To capture temporal coherence, i.e., that consecutive snippets should be assigned the same action, M2PT deploys a Plateau model to refine the temporal localization of action segments. Experimental results on popular benchmarks demonstrate that our proposed M2PT achieves state-of-the-art performance.
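
The multiple instance learning aggregation mentioned above (snippet-level predictions pooled into a video-level prediction) can be sketched as follows. Top-k mean pooling is one common choice for this step; it is an assumption here, not necessarily what M2PT uses, and the function name `video_level_scores` is hypothetical.

```python
def video_level_scores(snippet_scores, k):
    """Pool snippet-level class scores into one video-level score per class.

    snippet_scores: list of per-snippet lists, one score per action class.
    Assumption: top-k mean pooling, a common MIL aggregation rule.
    """
    n_classes = len(snippet_scores[0])
    k = max(1, min(k, len(snippet_scores)))  # clamp k to a valid range
    pooled = []
    for c in range(n_classes):
        # Keep the k highest-scoring snippets for this class.
        top_k = sorted((s[c] for s in snippet_scores), reverse=True)[:k]
        pooled.append(sum(top_k) / k)
    return pooled


# Four snippets, two action classes (toy values).
snippets = [[1.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.0, 0.25]]
video_scores = video_level_scores(snippets, k=2)
```

Because only the video-level label is available during training, a loss would be applied to `video_scores` rather than to the individual snippet predictions.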

Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos

In this paper, we explore the capability of an agent to construct a logical sequence of action steps, thereby assembling a strategic procedural plan. This plan is crucial for navigating from an initial visual observation to a target visual outcome, as depicted in real-life instructional videos. Existing works have attained partial success by extensively leveraging various sources of information available in the datasets, such as heavy intermediate visual observations, procedural names, or natural language step-by-step instructions, for features or supervision signals. However, the task remains formidable due to the implicit causal constraints in the sequencing of steps and the variability inherent in multiple feasible plans. To tackle these intricacies that previous efforts have overlooked, we propose to enhance the agent’s capabilities by infusing it with procedural knowledge. This knowledge, sourced from training procedure plans and structured as a directed weighted graph, equips the agent to better navigate the complexities of step sequencing and its potential variations. We name our approach KEPP, a novel Knowledge-Enhanced Procedure Planning system, which harnesses a probabilistic procedural knowledge graph extracted from training data, effectively acting as a comprehensive textbook for the training domain. Experimental evaluations across three widely used datasets under settings of varying complexity reveal that KEPP attains superior state-of-the-art results while requiring only minimal supervision. Code and trained model are available at https://github.com/Ravindu-Yasas-Nagasinghe/KEPP
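
One way to picture a probabilistic procedural knowledge graph built from training plans is as a table of step-to-step transition frequencies. The sketch below is illustrative only: normalizing edge weights by outgoing transition counts is an assumption, the function name `build_procedure_graph` is hypothetical, and the actual construction in the KEPP repository may differ.

```python
from collections import defaultdict


def build_procedure_graph(plans):
    """Build a directed weighted graph of step transitions from training plans.

    plans: list of step sequences (each a list of step names).
    Returns {source_step: {next_step: probability}}.
    Assumption: weights are relative frequencies of observed transitions.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for plan in plans:
        # Count each consecutive (source, next) pair in the plan.
        for src, dst in zip(plan, plan[1:]):
            counts[src][dst] += 1

    graph = {}
    for src, outgoing in counts.items():
        total = sum(outgoing.values())
        graph[src] = {dst: n / total for dst, n in outgoing.items()}
    return graph


# Toy training plans (hypothetical step names).
plans = [
    ["crack egg", "whisk", "pour"],
    ["crack egg", "whisk", "season"],
    ["crack egg", "season", "pour"],
    ["crack egg", "whisk", "pour"],
]
graph = build_procedure_graph(plans)
```

With these plans, the edge from "crack egg" to "whisk" gets weight 0.75 and the edge to "season" gets 0.25, so a planner consulting the graph would treat whisking as the more likely next step.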

NEC Labs America Team Attending CVPR 2024 in Seattle

Our team will be attending CVPR 2024 (the IEEE/CVF Conference on Computer Vision & Pattern Recognition) from June 17-21! See you there at the NEC Labs America Booth 1716! Stay tuned for more information about our participation.

Impeller: A Path-based Heterogeneous Graph Learning Method for Spatial Transcriptomic Data Imputation

Recent advances in spatial transcriptomics allow spatially resolved gene expression measurements with cellular or even sub-cellular resolution, directly characterizing the complex spatiotemporal gene expression landscape and cell-to-cell interactions in their native microenvironments. Due to technology limitations, most spatial transcriptomic technologies still yield incomplete expression measurements with excessive missing values. Therefore, gene imputation is critical to filling in missing data, enhancing resolution, and improving overall interpretability. However, existing methods either require additional matched single-cell RNA-seq data, which is rarely available, or ignore spatial proximity or expression similarity information.

Disentangled Wasserstein Autoencoder for T-Cell Receptor Engineering

In protein biophysics, the separation between the functionally important residues (forming the active site or binding surface) and those that create the overall structure (the fold) is a well-established and fundamental concept. Identifying and modifying those functional sites is critical for protein engineering but computationally nontrivial, and requires significant domain knowledge. To automate this process from a data-driven perspective, we propose a disentangled Wasserstein autoencoder with an auxiliary classifier, which isolates the function-related patterns from the rest with theoretical guarantees. This enables one-pass protein sequence editing and improves the understanding of the resulting sequences and editing actions involved. To demonstrate its effectiveness, we apply it to T-cell receptors (TCRs), a well-studied structure-function case. We show that our method can be used to alter the function of TCRs without changing the structural backbone, outperforming several competing methods in generation quality and efficiency, and requiring only 10% of the running time needed by baseline models. To our knowledge, this is the first approach that utilizes disentangled representations for TCR engineering.

Few-Shot Video Classification via Representation Fusion and Promotion Learning

Recent few-shot video classification (FSVC) works achieve promising performance by capturing similarity across support and query samples with different temporal alignment strategies or by learning discriminative features via Transformer blocks within each episode. However, they ignore two important issues: a) It is difficult to capture rich intrinsic action semantics from a limited number of support instances within each task. b) Redundant or irrelevant frames in videos easily weaken the positive influence of discriminative frames. To address these two issues, this paper proposes a novel Representation Fusion and Promotion Learning (RFPL) mechanism with two sub-modules: meta-action learning (MAL) and reinforced image representation (RIR). Concretely, during the training stage, we perform online learning to seek a task-shared meta-action bank that enriches task-specific action representations by injecting global knowledge. In addition, we exploit reinforcement learning to obtain the importance of each frame and refine the representation. This operation maximizes the contribution of discriminative frames to further capture the similarity of support and query samples from the same category. Our RFPL framework is flexible enough to be integrated with many existing FSVC methods. Extensive experiments show that RFPL significantly enhances the performance of existing FSVC models when integrated with them.
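
The idea of weighting frames by importance when forming a video representation can be sketched as below. This is not the RIR module itself: the reinforcement learning that produces the importance scores is omitted, the scores are taken as given, and the softmax-weighted average and the names `softmax` and `pool_frames` are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Convert raw scores to weights that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool_frames(frame_features, importance_scores):
    """Combine per-frame feature vectors into one video-level vector.

    Assumption: frames are averaged with softmax-normalized importance
    weights, so discriminative frames contribute more to the result.
    """
    weights = softmax(importance_scores)
    dim = len(frame_features[0])
    return [
        sum(w * f[d] for w, f in zip(weights, frame_features))
        for d in range(dim)
    ]


# Two toy frames with 2-dimensional features.
frames = [[1.0, 0.0], [0.0, 1.0]]
uniform = pool_frames(frames, [0.0, 0.0])    # equal importance
skewed = pool_frames(frames, [10.0, -10.0])  # first frame dominates
```

With equal scores the result is the plain mean of the frames; with a large score gap the pooled vector is nearly identical to the high-importance frame, which is the effect of down-weighting redundant or irrelevant frames.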