Weakly-Supervised Temporal Action Localization with Multi-Modal Plateau Transformers

Publication Date: 6/18/2024

Event: CVPR 2024 3rd Workshop on Learning with Limited Labelled Data for Image and Video Understanding

Reference: pp. 2704-2713, 2024

Authors: Xin Hu, Tulane University; Kai Li, NEC Laboratories America, Inc.; Deep Patel, NEC Laboratories America, Inc.; Erik Kruus, NEC Laboratories America, Inc.; Martin Renqiang Min, NEC Laboratories America, Inc.; Zhengming Ding, Tulane University

Abstract: Weakly-Supervised Temporal Action Localization (WSTAL) aims to jointly localize and classify action segments in untrimmed videos with only video-level annotations. To leverage video-level annotations, most existing methods adopt the multiple-instance learning paradigm, where frame/snippet-level action predictions are first produced and then aggregated to form a video-level prediction. Although there have been attempts to improve snippet-level predictions by modeling temporal relationships, we argue that those implementations have not sufficiently exploited such information. In this paper, we propose Multi-Modal Plateau Transformers (M2PT) for WSTAL, simultaneously exploiting temporal relationships among snippets, complementary information across data modalities, and temporal coherence among consecutive snippets. Specifically, M2PT explores a dual-Transformer architecture for the RGB and optical-flow modalities, which models intra-modality temporal relationships with a self-attention mechanism and inter-modality temporal relationships with a cross-attention mechanism. To capture the temporal coherence whereby consecutive snippets should be assigned the same action, M2PT deploys a Plateau model to refine the temporal localization of action segments. Experimental results on popular benchmarks demonstrate that our proposed M2PT achieves state-of-the-art performance.
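The dual-Transformer idea in the abstract can be illustrated with a minimal NumPy sketch: self-attention models temporal relationships within each modality, cross-attention lets the RGB and optical-flow streams query each other, and a plateau-shaped weight encourages consecutive snippets to share one action. This is a toy illustration, not the authors' implementation; the snippet dimensions, the single-head attention, and the two-sigmoid plateau form are all assumptions made here for brevity, and the paper's exact Plateau model may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    """Single-head scaled dot-product attention over a snippet sequence."""
    scores = query @ key.T / np.sqrt(query.shape[-1])
    return softmax(scores) @ value

def plateau(t, center, width, slope):
    """Plateau-shaped temporal weight: the product of a rising and a
    falling sigmoid, near 1 inside [center - width/2, center + width/2]
    and near 0 outside. A generic form assumed for illustration."""
    rise = 1.0 / (1.0 + np.exp(-slope * (t - (center - width / 2))))
    fall = 1.0 / (1.0 + np.exp(slope * (t - (center + width / 2))))
    return rise * fall

# Toy snippet features: T snippets, D-dim embedding per modality.
T, D = 8, 16
rng = np.random.default_rng(0)
rgb = rng.standard_normal((T, D))
flow = rng.standard_normal((T, D))

# Intra-modality temporal relationships via self-attention.
rgb_self = attention(rgb, rgb, rgb)
flow_self = attention(flow, flow, flow)

# Inter-modality relationships via cross-attention: each stream
# queries the other, exchanging complementary cues.
rgb_out = attention(rgb_self, flow_self, flow_self)
flow_out = attention(flow_self, rgb_self, rgb_self)

# Plateau weights over snippet indices, pushing consecutive snippets
# inside a candidate segment toward the same action assignment.
t = np.arange(T, dtype=float)
w = plateau(t, center=4.0, width=4.0, slope=5.0)
print(rgb_out.shape, flow_out.shape)
```

Here each modality first refines its own snippet sequence, then attends to the other stream; the plateau weight is high for snippets inside the hypothesized segment and falls off sharply outside it.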

Publication Link: https://openaccess.thecvf.com/content/CVPR2024W/L3D-IVU/papers/Hu_Weakly-Supervised_Temporal_Action_Localization_with_Multi-Modal_Plateau_Transformers_CVPRW_2024_paper.pdf