Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment
Learning to localize the temporal boundaries of procedure steps in instructional videos is challenging due to the limited availability of large-scale annotated training videos. Recent works focus on learning the cross-modal alignment between video segments and ASR-transcribed narration texts through contrastive