Cosine Similarity based Few-Shot Video Classifier with Attention-based Aggregation

Meta-learning algorithms for few-shot video recognition use complex, episodic training, yet they often fail to learn effective feature representations. In contrast, we propose a new and simpler few-shot video recognition method that does not use meta-learning, yet its performance compares well with the best meta-learning proposals. Our few-shot video classification pipeline consists of two distinct phases. In the pre-training phase, we learn a video feature extraction network that generates a feature vector for each video. A sparse sampling strategy first selects frames from the video, and we then generate a video feature vector from the sampled frames. Our proposed video feature extractor, which consists of an image feature extraction network followed by a new Transformer encoder, is trained end-to-end on a corpus of labeled video examples by attaching a classifier head that uses a cosine similarity layer instead of the traditional linear layer. Unlike prior work in meta-learning, we do not use episodic training to learn the image feature vector. Also, unlike prior work that averages frame-level feature vectors into a single video feature vector, we combine the individual frame-level feature vectors with a new Transformer encoder that explicitly captures the key temporal properties of the sequence of sampled frames. End-to-end training of the video feature extractor ensures that the proposed Transformer encoder captures important temporal properties in the video, while the cosine similarity layer explicitly reduces the intra-class variance of videos that belong to the same class. Next, in the few-shot adaptation phase, we use the learned video feature extractor to train a new video classifier from the few available examples of the novel classes.
Results on the SSV2-100 and Kinetics-100 benchmarks show that our proposed few-shot video classifier outperforms meta-learning-based methods and achieves state-of-the-art accuracy. We also show that our method can easily distinguish between actions and their inverses (for example, picking something up vs. putting something down), while prior art, which averages image feature vectors, is unable to do so.
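
As a rough illustration of three ideas in the abstract above (sparse frame sampling, frame averaging, and a cosine similarity classifier head), here is a minimal pure-Python sketch. It is not the authors' implementation. The function names, the segment-midpoint sampling rule, and the `scale` factor are assumptions made for illustration, since the abstract does not specify them, and the Transformer encoder itself is omitted.

```python
import math


def sparse_sample_indices(num_frames, num_segments):
    """Split the video into equal segments and take each segment's middle frame.

    Assumption: segment-based sampling; the abstract does not name the strategy.
    """
    if num_frames <= 0 or num_segments <= 0:
        raise ValueError("num_frames and num_segments must be positive")
    segment_length = num_frames / num_segments
    return [
        min(num_frames - 1, int(segment_length * (i + 0.5)))
        for i in range(num_segments)
    ]


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps guards against zero norm)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]


def mean_pool(frame_features):
    """Average frame-level features; the result does not depend on frame order."""
    count = len(frame_features)
    return [sum(column) / count for column in zip(*frame_features)]


def cosine_logits(video_feature, class_weights, scale=10.0):
    """Score each class by the scaled cosine between feature and class weight.

    Assumption: `scale` is a hypothetical temperature-like factor.
    """
    feature = l2_normalize(video_feature)
    return [
        scale * sum(a * b for a, b in zip(feature, l2_normalize(w)))
        for w in class_weights
    ]
```

Because `mean_pool` returns the same vector for a frame sequence and its reverse, a classifier built on averaged features receives identical input for an action and its inverse. The cosine head depends only on the direction of the feature and weight vectors, not their magnitudes.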

Shuffle and Attend: Video Domain Adaptation

We address the problem of domain adaptation in videos for the task of human action recognition. Inspired by image-based domain adaptation, we can perform video adaptation by aligning the features of frames or clips of source and target videos. However, aligning all clips equally is sub-optimal, as not all clips are informative for the task. As the first novelty, we propose an attention mechanism that focuses on the more discriminative clips and directly optimizes for video-level (cf. clip-level) alignment. Because the backgrounds often differ greatly between source and target, a model corrupted by source backgrounds adapts poorly to target-domain videos. To alleviate this, as a second novelty, we propose to use clip order prediction as an auxiliary task. The clip order prediction loss, when combined with the domain adversarial loss, encourages learning representations that focus on the humans and objects involved in the actions, rather than on the uninformative backgrounds that differ widely between source and target. We empirically show that both components contribute positively to adaptation performance. We report state-of-the-art performance on two of three challenging public benchmarks; of the three, two are based on the UCF and HMDB datasets and one adapts from Kinetics to NEC-Drone. We also support our intuitions and the reported results with qualitative results.
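
The two components described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the paper's implementation: the attention scores are passed in as plain numbers (in the method they would come from a learned module), the order-prediction target is encoded as an index into the list of all permutations, and the helper names `attend` and `make_order_example` are hypothetical.

```python
import itertools
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(clip_features, clip_scores):
    """Attention-weighted sum of clip features into one video-level feature.

    Assumption: clip_scores stand in for the output of a learned scorer.
    """
    weights = softmax(clip_scores)
    dim = len(clip_features[0])
    return [
        sum(w * feature[d] for w, feature in zip(weights, clip_features))
        for d in range(dim)
    ]


def make_order_example(clips, rng):
    """Shuffle clips and return (shuffled_clips, label) for order prediction.

    The label is the index of the applied permutation, so an auxiliary
    classifier can be trained to recover how the clips were reordered.
    """
    permutations = list(itertools.permutations(range(len(clips))))
    label = rng.randrange(len(permutations))
    shuffled = [clips[i] for i in permutations[label]]
    return shuffled, label
```

With equal scores, `attend` reduces to plain averaging; raising one clip's score shifts the video-level feature toward that clip. The number of permutation classes grows factorially with the number of clips, so this label encoding is practical only for a small number of clips per video.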

Unsupervised and Semi-Supervised Domain Adaptation for Action Recognition from Drones

We address the problem of human action classification in drone videos. Because capturing and labeling large-scale drone videos with diverse actions is costly, we present unsupervised and semi-supervised domain adaptation approaches that leverage both existing fully annotated action recognition datasets and unannotated (or only sparsely annotated) videos from drones. To study the emerging problem of drone-based action recognition, we create a new dataset, NEC-DRONE, containing 5,250 videos to evaluate the task. We tackle both problem settings, with 1) the same and 2) different action label sets for the source domain (e.g., the Kinetics dataset) and the target domain (drone videos). We present a combination of video-based and instance-based adaptation methods, paired with either a classifier or an embedding-based framework, to transfer knowledge from source to target. Our results show that the proposed adaptation approach substantially improves performance on these challenging and practical tasks. We further demonstrate the applicability of our method to learning cross-view action recognition on the Charades-Ego dataset. We provide qualitative analysis to understand the behaviors of our approaches.