Cosine Similarity based Few-Shot Video Classifier with Attention-based Aggregation

Meta learning algorithms for few-shot video recognition use complex, episodic training but they often fail to learn effective feature representations. In contrast, we propose a new and simpler few-shot video recognition method that does not use meta-learning, but its performance compares well with the best meta-learning proposals. Our new few-shot video classification pipeline consists of two distinct phases. In the pre-training phase, we learn a good video feature extraction network that generates a feature vector for each video. After a sparse sampling strategy selects frames from the video, we generate a video feature vector from the sampled frames. Our proposed video feature extractor network, which consists of an image feature extraction network followed by a new transformer encoder, is trained end-to-end by including a classifier head that uses cosine similarity layer instead of the traditional linear layer to classify a corpus of labeled video examples. Unlike prior work in meta learning, we do not use episodic training to learn the image feature vector. Also, unlike prior work that averages frame-level feature vectors into a single video feature vector, we combine individual frame-level feature vectors by using a new Transformer encoder that explicitly captures the key, temporal properties in the sequence of sampled frames. End-to-end training of the video feature extractor ensures that the proposed Transformer encoder captures important temporal properties in the video, while the cosine similarity layer explicitly reduces the intra-class variance of videos that belong to the same class. Next, in the few-shot adaptation phase, we use the learned video feature extractor to train a new video classifier by using the few available examples from novel classes. Results on SSV2-100 and Kinetics-100 benchmarks show that our proposed few-shot video classifier outperforms the meta-learning-based methods and achieves the best state-of-the-art accuracy. We also show that our method can easily discern between actions and their inverse (for example, picking something up vs. putting something down), while prior art, which averages image feature vectors, is unable to do so.

Application-specific, Dynamic Reservation of 5G Compute and Network Resources by using Reinforcement Learning

5G services and applications explicitly reserve compute and network resources in today’s complex and dynamic infrastructure of multi-tiered computing and cellular networking to ensure application-specific service quality metrics, and the infrastructure providers charge the 5G services for the resources reserved. A static, one-time reservation of resources at service deployment typically results in extended periods of under-utilization of reserved resources during the lifetime of the service operation. This is due to a plethora of reasons like changes in content from the IoT sensors (for example, change in number of people in the field of view of a camera) or a change in the environmental conditions around the IoT sensors (for example, time of the day, rain or fog can affect data acquisition by sensors). Under-utilization of a specific resource like compute can also be due to temporary inadequate availability of another resource like the network bandwidth in a dynamic 5G infrastructure. We propose a novel Reinforcement Learning-based online method to dynamically adjust an application’s compute and network resource reservations to minimize under-utilization of requested resources, while ensuring acceptable service quality metrics. We observe that a complex application-specific coupling exists between the compute and network usage of an application. Our proposed method learns this coupling during the operation of the service, and dynamically modulates the compute and network resource requests to mimimize under-utilization of reserved resources. Through experimental evaluation using real-world video analytics application, we show that our technique is able to capture complex compute-network coupling relationship in an online manner i.e. while the application is running, and dynamically adapts and saves up to 65% compute and 93% network resources on average (over multiple runs), without significantly impacting application accuracy.

Check out our Computervision Papers at #CVPR2022

The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR) is the premier annual computer vision event comprising the main conference and several co-located workshops and short courses. NEC Laboratories Media Analytics department is excited to be presenting at CVPR 2022, being held in New Orleans, LA from Sunday, June 19 to Friday, June 24, 2022.

Towards Learning Disentangled Representations for Time Series

Promising progress has been made toward learning efficient time series representations in recent years, but the learned representations often lack interpretability and do not encode semantic meanings by the complex interactions of many latent factors. Learning representations that disentangle these latent factors can bring semantic-rich representations of time series and further enhance interpretability. However, directly adopting the sequential models, such as Long Short-Term Memory Variational AutoEncoder (LSTM-VAE), would encounter a Kullback?Leibler (KL) vanishing problem: the LSTM decoder often generates sequential data without efficiently using latent representations, and the latent spaces sometimes could even be independent of the observation space. And traditional disentanglement methods may intensify the trend of KL vanishing along with the disentanglement process, because they tend to penalize the mutual information between the latent space and the observations. In this paper, we propose Disentangle Time-Series, a novel disentanglement enhancement framework for time series data. Our framework achieves multi-level disentanglement by covering both individual latent factors and group semantic segments. We propose augmenting the original VAE objective by decomposing the evidence lower-bound and extracting evidence linking factorial representations to disentanglement. Additionally, we introduce a mutual information maximization term between the observation space to the latent space to alleviate the KL vanishing problem while preserving the disentanglement property. Experimental results on five real-world IoT datasets demonstrate that the representations learned by DTS achieve superior performance in various tasks with better interpretability.

CAT: Beyond Efficient Transformer for Content-Aware Anomaly Detection in Event Sequences

It is critical and important to detect anomalies in event sequences, which becomes widely available in many application domains. Indeed, various efforts have been made to capture abnormal patterns from event sequences through sequential pattern analysis or event representation learning. However, existing approaches usually ignore the semantic information of event content. To this end, in this paper, we propose a self-attentive encoder-decoder transformer framework, Content-Aware Transformer CAT, for anomaly detection in event sequences. In CAT, the encoder learns preamble event sequence representations with content awareness, and the decoder embeds sequences under detection into a latent space, where anomalies are distinguishable. Specifically, the event content is first fed to a content-awareness layer, generating representations of each event. The encoder accepts preamble event representation sequence, generating feature maps. In the decoder, an additional token is added at the beginning of the sequence under detection, denoting the sequence status. A one-class objective together with sequence reconstruction loss is collectively applied to train our framework under the label efficiency scheme. Furthermore, CAT is optimized under a scalable and efficient setting. Finally, extensive experiments on three real-world datasets demonstrate the superiority of CAT.

A Deep Learning Framework for Detecting and Localizing Abnormal Pedestrian Behaviors at Grade Crossings

This paper presents a deep learning-based framework to detect and localize the pedestrians’ anomaly behaviors in videos captured at the grade crossing. A skeleton detection and tracking algorithm are employed to capture the key point trajectories of body movements of the pedestrians. A deep recurrent neural network is applied to learn the normal patterns of pedestrians’ movements using dynamics skeleton trajectories features. An anomaly behaviors detection and localization algorithm are developed by analyzing each pedestrian’s reconstructed trajectories. In the experiments, a video dataset involving normal pedestrian behaviors is established by collecting data at multiple grade crossing spots with different camera angles. Then the proposed framework is trained on the dataset to learn the regularity patterns of normal pedestrians and localize the anomaly behaviors during the testing phase. To the best of our knowledge, it is the first attempt to analyze pedestrians’ behavior at a grade crossing. The experimental results show that the proposed framework can detect and localize the anomaly behaviors, such as squatting down, lingering, and other behaviors that may cause safety issues at the grade crossing. Our study also points out the direction for further improvement of the present development to meet the need for real-world applications.

T-Cell Receptor-Peptide Interaction Prediction with Physical Model Augmented Pseudo-Labeling

Predicting the interactions between T-cell receptors (TCRs) and peptides is crucial for the development of personalized medicine and targeted vaccine in immunotherapy. Current datasets for training deep learning models of this purpose remain constrained without diverse TCRs and peptides. To combat the data scarcity issue presented in the current datasets, we propose to extend the training dataset by physical modeling of TCR-peptide pairs. Specifically, we compute the docking energies between auxiliary unknown TCR-peptide pairs as surrogate training labels. Then, we use these extended example-label pairs to train our model in a supervised fashion. Finally, we find that the AUC score for the prediction of the model can be further improved by pseudo-labeling of such unknown TCR-peptide pairs (by a trained teacher model), and re-training the model with those pseudo-labeled TCR-peptide pairs. Our proposed method that trains the deep neural network with physical modeling and data-augmented pseudo-labeling improves over baselines in the available two datasets. We also introduce a new dataset that contains over 80,000 unknown TCR-peptide pairs with docking energy scores.

Rain Intensity Detection and Classification with Pre-existing Telecom Fiber Cables

For the first time, we demonstrate detection and classification of rain intensity using Distributed Acoustic Sensing (DAS). An artificial neural network was applied for rain intensity classification and high precision of over 96% was achieved.

Template Matching Method with Distributed Acoustic Sensing Data and Simulation Data

We propose a new method to detect acoustic signals by matching distributed acoustic sensing data with simulation. In the simulation of the dynamic strain on an optical fiber, the optical fiber layouts and the gauge length are properly incorporated. We apply the proposed method to the acoustic-source localization and demonstrate the method localizes the source accurately even under the layouts which include the straight optical fiber for the sensing points with the large gauge-length settings.

Simultaneous Fiber Sensing and Communications

We review recent advances aimed at increasing the reach of distributed fiber optic sensing with simultaneous data transmission. We review two methods based on measurement of accumulated phase on telecom signals, and chirp-pulsed DAS with inline amplification and frequency diversity.