Field Tests of Impulsive Acoustic Event Detection, Localization, and Classification Over Telecom Fiber Networks

We report distributed-fiber-optic-sensing results on impulsive acoustic events localization/classification over telecom networks. A deep-learning-based model was trained to classify starter-gun and fireworks signatures with high accuracy of > 99% using fiber-based-signal-enhancer and >97% using aerial coils.

Evolution of Fiber Infrastructure – From Data Transmission to Network Sensing

We review multiple use cases over deployed networks including co-existing sensing/data transmission, cable cut prevention and perimeter intrusion detection to realize telecom infrastructure can be sensing backbones instead of the sole function of data transmission.

Mosaic: Leveraging Diverse Reflector Geometries for Omnidirectional Around-Corner Automotive Radar

A large number of traffic collisions occur as a result of obstructed sight lines, such that even an advanced driver assistance system would be unable to prevent the crash. Recent work has proposed the use of around-the-corner radar systems to detect vehicles, pedestrians, and other road users in these occluded regions. Through comprehensive measurement, we show that these existing techniques cannot sense occluded moving objects in many important real-world scenarios. To solve this problem of limited coverage, we leverage multiple, curved reflectors to provide comprehensive coverage over the most important locations near an intersection. In scenarios where curved reflectors are insufficient, we evaluate the relative benefits of using additional flat planar surfaces. Using these techniques, we more than double the probability of detecting a vehicle near the intersection in three real urban locations and enable NLoS radar sensing using an entirely new class of reflectors.

StyleT2I: Towards Compositional and High-Fidelity Text-to-Image Synthesis

Although progress has been made for text-to-image synthesis, previous methods fall short of generalizing to unseen or underrepresented attribute compositions in the input text. Lacking compositionality could have severe implications for robustness and fairness, e.g., inability to synthesize the face images of underrepresented demographic groups. In this paper, we introduce a new framework, StyleT2I, to improve the compositionality of text-to-image synthesis. Specifically, we propose a CLIP-guided Contrastive Loss to better distinguish different compositions among different sentences. To further improve the compositionality, we design a novel Semantic Matching Loss and a Spatial Constraint to identify attributes’ latent directions for intended spatial region manipulations, leading to better disentangled latent representations of attributes. Based on the identified latent directions of attributes, we propose Compositional Attribute Adjustment to adjust the latent code, resulting in better compositionality of image synthesis. In addition, we leverage the l2 -norm regularization of identified latent directions (norm penalty) to strike a nice balance between image-text alignment and image fidelity. In the experiments, we devise a new dataset split and an evaluation metric to evaluate the compositionality of text-to-image synthesis models. The results show that StyleT2I outperforms previous approaches in terms of the consistency between the input text and synthesized images and achieves higher fidelity

Chimera: Context-Aware Splittable Deep Multitasking Models for Edge Intelligence

Design of multitasking deep learning models has mostly focused on improving the accuracy of the constituent tasks, but the challenges of efficiently deploying such models in a device-edge collaborative setup (that is common in 5G deployments) has not been investigated. Towards this end, in this paper, we propose an approach called Chimera 1 for training (done Offline) and deployment (done Online) of multitasking deep learning models that are splittable across the device and edge. In the offline phase, we train our multi-tasking setup such that features from a pre-trained model for one of the tasks (called the Primary task) are extracted and task-specific sub-models are trained to generate the other (Secondary) tasks’ outputs through a knowledge distillation like training strategy to mimic the outputs of pre-trained models for the tasks. The task-specific sub-models are designed to be significantly lightweight than the original pre-trained models for the Secondary tasks. Once the sub-models are trained, during deployment, for given deployment context, characterized by the configurations, we search for the optimal (in terms of both model performance and cost) deployment strategy for the generated multitasking model, through finding one or multiple suitable layer(s) for splitting the model, so that inference workloads are distributed between the device and the edge server and the inference is done in a collaborative manner. Extensive experiments on benchmark computer vision tasks demonstrate that Chimera generates splittable multitasking models that are at least ~ 3 x parameter efficient than the existing such models, and the end-to-end device-edge collaborative inference becomes ~ 1.35 x faster with our choice of context-aware splitting decisions.

Weakly But Deeply Supervised Occlusion-Reasoned Parametric Road Layouts

We propose an end-to-end network that takes a single perspective RGB image of a complex road scene as input, to produce occlusion-reasoned layouts in perspective space as well as a parametric bird’s-eye-view (BEV) space. In contrast to prior works that require dense supervision such as semantic labels in perspective view, our method only requires human annotations for parametric attributes that are cheaper and less ambiguous to obtain. To solve this challenging task, our design is comprised of modules that incorporate inductive biases to learn occlusion-reasoning, geometric transformation and semantic abstraction, where each module may be supervised by appropriately transforming the parametric annotations. We demonstrate how our design choices and proposed deep supervision help achieve meaningful representations and accurate predictions. We validate our approach on two public datasets, KITTI and NuScenes, to achieve state-of-the-art results with considerably less human supervision.

Self-supervised Video Representation Learning with Cascade Positive Retrieval

Self-supervised video representation learning has been shown to effectively improve downstream tasks such as video retrieval and action recognition. In this paper, we present the Cascade Positive Retrieval (CPR) that successively mines positive examples w.r.t. the query for contrastive learning in a cascade of stages. Specifically, CPR exploits multiple views of a query example in different modalities, where an alternative view may help find another positive example dissimilar in the query view. We explore the effects of possible CPR configurations in ablations including the number of mining stages, the top similar example selection ratio in each stage, and progressive training with an incremental number of the final Top-k selection. The overall mining quality is measured to reflect the recall across training set classes. CPR reaches a median class mining recall of 83.3%, outperforming previous work by 5.5%. Implementation-wise, CPR is complementary to pretext tasks and can be easily applied to previous work. In the evaluation of pretraining on UCF101, CPR consistently improves existing work and even achieves state-of-the-art R@1 of 56.7% and 24.4% in video retrieval as well as 83.8% and 54.8% in action recognition on UCF101 and HMDB51. The code is available at https://github.com/necla-ml/CPR.

On Generalizing Beyond Domains in Cross-Domain Continual Learning

Humans have the ability to accumulate knowledge of new tasks in varying conditions, but deep neural networks of-ten suffer from catastrophic forgetting of previously learned knowledge after learning a new task. Many recent methods focus on preventing catastrophic forgetting under the assumption of train and test data following similar distributions. In this work, we consider a more realistic scenario of continual learning under domain shifts where the model must generalize its inference to an unseen domain. To this end, we encourage learning semantically meaningful features by equipping the classifier with class similarity metrics as learning parameters which are obtained through Mahalanobis similarity computations. Learning of the backbone representation along with these extra parameters is done seamlessly in an end-to-end manner. In addition, we propose an approach based on the exponential moving average of the parameters for better knowledge distillation. We demonstrate that, to a great extent, existing continual learning algorithms fail to handle the forgetting issue under multiple distributions, while our proposed approach learns new tasks under domain shift with accuracy boosts up to 10% on challenging datasets such as DomainNet and OfficeHome.

MM-TTA: Multi-Modal Test-Time Adaptation for 3D Semantic Segmentation

Test-time adaptation approaches have recently emerged as a practical solution for handling domain shift without access to the source domain data. In this paper, we propose and explore a new multi-modal extension of test-time adaptation for 3D semantic segmentation. We find that, directly applying existing methods usually results in performance instability at test time, because multi-modal input is not considered jointly. To design a framework that can take full advantage of multi-modality, where each modality provides regularized self-supervisory signals to other modalities, we propose two complementary modules within and across the modalities. First, Intra-modal Pseudo-label Generation (Intra-PG) is introduced to obtain reliable pseudo labels within each modality by aggregating information from two models that are both pre-trained on source data but updated with target data at different paces. Second, Inter-modal Pseudo-label Refinement (Inter-PR) adaptively selects more reliable pseudo labels from different modalities based on a proposed consistency scheme. Experiments demonstrate that our regularized pseudo labels produce stable self-learning signals in numerous multi-modal test-time adaptation scenarios for 3D semantic segmentation. Visit our project website at https://www.nec-labs.com/~mas/MM-TTA

Learning to Learn across Diverse Data Biases in Deep Face Recognition

Convolutional Neural Networks have achieved remarkable success in face recognition, in part due to the abundant availability of data. However, the data used for training CNNs is often imbalanced. Prior works largely focus on the long-tailed nature of face datasets in data volume per identity or focus on single bias variation. In this paper, we show that many bias variations such as ethnicity, head pose, occlusion and blur can jointly affect the accuracy significantly. We propose a sample level weighting approach termed Multi-variation Cosine Margin (MvCoM), to simultaneously consider the multiple variation factors, which orthogonally enhances the face recognition losses to incorporate the importance of training samples. Further, we leverage a learning to learn approach, guided by a held-out meta learning set and use an additive modeling to predict the MvCoM. Extensive experiments on challenging face recognition benchmarks demonstrate the advantages of our method in jointly handling imbalances due to multiple variations.