The Trade-off between Scanning Beam Penetration and Transmission Beam Gain in mmWave Beam Alignment

Beam search algorithms have been proposed to align the beams from an access point to a user equipment. The process relies on sending beams from a set of scanning beams (SB) and tailoring a transmission beam (TB) using the received feedback. In this paper, we discuss a fundamental trade-off between the gain of SBs and TBs. The higher the gain of an SB, the better the penetration of the SB and the higher the gain of the TB the better the communication link performance. However, TB depends on the set of SBs and by increasing the coverage of each SB and in turn reducing its penetration, there is more opportunity to find a sharper TB to increase its beamforming gain. We define a quantitative measure for such trade-off in terms of a trade-off curve. We introduce SB set design namely Tulip design and formally prove it achieves this fundamental trade-off curve for channels with a single dominant path. We also find closed-form solutions for the trade-off curve for special cases and provide an algorithm with its performance evaluation results to find the trade-off curve revealing the need for further optimization on the SB sets in the state-of-the-art beam search algorithms.

Single-Stream Multi-level Alignment for Vision-Language Pretraining

Self-supervised vision-language pretraining from pure images and text with a contrastive loss is effective, but ignores fine-grained alignment due to a dual-stream architecture that aligns image and text representations only on a global level. Earlier, supervised, non-contrastive methods were capable of finer-grained alignment, but required dense annotations that were not scalable. We propose a single stream architecture that aligns images and language at multiple levels: global, fine-grained patch-token, and conceptual/semantic, using two novel tasks: symmetric cross-modality reconstruction (XMM) and a pseudo-labeled key word prediction (PSL). In XMM, we mask input tokens from one modality and use cross-modal information to reconstruct the masked token, thus improving fine-grained alignment between the two modalities. In PSL, we use attention to select keywords in a caption, use a momentum encoder to recommend other important keywords that are missing from the caption but represented in the image, and then train the visual encoder to predict the presence of those keywords, helping it learn semantic concepts that are essential for grounding a textual token to an image region. We demonstrate competitive performance and improved data efficiency on image-text retrieval, grounding, visual question answering/reasoning against larger models and models trained on more data. Code and models available at zaidkhan.me/SIMLA.

Learning Semantic Segmentation from Multiple Datasets with Label Shifts

While it is desirable to train segmentation models on an aggregation of multiple datasets, a major challenge is that the label space of each dataset may be in conflict with one another. To tackle this challenge, we propose UniSeg, an effective and model-agnostic approach to automatically train segmentation models across multiple datasets with heterogeneous label spaces, without requiring any manual relabeling efforts. Specifically, we introduce two new ideas that account for conflicting and co-occurring labels to achieve better generalization performance in unseen domains. First, we identify a gradient conflict in training incurred by mismatched label spaces and propose a class-independent binary cross-entropy loss to alleviate such label conflicts. Second, we propose a loss function that considers class-relationships across datasets for a better multi-dataset training scheme. Extensive quantitative and qualitative analyses on road-scene datasets show that UniSeg improves over multi-dataset baselines, especially on unseen datasets, e.g., achieving more than 8%p gain in IoU on KITTI. Furthermore, UniSeg achieves 39.4% IoU on the WildDash2 public benchmark, making it one of the strongest submissions in the zero-shot setting. Our project page is available at https://www.nec-labs.com/~mas/UniSeg.

Learning Phase Mask for Privacy-Preserving Passive Depth Estimation

With over a billion sold each year, cameras are not only becoming ubiquitous, but are driving progress in a wide range of domains such as mixed reality, robotics, and more. However, severe concerns regarding the privacy implications of camera-based solutions currently limit the range of environments where cameras can be deployed. The key question we address is: Can cameras be enhanced with a scalable solution to preserve users’ privacy without degrading their machine intelligence capabilities? Our solution is a novel end-to-end adversarial learning pipeline in which a phase mask placed at the aperture plane of a camera is jointly optimized with respect to privacy and utility objectives. We conduct an extensive design space analysis to determine operating points with desirable privacy-utility tradeoffs that are also amenable to sensor fabrication and real-world constraints. We demonstrate the first working prototype that enables passive depth estimation while inhibiting face identification.

Exploiting Unlabeled Data with Vision and Language Models for Object Detection

Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets. However, it is prohibitively costly to acquire annotations for thousands of categories at a large scale. We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images, effectively generating pseudo labels for object detection. Starting with a generic and class-agnostic region proposal mechanism, we use vision and language models to categorize each region of an image into any object category that is required for downstream tasks. We demonstrate the value of the generated pseudo labels in two specific tasks, open-vocabulary detection, where a model needs to generalize to unseen object categories, and semi-supervised object detection, where additional unlabeled images can be used to improve the model. Our empirical evaluation shows the effectiveness of the pseudo labels in both tasks, where we outperform competitive baselines and achieve a novel state-of-the-art for open-vocabulary object detection. Our code is available at https://github.com/xiaofeng94/VL-PLM.

Why is the video analytics accuracy fluctuating, and what can we do about it?

It is a common practice to think of a video as a sequence of images (frames), and re-use deep neural network models that are trained only on images for similar analytics tasks on videos. In this paper, we show that this “leap of faith” that deep learning models that work well on images will also work well on videos is actually flawed. We show that even when a video camera is viewing a scene that is not changing in any human-perceptible way, and we control for external factors like video compression and environment (lighting), the accuracy of video analytics application fluctuates noticeably. These fluctuations occur because successive frames produced by the video camera may look similar visually but are perceived quite differently by the video analytics applications. We observed that the root cause for these fluctuations is the dynamic camera parameter changes that a video camera automatically makes in order to capture and produce a visually pleasing video. The camera inadvertently acts as an “unintentional adversary” because these slight changes in the image pixel values in consecutive frames, as we show, have a noticeably adverse impact on the accuracy of insights from video analytics tasks that re-use image-trained deep learning models. To address this inadvertent adversarial effect from the camera, we explore the use of transfer learning techniques to improve learning in video analytics tasks through the transfer of knowledge from learning on image analytics tasks. Our experiments with a number of different cameras, and a variety of different video analytics tasks, show that the inadvertent adversarial effect from the camera can be noticeably offset by quickly re-training the deep learning models using transfer learning. In particular, we show that our newly trained Yolov5 model reduces fluctuation in object detection across frames, which leads to better tracking of objects (∼40% fewer mistakes in tracking). Our paper also provides new directions and techniques to mitigate the camera’s adversarial effect on deep learning models used for video analytics applications.

Field Trials of Vibration Detection, Localization and Classification over Deployed Telecom Fiber Cables

We review sensing fusion results of integrating fiber sensing with video for machine-learning-based localization and classification of impulsive acoustic event detection. Classification accuracy >97% was achieved on aerial coils, and >99% using fiber-based signal enhancers.

Efficient Compression Method for Roadside LiDAR Data

Roadside LiDAR (Light Detection and Ranging) sensors are recently being explored for intelligent transportation systems aiming at safer and faster traffic management and vehicular operations. A key challenge in such systems is to efficiently transfer massive point-cloud data from the roadside LiDAR devices to the edge connected through a 5G network for real-time processing. In this paper, we consider the problem of compressing roadside (i.e. static) LiDAR data in real-time that provides a unique condition unexplored by current methods. Existing point-cloud compression methods assume moving LiDARs (that are mounted on vehicles) and do not exploit spatial consistency across frames over time.To this end, we develop a novel grouped wavelet technique for static roadside LiDAR data compression (i.e. SLiC). Our method compresses LiDAR data both spatially and temporally using a kd-tree data structure based on Haar wavelet coefficients. Experimental results show that SLiC can compress up to 1.9× more effectively than the state-of-the-art compression method can do. Moreover, SLiC is computationally more efficient to achieve 2× improvement in bandwidth usage over the best alternative. Even with this impressive gain in communication and storage efficiency, SLiC retains down-the-pipeline application’s accuracy.

COMPOSER: Compositional Reasoning of Group Activity in Videos with Keypoint-Only Modality

Group Activity Recognition detects the activity collectively performed by a group of actors, which requires compositional reasoning of actors and objects. We approach the task by modeling the video as tokens that represent the multi-scale semantic concepts in the video. We propose COMPOSER, a Multiscale Transformer based architecture that performs attention-based reasoning over tokens at each scale and learns group activity compositionally. In addition, prior works suffer from scene biases with privacy and ethical concerns. We only use the keypoint modality which reduces scene biases and prevents acquiring detailed visual data that may contain private or biased information of users. We improve the multiscale representations in COMPOSER by clustering the intermediate scale representations, while maintaining consistent cluster assignments between scales. Finally, we use techniques such as auxiliary prediction and data augmentations tailored to the keypoint signals to aid model training. We demonstrate the model’s strength and interpretability on two widely-used datasets (Volleyball and Collective Activity). COMPOSER achieves up to +5.4% improvement with just the keypoint modality (Code is available at https://github.com/hongluzhou/composer.).

Unsupervised Anomaly Detection with Self-Training and Knowledge Distillation

Anomaly Detection (AD) aims to find defective patterns or abnormal samples among data, and has been a hot research topic due to various real-world applications. While various AD methods have been proposed, most of them assume the availability of a clean (anomaly-free) training set, which, however, may be hard to guarantee in many real-world industry applications. This motivates us to investigate Unsupervised Anomaly Detection (UAD) in which the training set includes both normal and abnormal samples. In this paper, we address the UAD problem by proposing a Self-Training and Knowledge Distillation (STKD) model. STKD combats anomalies in the training set by iteratively alternating between excluding samples of high anomaly probabilities and training the model with the purified training set. Despite that the model is trained with a cleaner training set, the inevitably existing anomalies may still cause negative impact. STKD alleviates this by regularizing the model to respond similarly to a teacher model which has not been trained with noisy data. Experiments show that STKD consistently produces more robust performance with different levels of anomalies.