Posts

Pseudo RGB-D for Self-Improving Monocular SLAM and Depth Prediction

Classical monocular Simultaneous Localization And Mapping (SLAM) and the recently emerging convolutional neural networks (CNNs) for monocular depth prediction represent two largely disjoint approaches towards building a 3D map of the surrounding environment. In this paper, we demonstrate that the coupling of these two by leveraging the strengths of each mitigates the other’s shortcomings. Specifically, we propose a joint narrow and wide baseline based self-improving framework, where on the one hand the CNN-predicted depth is leveraged to perform $ extit(Unknown sysvar: (pseudo))$ RGB-D feature-based SLAM, leading to better accuracy and robustness than the monocular RGB SLAM baseline. On the other hand, the bundle-adjusted 3D scene structures and camera poses from the more principled geometric SLAM are injected back into the depth network through novel wide baseline losses proposed for improving the depth prediction network, which then continues to contribute towards better pose and 3D structure estimation in the next iteration. We emphasize that our framework only requires $ extit(Unknown sysvar: ( unlabeled monocular))$ videos in both training and inference stages, and yet is able to outperform state-of-the-art self-supervised $ extit(Unknown sysvar: (monocular))$ and $ extit(Unknown sysvar: (stereo))$ depth prediction networks (e.g, Monodepth2) and feature based monocular SLAM system (i.e, ORB-SLAM). Extensive experiments on KITTI and TUM RGB-D datasets verify the superiority of our self-improving geometry-CNN framework.

Domain Adaptive Semantic Segmentation using Weak Labels

We propose a novel framework for domain adaptation in semantic segmentation with image-level weak labels in the target domain. The weak labels may be obtained based on a model prediction for unsupervised domain adaptation (UDA), or from a human oracle in a new weakly-supervised domain adaptation (WDA) paradigm for semantic segmentation. Using weak labels is both practical and useful, since (i) collecting image-level target annotations is comparably cheap in WDA and incurs no cost in UDA, and (ii) it opens the opportunity for category-wise domain alignment. Our framework uses weak labels to enable the interplay between feature alignment and pseudo-labeling, improving both in the process of domain adaptation. Specifically, we develop a weak-label classification module to enforce the network to attend to certain categories, and then use such training signals to guide the proposed category-wise alignment method. In experiments, we show considerable improvements with respect to the existing state-of-the-arts in UDA and present a new benchmark in the WDA setting.

Improving Face Recognition by Clustering Unlabeled Faces in the Wild

While deep face recognition has benefited significantly from large-scale labeled data, current research is focused on leveraging unlabeled data to further boost performance, reducing the cost of human annotation. Prior work has mostly been in controlled settings, where the labeled and unlabeled data sets have no overlapping identities by construction. This is not realistic in large-scale face recognition, where one must contend with such overlaps, the frequency of which increases with the volume of data. Ignoring identity overlap leads to significant labeling noise, as data from the same identity is split into multiple clusters. To address this, we propose a novel identity separation method based on extreme value theory. It is formulated as an out-of-distribution detection algorithm, and greatly reduces the problems caused by overlapping-identity label noise. Considering cluster assignments as pseudo-labels, we must also overcome the labeling noise from clustering errors. We propose a modulation of the cosine loss, where the modulation weights correspond to an estimate of clustering uncertainty. Extensive experiments on both controlled and real settings demonstrate our method’s consistent improvements over supervised baselines, e.g., 11.6% improvement on IJB-A verification.

Learning Monocular Visual Odometry via Self-Supervised Long-Term Modeling

Monocular visual odometry (VO) suffers severely from error accumulation during frame-to-frame pose estimation. In this paper, we present a self-supervised learning method for VO with special consideration for consistency over longer sequences. To this end, we model the long-term dependency in pose prediction using a pose network that features a two-layer convolutional LSTM module. We train the networks with purely self-supervised losses, including a cycle consistency loss that mimics the loop closure module in geometric VO. Inspired by prior geometric systems, we allow the networks to see beyond a small temporal window during training, through a novel a loss that incorporates temporally distant ( g $O(100)$) frames. Given GPU memory constraints, we propose a stage-wise training mechanism, where the first stage operates in a local time window and the second stage refines the poses with a “global” loss given the first stage features. We demonstrate competitive results on several standard VO datasets, including KITTI and TUM RGB-D.

Object Detection with a Unified Label Space from Multiple Datasets

Given multiple datasets with different label spaces, the goal of this work is to train a single object detector predicting over the union of all the label spaces. The practical benefits of such an object detector are obvious and significant—application-relevant categories can be picked and merged form arbitrary existing datasets. However, naive merging of datasets is not possible in this case, due to inconsistent object annotations. Consider an object category like faces that is annotated in one dataset, but is not annotated in another dataset, although the object itself appears in the later’s images. Some categories, like face here, would thus be considered foreground in one dataset, but background in another. To address this challenge, we design a framework which works with such partial annotations, and we exploit a pseudo labeling approach that we adapt for our specific case. We propose loss functions that carefully integrate partial but correct annotations with complementary but noisy pseudo labels. Evaluation in the proposed novel setting requires full annotation on the test set. We collect the required annotations and define a new challenging experimental setup for this task based on existing public datasets. We show improved performances compared to competitive baselines and appropriate adaptations of existing work

SMART: Simultaneous Multi-Agent Recurrent Trajectory Prediction

We propose advances that address two key challenges in future trajectory prediction: (i) multimodality in both training data and predictions and (ii) constant time inference regardless of number of agents. Existing trajectory predictions are fundamentally limited by lack of diversity in training data, which is difficult to acquire with sufficient coverage of possible modes. Our first contribution is an automatic method to simulate diverse trajectories in the top-view. It uses pre-existing datasets and maps as initialization, mines existing trajectories to represent realistic driving behaviors and uses a multi-agent vehicle dynamics simulator to generate diverse new trajectories that cover various modes and are consistent with scene layout constraints. Our second contribution is a novel method that generates diverse predictions while accounting for scene semantics and multi-agent interactions, with constant-time inference independent of the number of agents. We propose a convLSTM with novel state pooling operations and losses to predict scene-consistent states of multiple agents in a single forward pass, along with a CVAE for diversity. We validate our proposed multi-agent trajectory prediction approach by training and testing on the proposed simulated dataset and existing real datasets of traffic scenes. In both cases, our approach outperforms SOTA methods by a large margin, highlighting the benefits of both our diverse dataset simulation and constant-time diverse trajectory prediction methods.”

Improving Face Recognition by Clustering Unlabeled Faces in the Wild (arXiv)

Read Improving Face Recognition by Clustering Unlabeled Faces in the Wild (arXiv). While deep face recognition has benefited significantly from large scale labeled data, current research is focused on leveraging unlabeled data to further boost performance, reducing the cost of human annotation. Prior work has mostly been in controlled settings, where the labeled and unlabeled data sets have no overlapping identities by construction. This is not realistic in large scale face recognition, where one must contend with such overlaps, the frequency of which increases with the volume of data. Ignoring identity overlap leads to significant labeling noise, as data from the same identity is split into multiple clusters. To address this, we propose a novel identity separation method based on extreme value theory. It is formulated as an out of distribution detection algorithm, and greatly reduces the problems caused by overlapping identity label noise. Considering cluster assignments as pseudo labels, we must also overcome the labeling noise from clustering errors. We propose a modulation of the cosine loss, where the modulation weights correspond to an estimate of clustering uncertainty. Extensive experiments on both controlled and real settings demonstrate our method’s consistent improvements over supervised baselines, e.g., 11.6% improvement on IJB A verification.

Understanding Road Layout from Videos as a Whole

In this paper, we address the problem of inferring the layout of complex road scenes from video sequences. To this end, we formulate it as a top-view road attributes prediction problem and our goal is to predict these attributes for each frame both accurately and consistently. In contrast to prior work, we exploit the following three novel aspects: leveraging camera motions in videos, including context cues and incorporating long-term video information. Specifically, we introduce a model that aims to enforce prediction consistency in videos. Our model consists of one LSTM and one Feature Transform Module (FTM). The former implicitly incorporates the consistency constraint with its hidden states, and the latter explicitly takes the camera motion into consideration when aggregating information along videos. Moreover, we propose to incorporate context information by introducing road participants, e.g. objects, into our model. When the entire video sequence is available, our model is also able to encode both local and global cues, e.g. information from both past and future frames. Experiments on two data sets show that: (1) Incorporating either global or contextual cues improves the prediction accuracy and leveraging both gives the best performance. (2) Introducing the LSTM and FTM modules improves the prediction consistency in videos. (3) The proposed method outperforms the SOTA by a large margin.

Towards Universal Representation Learning for Deep Face Recognition

Recognizing wild faces is extremely hard as they appear with all kinds of variations. Traditional methods either train with specifically annotated variation data from target domains, or by introducing unlabeled target variation data to adapt from the training data. Instead, we propose a universal representation learning framework that can deal with larger variation unseen in the given training data without leveraging target domain knowledge. We firstly synthesize training data alongside some semantically meaningful variations, such as low resolution, occlusion and head pose. However, directly feeding the augmented data for training will not converge well as the newly introduced samples are mostly hard examples. We propose to split the feature embedding into multiple sub-embeddings, and associate different confidence values for each sub-embedding to smooth the training procedure. The sub-embeddings are further decorrelated by regularizing variation classification loss and variation adversarial loss on different partitions of them. Experiments show that our method achieves top performance on general face recognition datasets such as LFW and MegaFace, while significantly better on extreme benchmarks such as TinyFace and IJB-S.

Peek-a-boo: Occlusion Reasoning in Indoor Scenes with Plane Representations

We address the challenging task of occlusion-aware indoor 3D scene understanding. We represent scenes by a set of planes, where each one is defined by its normal, offset and two masks outlining (i) the extent of the visible part and (ii) the full region that consists of both visible and occluded parts of the plane. We infer these planes from a single input image with a novel neural network architecture. It consists of a two-branch category-specific module that aims to predict layout and objects of the scene separately so that different types of planes can be handled better. We also introduce a novel loss function based on plane warping that can leverage multiple views at training time for improved occlusion-aware reasoning. In order to train and evaluate our occlusion-reasoning model, we use the ScanNet dataset and propose (i) a strategy to automatically extract ground truth for both visible and hidden regions and (ii) a new evaluation metric that specifically focuses on the prediction in hidden regions. We empirically demonstrate that our proposed approach can achieve higher accuracy for occlusion reasoning compared to competitive baselines on the ScanNet dataset, e.g. 42.65% relative improvement on hidden regions.