We conduct research in computer vision and machine learning, with a focus on sustaining excellence in two main directions: visual recognition and 3D scene understanding. Key applications of our research include visual surveillance and autonomous driving. We tackle fundamental problems in computer vision, such as object detection, semantic segmentation, face recognition, 3D reconstruction and behavior prediction. We develop and leverage breakthroughs in deep learning, particularly in weak supervision, metric learning and domain adaptation.

Prediction and Understanding in Complex Scenes

Behavior Prediction in Complex Scenes
Predicting the future in complex traffic scenes requires accounting for multimodality and making long-term strategic decisions based on history and interactions among multiple agents. Our DESIRE framework generates diverse future samples with a conditional variational auto-encoder; these samples are then ranked and refined by an RNN scoring-and-regression module. A CNN fusion layer incorporates interactions between agents and semantic scene elements. The ranking objective accounts for potential future rewards, similar to inverse optimal control, allowing generalization to new situations farther in the future.

Appears in DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents, CVPR 2017
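
The sample-then-rank idea above can be illustrated in a few lines. This is a toy sketch, not the DESIRE architecture: the decoder below is a hand-written linear map standing in for the trained CVAE decoder RNN, and the scoring function is a hand-picked reward standing in for the learned RNN scoring module.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained CVAE decoder: maps a latent code and the
# observed past trajectory to a predicted future trajectory (T x 2).
# The real DESIRE decoder is an RNN; this linear map is only illustrative.
def decode(z, past, T=12):
    direction = past[-1] - past[-2]                   # last observed velocity
    steps = np.arange(1, T + 1)[:, None]
    return past[-1] + steps * (direction + 0.1 * z)   # latent perturbs heading

def sample_futures(past, K=20, z_dim=2):
    """Draw K diverse future hypotheses by sampling the latent prior."""
    zs = rng.standard_normal((K, z_dim))
    return np.stack([decode(z, past) for z in zs])

def rank_futures(futures, score_fn):
    """Stand-in for the RNN scoring module: order samples by a reward."""
    scores = np.array([score_fn(f) for f in futures])
    order = np.argsort(-scores)
    return futures[order], scores[order]

past = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
futures = sample_futures(past)
# Illustrative reward: prefer trajectories that stay near the lane y = 0.
ranked, scores = rank_futures(futures, lambda f: -np.abs(f[:, 1]).mean())
```

In the full system, the refinement stage would further regress offsets on the top-ranked samples rather than returning them as-is.
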
Scene Recognition with Co-occurrences
Multiple events often occur simultaneously in complex inner-city traffic scenes. Our traffic scene recognition decomposes a scene into high-order scenes that carry semantic meaning and atomic scenes that are determined by the 3D localization of individual objects. Our hierarchical model captures co-occurrence and mutual exclusion relationships while incorporating both low-level trajectory features and high-level scene features, with parameters learned using a structured support vector machine. Our efficient inference exploits the structure of the model to obtain real-time rates.

Appears in Atomic Scenes for Scalable Traffic Scene Recognition in Monocular Videos, WACV 2016
Weakly Supervised Semantic Segmentation
Large-scale training for semantic segmentation is challenging due to the expense of obtaining training data. Given cheaply obtained sparse image labelings, we propagate them to produce guessed dense labelings. A CNN-based segmentation network is trained to mimic these labelings. The label-propagation process is defined using random-walk hitting probabilities, which leads to a differentiable parameterization with uncertainty estimates that are incorporated into our loss. We show that by learning the label propagator jointly with the segmentation predictor, we effectively learn semantic edges given no direct edge supervision.

Appears in Learning random-walk label propagation for weakly-supervised semantic segmentation, CVPR 2017
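
The hitting-probability computation at the core of the propagator can be sketched on a small graph. This is only the classical absorbing-chain solve on fixed affinities; in the paper the transition matrix is parameterized by learned, image-dependent edge weights and the solve is made differentiable.

```python
import numpy as np

def propagate_labels(W, seeds):
    """Hitting-probability label propagation on a graph.

    W     : (n, n) symmetric non-negative affinity matrix.
    seeds : dict {node_index: class_index} of sparse labels.
    Returns an (n, n_classes) matrix of probabilities that a random walk
    started at each node first hits a seed of each class.
    """
    n = W.shape[0]
    classes = sorted(set(seeds.values()))
    P = W / W.sum(axis=1, keepdims=True)          # row-stochastic transitions
    labeled = np.array(sorted(seeds))
    unlabeled = np.array([i for i in range(n) if i not in seeds])
    # One-hot targets at seed nodes.
    Y = np.zeros((len(labeled), len(classes)))
    for r, i in enumerate(labeled):
        Y[r, classes.index(seeds[i])] = 1.0
    # Absorbing-chain solve: (I - P_uu) H_u = P_ul Y.
    Puu = P[np.ix_(unlabeled, unlabeled)]
    Pul = P[np.ix_(unlabeled, labeled)]
    Hu = np.linalg.solve(np.eye(len(unlabeled)) - Puu, Pul @ Y)
    H = np.zeros((n, len(classes)))
    H[labeled] = Y
    H[unlabeled] = Hu
    return H

# Chain graph of 5 nodes: node 0 seeded as class 0, node 4 as class 1.
W = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
H = propagate_labels(W, {0: 0, 4: 1})
```

On this chain the result reduces to the gambler's-ruin probabilities, so the middle node splits evenly between the two classes.
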
Prototype for ADAS and Self-Driving
We implement and test our algorithms for real-time road scene understanding on an automobile platform. Our sensor suite for data acquisition includes LIDAR, cameras, IMU and GPS units. Our algorithms run in real-time, achieving at least 10 frames per second on Nvidia GTX 1080 GPUs. In particular, our focus is on demonstrating that our visual 3D scene understanding and distant future prediction enable applications such as early warning for dangerous situations, surpassing the capabilities of existing ADAS systems.


Object Detection

Scale-Dependent Pooling for Object Detection
We propose two new strategies to detect objects accurately and efficiently using deep convolutional neural networks. Scale-dependent pooling improves detection accuracy by exploiting appropriate convolutional features depending on the scale of candidate object proposals. Cascaded rejection classifiers effectively utilize convolutional features and eliminate negative object proposals in a cascaded manner, which greatly speeds up detection while maintaining high accuracy.

Appears in Exploit All the Layers: Fast and Accurate CNN Object Detector with Scale Dependent Pooling and Cascaded Rejection Classifiers, CVPR 2016
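
The layer-selection rule behind scale-dependent pooling can be sketched as follows: small proposals pool from an early, high-resolution feature map, large proposals from a deeper one. The pixel thresholds here are illustrative placeholders, not the values used in the paper.

```python
def select_pooling_layer(box, thresholds=(64, 128)):
    """Return the index of the conv feature map to ROI-pool from.

    box        : (x1, y1, x2, y2) proposal in image coordinates.
    thresholds : proposal heights (pixels) separating the feature layers;
                 illustrative values only.
    """
    height = box[3] - box[1]
    for layer, t in enumerate(thresholds):
        if height < t:
            return layer          # shallower layer for smaller objects
    return len(thresholds)        # deepest layer for large objects

# A 30-pixel-tall proposal pools from the earliest layer.
layer = select_pooling_layer((0, 0, 30, 30))
```

The cascaded rejection classifiers would then run on the pooled features, discarding easy negatives before the full classifier is evaluated.
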
Data-Driven 3D Voxel Patterns
Our novel object representation, 3D Voxel Pattern (3DVP), jointly encodes the key properties of objects including appearance, 3D shape, viewpoint, occlusion and truncation. We discover 3DVPs in a data-driven way and train a bank of specialized detectors for a dictionary of 3DVPs. The 3DVP detectors are capable of detecting objects with specific visibility patterns and transferring meta-data from the 3DVPs to the detected objects, such as 2D segmentation mask, 3D pose as well as occlusion or truncation boundaries.

Appears in Data-Driven 3D Voxel Patterns for Object Category Recognition, CVPR 2015
Subcategory CNNs for Region Proposals
Generating region proposals is a bottleneck for CNN-based object detection when objects exhibit significant scale variation, occlusion or truncation. We propose a novel region proposal network for CNN-based object detection that uses subcategory information related to object pose to guide the proposal generation process. This leads to a new detection network for joint detection and classification into pose subcategories.

Appears in Subcategory-aware Convolutional Neural Networks for Object Proposals and Detection, WACV 2017


Multi-Target Tracking

Online Multi-Target Tracking
We focus on two key aspects of the multi-target tracking problem. Our Aggregated Local Flow Descriptor encodes the relative motion pattern between a pair of temporally distant detections using long-term interest point trajectories, providing a robust affinity measure for matching detections. We also formulate online tracking as data association between targets and detections in a temporal window, which is efficient and achieves robustness by integrating several other cues such as target dynamics, appearance similarity and long-term trajectory regularization.

Appears in Near-Online Multi-target Tracking with Aggregated Local Flow Descriptor, ICCV 2015
Deep Network Flows for Multi-Target Tracking
The association problem in tracking-by-detection approaches is formulated as network flow optimization using linear programs, for which previous works use hand-crafted or linear cost functions. We propose a novel formulation to learn arbitrarily parameterized but differentiable cost functions. In particular, we use deep neural networks to predict costs and bi-level optimization to minimize a loss defined on the solution of the linear program. Besides eliminating hand-designed costs, this also allows easy integration of various sources of input.

Appears in Deep Network Flow for Multi-Object Tracking, CVPR 2017
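The cost-based association at the heart of this formulation can be sketched in miniature. Here a hand-written Euclidean distance stands in for the deep cost network, and brute-force enumeration stands in for the network-flow linear program; both substitutions are purely illustrative.

```python
import itertools
import numpy as np

def cost_fn(track, detection):
    """Stand-in for the learned cost network: plain Euclidean distance."""
    return float(np.linalg.norm(np.asarray(track) - np.asarray(detection)))

def associate(tracks, detections):
    """Return the detection index assigned to each track (min total cost)."""
    n = len(tracks)
    best, best_cost = None, np.inf
    for perm in itertools.permutations(range(len(detections)), n):
        c = sum(cost_fn(tracks[i], detections[j]) for i, j in enumerate(perm))
        if c < best_cost:
            best, best_cost = perm, c
    return list(best), best_cost

tracks = [(0.0, 0.0), (10.0, 0.0)]
detections = [(10.5, 0.2), (0.3, -0.1)]
assignment, total = associate(tracks, detections)
```

The point of the paper is that `cost_fn` need not be hand-crafted: because the LP solution is differentiable with respect to the costs via bi-level optimization, a neural network producing the costs can be trained end-to-end from association losses.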


Object and Face Recognition

Feature Disentanglement for Pose-Invariant Face Recognition
Face recognition under large pose variations is a persistent challenge due to under-representation in training data. We learn a CNN-based feature representation that is invariant to pose, without requiring extensive pose coverage in training data. We seek a rich embedding that encodes identity features, as well as non-identity ones such as pose and landmarks. Our metric learning explicitly disentangles identity and pose, by demanding alignment between feature reconstructions through various combinations of identity and pose features. We achieve state-of-the-art performance on the Multi-PIE, 300W-LP and CFP datasets.

Domain Adaptation for Recognition in Unlabeled Videos
There remains a clear gap between the performance of image-based and video-based face recognition, due to the vast difference in visual quality between the two domains and the difficulty of curating diverse, large-scale video datasets. We address both challenges through an image-to-video feature-level domain adaptation approach that learns discriminative video frame representations. We use only large-scale unlabeled video data to reduce the domain gap while transferring discriminative knowledge, and achieve state-of-the-art results on the YouTube Faces dataset.
Liveness Detection for Secure Face Authentication
A secure face recognition system also needs liveness detection, to determine whether the input corresponds to a genuine user. A secure system must be robust to sophisticated attacks, where the adversary might use face images, displays, 3D masks, or other means. Our novel deep learning engines for liveness detection successfully handle adversarial attacks arising from diverse sources.


Metric Learning

Deep Metric Learning with Multiclass Objectives
Deep metric learning based on contrastive or triplet loss suffers from slow convergence, since only one negative example is used while not interacting with other negative classes in each update. Our new metric learning objective, multi-class N-pair loss, generalizes triplet loss by allowing joint comparisons among N-1 negative examples and reduces computation for evaluating deep embedding vectors with an efficient batch construction strategy. This improves performance for several applications such as fine-grained recognition, clustering, retrieval and face recognition.

Appears in Improved Deep Metric Learning with Multi-class N-pair Loss Objective, NIPS 2016
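
The N-pair objective reduces to a cross-entropy over the similarity matrix of an N-pair batch, which makes it compact to write down. This is a minimal numpy sketch of the loss itself, not the full training pipeline (no embedding network or batch-construction logic).

```python
import numpy as np

def n_pair_loss(anchors, positives):
    """Multi-class N-pair loss for an N-pair batch.

    anchors, positives : (N, d) embeddings; row i of `positives` is the
    positive for row i of `anchors`, and every other row of `positives`
    serves as one of its N-1 negatives.
    """
    logits = anchors @ positives.T                # (N, N) similarity matrix
    # loss_i = log(1 + sum_{j != i} exp(f_i.f_j+ - f_i.f_i+)), which equals
    # cross-entropy with the diagonal entries as targets.
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
x /= np.linalg.norm(x, axis=1, keepdims=True)
# With positives identical to anchors, the diagonal dominates each row,
# so the loss is lower than with mismatched positives.
tight = n_pair_loss(x, x)
```

Because all N-1 negatives enter each update jointly, one batch provides the interactions that a triplet loss would need many separate triplets to cover.
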
Deep Metric Learning for Correspondence
We present deep metric learning to obtain a feature space that preserves geometric or semantic similarity. Our visual correspondences span from rigid motions to intra-class shape or appearance variations. Our fully convolutional architecture, together with a novel correspondence contrastive loss, allows faster training through effective reuse of computations, accurate gradient computation and linear-time testing, as opposed to the quadratic time of typical patch-similarity methods. We also propose a convolutional spatial transformer to mimic the patch normalization of traditional features such as SIFT.

Appears in Universal Correspondence Network, NIPS 2016
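The contrastive objective on feature pairs can be sketched directly; this is a minimal numpy version operating on features already sampled at candidate correspondence locations, leaving out the fully convolutional extraction and the spatial transformer.

```python
import numpy as np

def correspondence_contrastive_loss(feat_a, feat_b, match, margin=1.0):
    """Contrastive loss over dense feature pairs.

    feat_a, feat_b : (N, d) features sampled at paired locations.
    match          : (N,) array, 1 for a true correspondence, 0 otherwise.
    Positive pairs are pulled together; negative pairs are pushed apart
    until their distance exceeds the margin.
    """
    d = np.linalg.norm(feat_a - feat_b, axis=1)
    pos = match * d ** 2
    neg = (1 - match) * np.maximum(0.0, margin - d) ** 2
    return np.mean(pos + neg)

# A matched identical pair and a far-apart negative pair both cost nothing.
loss_zero = correspondence_contrastive_loss(
    np.array([[0.0, 0.0], [1.0, 0.0]]),
    np.array([[0.0, 0.0], [3.0, 0.0]]),
    np.array([1, 0]))
```

Computing `feat_a` and `feat_b` once per image with a fully convolutional network is what yields the linear-time behavior: all pair losses reuse the same feature maps.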


Correspondence and Landmark Localization

Weakly Supervised Correspondence
Our WarpNet matches images of objects in fine-grained datasets without using part annotations. It aligns an object in one image with a different object in another, by exploiting a fine-grained dataset to create artificial data for training a Siamese network with an unsupervised-discriminative learning approach. The output of the network acts as a spatial prior that allows generalization at test time to match real images across variations in appearance, viewpoint and articulation. This allows single-view reconstruction with quality comparable to using human annotations.

Appears in WarpNet: Weakly Supervised Matching for Single-view Reconstruction, CVPR 2016
Shape Bases for Articulated Pose Estimation
Our deep deformation network is a novel cascaded structure for localizing landmarks in non-rigid objects. A shape basis network combines the benefits of CNN features and a learned shape basis to reduce the complexity of the highly nonlinear pose manifold. A point transformer network then estimates local deformations, parameterized as thin-plate spline transformations, for finer refinement. It achieves state-of-the-art performance on several benchmarks for facial landmark localization, human body pose estimation and bird part localization.

Appears in Deep Deformation Network for Object Landmark Localization, ECCV 2016
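
The thin-plate spline machinery used by the point transformer can be sketched directly. In the paper a network regresses the TPS parameters; here they are fitted in closed form from known control-point correspondences, which shows the transformation itself on an illustrative example.

```python
import numpy as np

def _U(r):
    """TPS radial basis U(r) = r^2 log r, with U(0) = 0."""
    out = np.zeros_like(r)
    mask = r > 0
    out[mask] = r[mask] ** 2 * np.log(r[mask])
    return out

def tps_fit(control, target):
    """Solve for TPS parameters mapping 2D control points to targets."""
    n = len(control)
    K = _U(np.linalg.norm(control[:, None] - control[None, :], axis=2))
    P = np.hstack([np.ones((n, 1)), control])          # affine part [1 x y]
    A = np.zeros((n + 3, n + 3))
    A[:n, :n], A[:n, n:], A[n:, :n] = K, P, P.T
    b = np.vstack([target, np.zeros((3, 2))])
    return np.linalg.solve(A, b)                       # (n+3, 2) parameters

def tps_warp(points, control, params):
    """Apply a fitted TPS to new 2D points."""
    K = _U(np.linalg.norm(points[:, None] - control[None, :], axis=2))
    P = np.hstack([np.ones((len(points), 1)), points])
    return np.hstack([K, P]) @ params

control = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
target = control + np.array([[0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [0.0, -0.1]])
params = tps_fit(control, target)
warped = tps_warp(control, control, params)   # interpolates the targets
```

Because the TPS interpolates its control points exactly while staying smooth in between, it is a natural parameterization for local landmark refinement.
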
Deep Supervision for Semantic Object Parsing
We propose a deep CNN architecture that localizes semantic object parts in the 2D image and in 3D space, while inferring their visibility states, given a single RGB image. We exploit domain knowledge to regularize the network by deeply supervising its hidden layers, so that they infer a causal sequence of intermediate concepts. We render 3D CAD models of objects to generate large-scale synthetic data and to simulate challenging occlusion configurations between objects. The utility of our deep supervision is demonstrated by state-of-the-art performance on real-image benchmarks for 2D and 3D keypoint localization and instance segmentation.

Appears in Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object Parsing, CVPR 2017


Real-Time SFM and 3D Localization

Real-Time Monocular Structure from Motion
We propose the first real-time monocular SFM system to achieve performance comparable to stereo on large-scale datasets like KITTI. It corrects for scale drift using a novel cue-combination framework for ground plane estimation, drawing on multiple cues such as sparse features, dense inter-frame stereo and, when applicable, object detection. A data-driven mechanism learns models that relate the observation covariance of each cue to the error behavior of its underlying variables. At test time, this allows per-frame adaptation of observation covariances based on relative confidences inferred from visual data, boosting the accuracy of both SFM and 3D object localization.

Appears in Robust Scale Estimation in Real-Time Monocular SFM for Autonomous Driving, CVPR 2014
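
The core scale-correction step can be sketched as follows. Monocular SFM recovers translation only up to scale; given the camera's known height above the ground (fixed for a vehicle-mounted camera), the estimated ground plane yields a per-frame metric scale. The 1.5 m camera height below is an illustrative value, not taken from the paper.

```python
import numpy as np

KNOWN_CAMERA_HEIGHT = 1.5  # meters; illustrative value for a mounted camera

def scale_from_ground_plane(plane_normal, plane_d):
    """Metric scale from an estimated ground plane n.x + d = 0.

    In camera coordinates the camera sits at the origin, so its height above
    the plane is |d| / ||n||; the scale makes that match the known height.
    """
    est_height = abs(plane_d) / np.linalg.norm(plane_normal)
    return KNOWN_CAMERA_HEIGHT / est_height

def correct_translation(t, scale):
    """Rescale an up-to-scale SFM translation into metric units."""
    return np.asarray(t) * scale

# Ground plane estimated 1.0 (arbitrary) unit below the camera -> scale 1.5.
s = scale_from_ground_plane(np.array([0.0, 1.0, 0.0]), -1.0)
```

The cue-combination framework in the paper effectively replaces the single plane estimate here with a fused estimate whose per-cue covariances adapt frame by frame.
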
Monocular 3D Localization in Road Scenes
We propose a single-camera system for highly accurate 3D localization of objects such as cars in autonomous driving applications. It jointly uses information from complementary modalities such as structure from motion (SFM) and object detection to achieve high localization accuracy in both near and far fields. We make novel use of raw detection scores, allowing our 3D bounding boxes to adapt to better-quality 3D cues. Our formulation can be regarded as an extension of sparse bundle adjustment that incorporates object detection cues.

Appears in Joint SFM and Detection Cues for Monocular 3D Localization in Road Scenes, CVPR 2015
Occlusion Models for 3D Scene Understanding
We present a physically interpretable, continuous 3D model for handling occlusions in road scenes. We probabilistically assign each point in space to an object with a theoretical modeling of the reflection and transmission probabilities for the corresponding camera ray. We handle both SFM point tracks with occluding objects and object detection scores in 3D localization. Our model uniformly handles static and dynamic objects, which is an advantage over motion segmentation approaches traditionally used in multibody SFM.

Appears in A Continuous Occlusion Model for Road Scene Understanding, CVPR 2016
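
The per-ray reasoning can be sketched in a few lines: the probability that a camera ray reaches a given depth is the product of the transmission probabilities of all objects in front of it, and the probability of reflecting off an object is that transmission times the object's own occupancy. The depths and occupancy values below are illustrative, and the continuous 3D occupancy of the paper is collapsed here to one scalar per object.

```python
import numpy as np

def reflection_probabilities(depths, occupancies):
    """Per-object reflection probability along one camera ray.

    depths      : (n,) object depths along the ray, in any order.
    occupancies : (n,) probability that each object blocks the ray.
    """
    order = np.argsort(depths)
    p = np.zeros(len(depths))
    transmit = 1.0
    for i in order:                       # walk front to back along the ray
        p[i] = transmit * occupancies[i]  # ray reflects off object i here
        transmit *= 1.0 - occupancies[i]  # or passes through and continues
    return p

# Nearer object (depth 2, occupancy 0.8) reflects with probability 0.8;
# the farther one only if the ray passes through it: 0.2 * 0.5 = 0.1.
p = reflection_probabilities(np.array([5.0, 2.0]), np.array([0.5, 0.8]))
```

Because the same transmission bookkeeping applies whether an object is static or moving, this formulation treats both uniformly, which is the advantage over motion-segmentation-based multibody SFM noted above.
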