We conduct research in computer vision and machine learning, with a focus on sustaining excellence in three main directions: (1) scene understanding; (2) recognition and representation; and (3) adaptation, fairness and privacy. Key applications of our research include visual surveillance and autonomous driving. We tackle fundamental problems in computer vision, such as object detection, semantic segmentation, face recognition, 3D reconstruction and behavior prediction. We develop and leverage breakthroughs in deep learning, particularly with a flavor of weak supervision, metric learning and domain adaptation.

CVPR 2017 | Learning random-walk label propagation for weakly-supervised semantic segmentation
Paul Vernaza, Manmohan Chandraker

Large-scale training for semantic segmentation is challenging due to the expense of obtaining training data. Given cheaply obtained sparse image labelings, we propagate the sparse labels to produce guessed dense labelings using random-walk hitting probabilities, which leads to a differentiable parameterization with uncertainty estimates that are incorporated into our loss. We show that our method can effectively learn semantic edges given no direct edge supervision. 
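
The propagation step can be illustrated on a toy graph: treat pixels as nodes and compute, for every unlabeled node, the probability that a random walk started there first hits a seed of each class. The following is a minimal NumPy sketch of that idea on a small chain graph, not the paper's differentiable implementation; the function name and graph are illustrative.

```python
import numpy as np

def hitting_probabilities(W, seed_labels):
    """Per-class probabilities that a random walk started at each node
    first hits a seed of that class.

    W           : (n, n) symmetric affinity matrix
    seed_labels : dict {node_index: class_index} of sparse seeds
    Returns an (n, n_classes) array of dense soft labels.
    """
    n = W.shape[0]
    classes = sorted(set(seed_labels.values()))
    seeds = sorted(seed_labels)
    unlabeled = [i for i in range(n) if i not in seed_labels]

    P = W / W.sum(axis=1, keepdims=True)        # row-stochastic transitions
    Puu = P[np.ix_(unlabeled, unlabeled)]
    Pus = P[np.ix_(unlabeled, seeds)]

    # One-hot matrix of seed classes.
    Y = np.zeros((len(seeds), len(classes)))
    for r, s in enumerate(seeds):
        Y[r, classes.index(seed_labels[s])] = 1.0

    # Absorbing-chain solution: H = (I - Puu)^{-1} Pus Y
    H = np.linalg.solve(np.eye(len(unlabeled)) - Puu, Pus @ Y)

    out = np.zeros((n, len(classes)))
    out[seeds] = Y
    out[unlabeled] = H
    return out

# Chain graph 0-1-2-3-4 with seeds of two classes at the two ends.
W = np.zeros((5, 5))
for i in range(4):
    W[i, i + 1] = W[i + 1, i] = 1.0
dense = hitting_probabilities(W, {0: 0, 4: 1})
```

On a chain with seeds at the two ends, the soft labels interpolate linearly between the seeds, matching the classic gambler's-ruin solution.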

CVPR 2017 | DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents
Namhoon Lee, Wongun Choi, Paul Vernaza, Christopher B. Choy, Philip H. S. Torr, Manmohan Chandraker

We introduce a Deep Stochastic IOC RNN encoder-decoder framework, DESIRE, for predicting the future of multiple interacting agents in dynamic scenes. It produces accurate future predictions by tackling the multi-modality of futures while accounting for a rich set of both static and dynamic scene contexts. It generates a diverse set of hypothetical prediction samples, then ranks and refines them through a deep IOC network.

CVPR 2017 | Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object Parsing
Chi Li, Zeeshan Zia, Quoc-Huy Tran, Xiang Yu, Gregory D. Hager, Manmohan Chandraker

We propose a deep CNN architecture to localize object semantic parts in 2D images and 3D space while inferring their visibility states, given a single RGB image. We exploit domain knowledge to regularize the network by deeply supervising its hidden layers, so that it infers a causal sequence of intermediate concepts. We render 3D object CAD models to generate large-scale synthetic data and simulate challenging occlusion configurations between objects. The utility of our deep supervision is demonstrated by state-of-the-art performance on real-image benchmarks for 2D and 3D keypoint localization and instance segmentation.

CVPR 2017 | Deep Network Flow for Multi-Object Tracking
Samuel Schulter, Paul Vernaza, Wongun Choi, Manmohan Chandraker

We demonstrate that features for network-flow-based data association can be learned via backpropagation, by expressing the optimum of a smoothed network flow problem as a differentiable function of the pairwise association costs. We apply this approach to multi-object tracking with a network flow formulation. Our experiments show that all cost functions for the association problem can be learned end-to-end, outperforming hand-crafted costs in all settings. This also makes it easy to integrate and combine various input sources, since the cost functions are learned entirely from data, alleviating tedious hand-design of costs.

NeurIPS 2017 | Learning Efficient Object Detection Models with Knowledge Distillation
Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, Manmohan Chandraker

Deep object detectors require prohibitive runtimes for real-time applications. Model compression can produce compact models with fewer parameters, but accuracy degrades significantly. In this work, we propose a new framework to learn compact and fast object detection networks with improved accuracy, using knowledge distillation and hint learning. Our results show consistent improvement in the accuracy-speed trade-off across PASCAL, KITTI, ILSVRC and MS-COCO.
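
The distillation component of such a framework typically combines a hard cross-entropy loss on the ground-truth labels with a soft cross-entropy against temperature-scaled teacher outputs. Below is a minimal NumPy sketch of that generic loss; the hyperparameters `T` and `alpha` are illustrative, not the paper's values.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax, numerically stabilized."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Weighted sum of hard cross-entropy on ground truth and soft
    cross-entropy against the teacher's temperature-softened outputs."""
    p_student = softmax(student_logits)
    hard = -np.log(p_student[np.arange(len(labels)), labels]).mean()

    q_teacher = softmax(teacher_logits, T)
    q_student = softmax(student_logits, T)
    soft = -(q_teacher * np.log(q_student)).sum(axis=1).mean()

    return alpha * hard + (1 - alpha) * soft
```

A student that agrees with a confident teacher incurs a lower loss than one that contradicts it, which is what drives the compact model toward the teacher's behavior.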

NeurIPS 2016 | Improved Deep Metric Learning with Multi-class N-pair Loss Objective
Kihyuk Sohn

We tackle the slow convergence of training deep neural networks for metric learning by proposing the multi-class N-pair loss. Unlike objectives that ignore the information lying in the interconnections between samples, the N-pair loss utilizes the full interaction among examples from different classes within a batch. We also propose an efficient batch construction strategy that uses only N pairs of examples.
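
The N-pair loss can be viewed as softmax cross-entropy over the similarity matrix between N anchors and their N positives, so each anchor is simultaneously contrasted against the other N-1 classes' positives. A minimal NumPy sketch of this generic form (not the author's code):

```python
import numpy as np

def n_pair_loss(anchors, positives):
    """Multi-class N-pair loss: anchor i should match positive i
    against all other positives in the batch.

    anchors, positives : (N, d) embeddings, one pair per class.
    """
    logits = anchors @ positives.T                    # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))                   # cross-entropy on the diagonal
```

When each anchor is far more similar to its own positive than to the others, the loss approaches zero; mismatched pairings drive it up.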

NeurIPS 2016 | Universal Correspondence Network
Christopher B. Choy, JunYoung Gwak, Silvio Savarese, Manmohan Chandraker

We present deep metric learning to obtain a feature space that preserves geometric or semantic similarity, with visual correspondences spanning rigid motions to intra-class shape and appearance variations. Our fully convolutional architecture, together with a novel correspondence contrastive loss, allows faster training through effective reuse of computations, accurate gradient computation, and linear-time testing instead of the quadratic time of typical patch-similarity methods. We also propose a convolutional spatial transformer that mimics patch normalization in traditional features such as SIFT.
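
The correspondence contrastive loss follows the standard contrastive form, applied to features sampled at candidate correspondences: true matches are pulled together, non-matches pushed beyond a margin. A minimal NumPy sketch of that generic form, with illustrative names:

```python
import numpy as np

def correspondence_contrastive_loss(feat_a, feat_b, is_match, margin=1.0):
    """feat_a, feat_b : (K, d) features at K candidate correspondences
    is_match         : (K,) boolean, True where the pair is a true match
    """
    d = np.linalg.norm(feat_a - feat_b, axis=1)
    pos = d[is_match] ** 2                              # pull matches together
    neg = np.clip(margin - d[~is_match], 0.0, None) ** 2  # push non-matches apart
    return (pos.sum() + neg.sum()) / len(d)
```

The loss vanishes once matches coincide and non-matches lie beyond the margin, which is the equilibrium the embedding is trained toward.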

ECCV 2016 | Attribute2Image: Conditional Image Generation from Visual Attributes
Xinchen Yan, Jimei Yang, Kihyuk Sohn, Honglak Lee

We investigate the novel problem of generating images from visual attributes. We model the image as a composite of foreground and background and develop a layered generative model with disentangled latent variables that can be learned end-to-end using a variational auto-encoder. For posterior inference of the latent variables given a novel image, we use a general energy minimization algorithm. Experiments on natural images of faces and birds demonstrate that the proposed models generate realistic and diverse samples with disentangled latent representations, and show excellent quantitative and visual results on attribute-conditioned image reconstruction and completion.

ECCV 2016 | A 4D Light-Field Dataset and CNN Architectures for Material Recognition
Ting-Chun Wang, Jun-Yan Zhu, Ebi Hiroaki, Manmohan Chandraker, Alexei Efros, Ravi Ramamoorthi

We introduce a new light-field dataset of materials, and take advantage of the recent success of deep learning to perform material recognition on the 4D light-field. Our dataset contains 12 material categories, each with 100 images taken with a Lytro Illum, from which we extract about 30,000 patches in total. To the best of our knowledge, this is the first mid-size dataset for light-field images. Our main goal is to investigate whether the additional information in a light-field (such as multiple sub-aperture views and view-dependent reflectance effects) can aid material recognition. Since recognition networks have not been trained on 4D images before, we propose and compare several novel CNN architectures to train on light-field images. 

ECCV 2016 | Deep Deformation Network for Object Landmark Localization
Xiang Yu, Feng Zhou, Manmohan Chandraker

We propose a cascaded framework for localizing landmarks in non-rigid objects. The first stage initializes the shape as constrained to lie within a low-rank manifold, and the second stage estimates local deformations parameterized as thin-plate spline transformations. Since our framework does not incorporate either handcrafted features or part connectivity, it is easy to train and test, and generally applicable to various object types. 
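
The second-stage deformation can be illustrated with a standard thin-plate spline fit: given control-point pairs, solve the TPS linear system and warp query points. This is a generic textbook TPS sketch, not the network's learned parameterization.

```python
import numpy as np

def tps_warp(control_src, control_dst, points):
    """Fit a 2D thin-plate spline mapping control_src -> control_dst,
    then apply it to `points`."""
    def U(r2):
        # TPS radial basis r^2 log r^2, with U(0) = 0.
        with np.errstate(divide="ignore", invalid="ignore"):
            return np.where(r2 > 0.0, r2 * np.log(r2), 0.0)

    n = len(control_src)
    K = U(((control_src[:, None] - control_src[None]) ** 2).sum(-1))
    P = np.hstack([np.ones((n, 1)), control_src])

    # Assemble the standard TPS system [[K, P], [P^T, 0]].
    A = np.zeros((n + 3, n + 3))
    A[:n, :n] = K
    A[:n, n:] = P
    A[n:, :n] = P.T
    b = np.zeros((n + 3, 2))
    b[:n] = control_dst

    params = np.linalg.lstsq(A, b, rcond=None)[0]
    w, a = params[:n], params[n:]

    K_q = U(((points[:, None] - control_src[None]) ** 2).sum(-1))
    P_q = np.hstack([np.ones((len(points), 1)), points])
    return K_q @ w + P_q @ a
```

When the control-point displacement is purely affine (e.g. a translation), the TPS solution reduces to that affine map and warps every query point accordingly.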

CVPR 2016 | Embedding Label Structures for Fine-Grained Feature Representation
Xiaofan Zhang, Feng Zhou, Yuanqing Lin, Shaoting Zhang

We model the multi-level relevance among fine-grained classes for fine-grained categorization. We jointly optimize classification and similarity constraints in a proposed multi-task learning framework, and embed label structures such as hierarchy or shared attributes into the framework by generalizing the triplet loss. This significantly outperforms previous fine-grained feature representations for image retrieval at different levels of relevance.
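
Generalizing the triplet loss to a label structure amounts to demanding a larger margin for negatives from more distant classes (same species < same genus < different family). A deliberately minimal sketch of that idea; the margin schedule is illustrative, not the paper's:

```python
import numpy as np

def structured_triplet_loss(anchor, positive, negative, margin):
    """Triplet loss whose margin reflects label structure: a negative
    from a nearby class in the hierarchy uses a small margin, while a
    negative from a distant class must be pushed further away."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)
```

The same triplet can thus be "solved" at a fine level of relevance yet still incur loss at a coarser level that demands a wider margin.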

CVPR 2016 | Fine-grained Image Classification by Exploring Bipartite-Graph Labels
Feng Zhou, Yuanqing Lin

We exploit the rich relationships among fine-grained classes for fine-grained image classification. We model these relations with the proposed bipartite-graph labels (BGL) and incorporate them into CNN training. Thanks to the bipartite structure, our system is computationally efficient at inference. We also construct a new food benchmark dataset of 37,885 food images collected from 6 restaurants and 975 menus in total.

CVPR 2016 | Exploit All the Layers: Fast and Accurate CNN Object Detector with Scale Dependent Pooling and Cascaded Rejection Classifiers
Fan Yang, Wongun Choi, Yuanqing Lin

We investigate two strategies to make a CNN-based object detector both fast and accurate. Scale-dependent pooling (SDP) improves accuracy by pooling convolutional features from the layer whose scale matches each object proposal, which particularly benefits small objects. Cascaded rejection classifiers (CRC) treat intermediate convolutional features as weak classifiers to quickly reject easy negative proposals, greatly accelerating detection with little loss in accuracy.

CVPR 2016 | A Continuous Occlusion Model for Road Scene Understanding
Vikas Dhiman, Quoc-Huy Tran, Jason Corso, Manmohan Chandraker

We present a physically interpretable 3D model for handling occlusions, with applications to road scene understanding. Given object detections and SFM point tracks, our unified model probabilistically assigns point tracks to objects and reasons about object detection scores and bounding boxes. It uniformly handles static and dynamic objects and thus outperforms motion segmentation on association problems. Further, we also demonstrate occlusion-aware 3D localization in road scenes.

CVPR 2016 | SVBRDF-Invariant Shape and Reflectance Estimation from Light-Field Cameras
Ting-Chun Wang, Manmohan Chandraker, Alexei Efros, Ravi Ramamoorthi

We derive a spatially-varying (SV)BRDF-invariant theory for recovering 3D shape and reflectance from light-field cameras. Our key theoretical insight is a novel analysis of diffuse plus single-lobe SVBRDFs under a light-field setup. We show that, although direct shape recovery is not possible, an equation relating depths and normals can still be derived. Using this equation, we then propose using a polynomial (quadratic) shape prior to resolve the shape ambiguity. Once shape is estimated, we also recover the reflectance. 

CVPR 2016 | Fine-grained Categorization and Dataset Bootstrapping using Deep Metric Learning with Humans in the Loop
Yin Cui, Feng Zhou, Yuanqing Lin, Serge Belongie

We propose an iterative framework for fine-grained categorization and dataset bootstrapping. Using deep metric learning with humans in the loop, we learn a low dimensional feature embedding with anchor points on manifolds for each category. In each round, images with high confidence scores are sent to humans for labeling and the model is retrained based on the updated dataset. The proposed framework leads to significant performance gain.

CVPR 2016 | WarpNet: Weakly Supervised Matching for Single-view Reconstruction
Angjoo Kanazawa, Manmohan Chandraker, David W. Jacobs

Our WarpNet matches images of objects in fine-grained datasets without using part annotations. It aligns an object in one image with a different object in another, by exploiting a fine-grained dataset to create artificial data for training a Siamese network with an unsupervised-discriminative learning approach. The output of the network acts as a spatial prior that allows generalization at test time to match real images across variations in appearance, viewpoint and articulation. This allows single-view reconstruction with quality comparable to using human annotation. 

ICML 2016 | Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units
Wenling Shang, Kihyuk Sohn, Diogo Almeida, Honglak Lee

We show that the first few convolution layers of a deep CNN with ReLU activations capture both negative and positive phase information by learning pairs or groups of negatively correlated filters, which implies a redundancy among these filters. We propose a simple yet effective activation scheme, concatenated ReLU (CReLU), that removes this redundancy and achieves better reconstruction and regularization properties.
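
CReLU itself is a one-liner: concatenate the positive and negative phases along the channel axis, doubling the channel count. A NumPy sketch:

```python
import numpy as np

def crelu(x, axis=-1):
    """Concatenated ReLU: keeps both the positive phase ReLU(x) and the
    negative phase ReLU(-x), doubling the number of channels so no sign
    information is discarded."""
    return np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)], axis=axis)
```

Because the negative phase is preserved explicitly, the network no longer needs to learn a negated copy of each early filter.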

WACV 2016 | Atomic Scenes for Scalable Traffic Scene Recognition in Monocular Videos
Chao-Yeh Chen, Wongun Choi, Manmohan Chandraker

We propose a novel framework for monocular traffic scene recognition that decomposes scenes into high-order and atomic scenes. High-order scenes carry semantic meaning useful for driver-assistance applications, while atomic scenes are easy to learn and represent elemental behaviors based on 3D localization of individual traffic participants. A novel hierarchical model captures co-occurrence and mutual-exclusion relationships while incorporating both low-level trajectory features and high-level scene features, with parameters learned using a structured support vector machine. We further propose efficient inference that exploits the structure of our model to achieve real-time rates.

