We conduct research in computer vision and machine learning, with a focus on sustaining excellence in three main directions: (1) scene understanding; (2) recognition and representation; and (3) adaptation, fairness and privacy. Key applications of our research include visual surveillance and autonomous driving. We tackle fundamental problems in computer vision, such as object detection, semantic segmentation, face recognition, 3D reconstruction and behavior prediction. We develop and leverage breakthroughs in deep learning, particularly with a flavor of weak supervision, metric learning and domain adaptation.

Recognition and Representation

We learn powerful representations from visual data that generalize across variations in imaging conditions. Our solutions achieve robustness, speed and social fairness with minimal human labeling effort, by reasoning about categories, shapes, actions and relationships.

ECCV 2020 | Object Detection with a Unified Label Space from Multiple Datasets
Xiangyun Zhao, Samuel Schulter, Gaurav Sharma, Yi-Hsuan Tsai, Manmohan Chandraker, Ying Wu

Given multiple datasets with different label spaces, the goal of this work is to train a single object detector that predicts over the union of all the label spaces. The practical benefits of such a detector are significant: application-relevant categories can be picked and merged from arbitrary existing datasets. However, naïvely merging the datasets is not possible, due to inconsistent object annotations. To address this challenge, we design a framework that works with such partial annotations, and we exploit a pseudo-labeling approach adapted to our specific setting.

PDF | Supplementary | Project Site | Dataset
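
To illustrate the partial-annotation idea above (a minimal numpy sketch, not the paper's actual implementation; the function name and sigmoid formulation are assumptions), a per-class loss can be restricted to the categories annotated in each sample's source dataset, leaving the masked-out categories free to be filled by pseudo-labels:

```python
import numpy as np

def masked_detection_loss(logits, targets, annotated_mask):
    """Classification loss under partial annotations (illustrative sketch).

    Each dataset annotates only a subset of the unified label space, so the
    loss is computed only over categories annotated in the sample's source
    dataset; pseudo-labels could supervise the masked-out categories.
    """
    p = 1.0 / (1.0 + np.exp(-logits))          # per-class sigmoid scores
    bce = -(targets * np.log(p) + (1 - targets) * np.log(1 - p))
    return np.sum(bce * annotated_mask) / np.sum(annotated_mask)
```

Categories outside the mask contribute nothing to the gradient, so a wrong "background" label from a dataset that never annotated that category cannot penalize the detector.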
ECCV 2020 | Improving Face Recognition by Clustering Unlabeled Faces in the Wild
Aruni RoyChowdhury, Xiang Yu, Kihyuk Sohn, Erik Learned-Miller, Manmohan Chandraker

We propose a novel identity-separation method based on extreme value theory, formulated as an out-of-distribution detection algorithm, which greatly reduces the problems caused by overlapping-identity label noise. Treating cluster assignments as pseudo-labels, we must also overcome the labeling noise introduced by clustering errors, so we propose a modulation of the cosine loss in which the modulation weights correspond to an estimate of clustering uncertainty.
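
A minimal numpy sketch of the uncertainty-modulated idea (illustrative only; the function name, scale parameter, and exact weighting are assumptions, not the paper's implementation): each sample's cosine-softmax loss is down-weighted by its estimated clustering uncertainty.

```python
import numpy as np

def modulated_cosine_loss(embeddings, class_weights, labels, uncertainty, s=30.0):
    """Cosine softmax loss modulated by clustering uncertainty (sketch).

    `uncertainty` in [0, 1] is a per-sample estimate of clustering noise;
    samples with high uncertainty contribute less to the loss.
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    logits = s * (e @ w.T)                              # scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_sample = -log_p[np.arange(len(labels)), labels]
    mod = 1.0 - uncertainty                              # down-weight noisy pseudo-labels
    return np.sum(mod * per_sample) / np.sum(mod)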

CVPR 2020 | Towards Universal Representation Learning for Deep Face Recognition
Yichun Shi, Xiang Yu, Kihyuk Sohn, Manmohan Chandraker, Anil K. Jain

Traditional recognition models require target-domain data to adapt from high-quality training data to unconstrained, low-quality face recognition, and often resort to model ensembles for a universal representation, which significantly increases model complexity. In contrast, our universal face representation learning (URFace) works only on the original training data, without any target-domain information, and handles unconstrained and unseen testing scenarios.

WACV 2020 | Unsupervised and Semi-Supervised Domain Adaptation for Action Recognition from Drones
Jinwoo Choi, Gaurav Sharma, Manmohan Chandraker, Jia-Bin Huang

We address the problem of human action classification in drone videos. Due to the high cost of capturing and labeling large-scale drone videos with diverse actions, we present unsupervised and semi-supervised domain adaptation approaches that leverage both the existing fully annotated action recognition datasets and unannotated (or only a few annotated) videos from drones. To study the emerging problem of drone-based action recognition, we create a new dataset, NEC-DRONE, containing 5,250 videos to evaluate the task.

PDF | Project Site | Dataset
CVPR 2019 | A Parametric Top-View Representation of Complex Road Scenes
Ziyan Wang, Buyu Liu, Samuel Schulter, Manmohan Chandraker

We address the problem of inferring the layout of complex road scenes given a single camera as input. To achieve that, we first propose a novel parameterized model of road layouts in a top-view representation, which is not only intuitive for human visualization but also provides an interpretable interface for higher-level decision making. Moreover, the design of our top-view scene model allows for efficient sampling and thus generation of large-scale simulated data, which we leverage to train a deep neural network to infer our scene model's parameters. Finally, we design a Conditional Random Field (CRF) that enforces coherent predictions for a single frame and encourages temporal smoothness among video frames.

PDF | Project Site | Dataset
CVPR 2019 | Feature Transfer Learning for Face Recognition with Under-Represented Data
Xi Yin, Xiang Yu, Kihyuk Sohn, Xiaoming Liu, Manmohan Chandraker

Training with under-represented data leads to biased classifiers in conventionally-trained deep networks. We propose a center-based feature transfer framework to augment the feature space of under-represented subjects from the regular subjects that have sufficiently diverse samples. A Gaussian prior of the variance is assumed across all subjects and the variance from regular ones is transferred to the under-represented ones. This encourages the under-represented distribution to be closer to the regular distribution. Further, an alternating training regimen is proposed to simultaneously achieve less biased classifiers and a more discriminative feature representation. 
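
The center-based transfer can be sketched in numpy as follows (an illustrative toy version under the shared-variance assumption; the function name is hypothetical, and the paper performs this in a learned feature space rather than directly on raw features): the within-class deviations of a regular subject are re-applied around an under-represented subject's center.

```python
import numpy as np

def transfer_features(regular_feats, ur_center):
    """Center-based feature transfer (illustrative sketch).

    Under a Gaussian prior with variance shared across subjects, the
    deviations of a regular subject's features from its own center are
    transplanted onto the center of an under-represented subject,
    augmenting that subject with diverse synthetic features.
    """
    reg_center = regular_feats.mean(axis=0)
    return ur_center + (regular_feats - reg_center)
```

The augmented samples have the under-represented subject's mean but the regular subject's spread, pushing the under-represented distribution closer to the regular one.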

CVPR 2019 | Gotta Adapt ’Em All: Joint Pixel and Feature-Level Domain Adaptation for Recognition in the Wild
Luan Tran, Kihyuk Sohn, Xiang Yu, Xiaoming Liu, Manmohan Chandraker

We provide a solution that transfers knowledge from fully annotated source images to unlabeled target images, often captured under different conditions. We adapt at multiple semantic levels, from feature to pixel, with complementary insights for each type. Using the proposed method, we achieve better recognition accuracy on car images from an unlabeled surveillance domain by adapting knowledge from car images on the web.

ICML 2019 | Neural Collaborative Subspace Clustering
Tong Zhang, Pan Ji, Mehrtash Harandi, Wenbing Huang, Hongdong Li

We introduce Neural Collaborative Subspace Clustering, a neural model that discovers clusters of data points drawn from a union of low-dimensional subspaces. In contrast to previous attempts, our model runs without the aid of spectral clustering, which makes it one of the few algorithms that can gracefully scale to large datasets. At its heart, our model benefits from a classifier that determines whether a pair of points lies on the same subspace. Essential to our model is the construction of two affinity matrices, one from the classifier and one based on a notion of subspace self-expressiveness, to supervise training in a collaborative scheme.

ICCV 2019 | Domain Adaptation for Structured Output via Discriminative Patch Representations
Yi-Hsuan Tsai, Kihyuk Sohn, Samuel Schulter, Manmohan Chandraker

We tackle domain-adaptive semantic segmentation by learning discriminative feature representations of patches in the source domain, discovering multiple modes of the patch-wise output distribution through the construction of a clustered space. With such guidance, we use an adversarial learning scheme to push the feature representations of target patches in the clustered space closer to the distributions of source patches. We show that our framework is complementary to existing domain adaptation techniques.

PDF | Supplementary | Project Site | Dataset
ECCV 2018 | Zero-Shot Object Detection
Ankan Bansal, Karan Sikka, Gaurav Sharma, Rama Chellappa, Ajay Divakaran

We introduce and tackle the problem of zero-shot object detection (ZSD), which aims to detect object classes that are not observed during training. We work with a challenging set of object classes, not restricting ourselves to similar and/or fine-grained categories as in prior works on zero-shot classification. We present a principled approach by first adapting visual-semantic embeddings for ZSD. We then discuss the problems associated with selecting a background class and propose two background-aware approaches for learning robust detectors. Finally, we propose novel splits of two standard detection datasets – MSCOCO and VisualGenome, and present extensive empirical results. 

ICCV 2017 | Towards Large-Pose Face Frontalization in the Wild
Xi Yin, Xiang Yu, Kihyuk Sohn, Xiaoming Liu, Manmohan Chandraker

Despite recent advances in deep face recognition, severe accuracy drops are observed under large pose variations. Learning pose-invariant features is feasible but requires expensively labeled data. In this work, we focus on frontalizing faces in the wild under various head poses. We propose a novel deep 3D Morphable Model (3DMM)-conditioned Face Frontalization Generative Adversarial Network, termed FF-GAN, to generate neutral-pose face images with photo-realistic visual quality.

ICCV 2017 | Unsupervised Domain Adaptation for Face Recognition in Unlabeled Videos
Kihyuk Sohn, Sifei Liu, Guangyu Zhong, Xiang Yu, Ming-Hsuan Yang, Manmohan Chandraker

Despite rapid advances in face recognition, there remains a clear gap between the performance of still-image-based and video-based face recognition. To address this, we propose an image-to-video feature-level domain adaptation method to learn discriminative video frame representations. It distills knowledge from a still-image network to a video adaptation network, performs feature restoration through synthetic data augmentation, and learns a domain-invariant feature through a domain-adversarial discriminator. Experiments on YouTube Faces and IJB-A demonstrate that our method achieves state-of-the-art accuracy on video face recognition.

ICCV 2017 | Scene Parsing with Global Context Embedding
Wei-Chih Hung, Yi-Hsuan Tsai, Xiaohui Shen, Zhe Lin, Kalyan Sunkavalli, Xin Lu, Ming-Hsuan Yang

We present a scene parsing method that utilizes global context information based on both parametric and non-parametric models. Compared to previous methods that only exploit the local relationship between objects, we train a context network based on scene similarities to generate feature representations for global contexts. We show that the proposed method can eliminate false positives that are not compatible with the global context representations.

CVPR 2017 | Learning random-walk label propagation for weakly-supervised semantic segmentation
Paul Vernaza, Manmohan Chandraker

Large-scale training for semantic segmentation is challenging due to the expense of obtaining training data. Given cheaply obtained sparse image labelings, we propagate the sparse labels to produce guessed dense labelings using random-walk hitting probabilities, which leads to a differentiable parameterization with uncertainty estimates that are incorporated into our loss. We show that our method can effectively learn semantic edges given no direct edge supervision. 
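
On a small graph, the hitting probabilities used above can be computed in closed form with the standard absorbing-Markov-chain identity B = (I − Q)⁻¹R (a minimal numpy sketch; the function name and graph setup are illustrative assumptions, not the paper's differentiable formulation over images):

```python
import numpy as np

def hitting_probabilities(transition, seed_labels, num_classes):
    """Random-walk hitting probabilities on a small graph (sketch).

    `transition` is a row-stochastic matrix over nodes; nodes with a seed
    label (seed_labels[i] >= 0) are treated as absorbing. Returns, per
    node, the probability that a walk first hits a seed of each class.
    """
    n = len(seed_labels)
    unlabeled = [i for i in range(n) if seed_labels[i] < 0]
    labeled = [i for i in range(n) if seed_labels[i] >= 0]
    Q = transition[np.ix_(unlabeled, unlabeled)]   # transient-to-transient
    R = transition[np.ix_(unlabeled, labeled)]     # transient-to-absorbing
    B = np.linalg.solve(np.eye(len(unlabeled)) - Q, R)  # absorption probs
    probs = np.zeros((n, num_classes))
    for node in labeled:
        probs[node, seed_labels[node]] = 1.0
    for i, node in enumerate(unlabeled):
        for j, lab_node in enumerate(labeled):
            probs[node, seed_labels[lab_node]] += B[i, j]
    return probs
```

For instance, an unlabeled node that steps with equal probability toward a class-0 seed and a class-1 seed receives the soft labeling (0.5, 0.5), which is exactly the kind of uncertainty estimate the loss can incorporate.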

NeurIPS 2017 | Learning Efficient Object Detection Models with Knowledge Distillation
Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, Manmohan Chandraker

Deep object detectors require prohibitive runtimes to process an image for real-time applications. Model compression can learn compact models with fewer parameters, but accuracy is significantly degraded. In this work, we propose a new framework to learn compact and fast object detection networks with improved accuracy using knowledge distillation and hint learning. Our results show consistent improvement in accuracy-speed trade-off across PASCAL, KITTI, ILSVRC and MS-COCO. 
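
For the classification head, the distillation objective can be sketched as a weighted sum of hard-label cross-entropy and cross-entropy against the teacher's temperature-softened outputs (an illustrative numpy sketch; function names, temperature, and weighting are assumptions, and the paper additionally handles regression and hint layers):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target distillation loss for a classification head (sketch).

    Blends cross-entropy on ground-truth labels with cross-entropy on the
    teacher's softened class distribution.
    """
    p_teacher = softmax(teacher_logits, T)
    log_p_student_T = np.log(softmax(student_logits, T))
    soft = -np.mean((p_teacher * log_p_student_T).sum(axis=-1))
    log_p_student = np.log(softmax(student_logits))
    hard = -np.mean(log_p_student[np.arange(len(labels)), labels])
    return alpha * hard + (1 - alpha) * soft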

NeurIPS 2016 | Improved Deep Metric Learning with Multi-class N-pair Loss Objective
Kihyuk Sohn

We tackle the problem of unsatisfactory convergence of training a deep neural network for metric learning by proposing multi-class N-pair loss. Unlike many other objective functions that ignore the information lying in the interconnections between the samples, N-pair loss utilizes full interaction of the examples from different classes within a batch. We also propose an efficient batch construction strategy using only N pairs of examples. 
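
With the (anchor, positive) batch construction above, the N-pair loss reduces to a softmax cross-entropy over the N×N similarity matrix, using every other positive in the batch as a negative. A minimal numpy sketch (the function name is hypothetical; this omits the paper's embedding regularization):

```python
import numpy as np

def n_pair_loss(anchors, positives):
    """Multi-class N-pair loss (illustrative sketch).

    anchors, positives: (N, D) arrays; row i of `positives` is the positive
    for row i of `anchors`, and all other rows act as negatives.
    """
    logits = anchors @ positives.T                        # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))                 # pull diagonal pairs together
```

Because each batch of N pairs yields N−1 negatives per anchor at no extra sampling cost, this interacts with far more inter-class structure than a triplet loss on the same batch.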

Teaser Figure ECCV 2016 | A 4D Light-Field Dataset and CNN Architectures for Material Recognition
Ting-Chun Wang, Jun-Yan Zhu, Ebi Hiroaki, Manmohan Chandraker, Alexei Efros, Ravi Ramamoorthi

We introduce a new light-field dataset of materials, and take advantage of the recent success of deep learning to perform material recognition on the 4D light-field. Our dataset contains 12 material categories, each with 100 images taken with a Lytro Illum, from which we extract about 30,000 patches in total. To the best of our knowledge, this is the first mid-size dataset for light-field images. Our main goal is to investigate whether the additional information in a light-field (such as multiple sub-aperture views and view-dependent reflectance effects) can aid material recognition. Since recognition networks have not been trained on 4D images before, we propose and compare several novel CNN architectures to train on light-field images. 

CVPR 2016 | Embedding Label Structures for Fine-Grained Feature Representation
Xiaofan Zhang, Feng Zhou, Yuanqing Lin, Shaoting Zhang

We model the multi-level relevance among fine-grained classes for fine-grained categorization. We jointly optimize classification and similarity constraints in a proposed multi-task learning framework, and embed label structures such as hierarchy or shared attributes into the framework by generalizing the triplet loss. This significantly outperforms previous fine-grained feature representations for image retrieval at different levels of relevance.
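
One way to picture the generalized triplet loss (an illustrative numpy sketch under our reading, not the paper's exact formulation; the function name is hypothetical) is a margin that grows with the negative's distance from the anchor in the label hierarchy:

```python
import numpy as np

def hierarchical_triplet_loss(anchor, pos, neg, level_margin):
    """Triplet loss with a level-dependent margin (illustrative sketch).

    `level_margin` grows with label distance in the hierarchy: a negative
    from a different superclass must be pushed further away than a negative
    from a sibling fine-grained class.
    """
    d_pos = np.sum((anchor - pos) ** 2, axis=1)   # anchor-positive distances
    d_neg = np.sum((anchor - neg) ** 2, axis=1)   # anchor-negative distances
    return np.mean(np.maximum(0.0, d_pos - d_neg + level_margin))
```

Embeddings trained this way place coarse superclasses far apart and fine-grained siblings nearby, which is what enables retrieval at different levels of relevance.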

CVPR 2016 | Fine-grained Image Classification by Exploring Bipartite-Graph Labels
Feng Zhou, Yuanqing Lin

We exploit the rich relationships among fine-grained classes for fine-grained image classification. We model the relations using the proposed bipartite-graph labels (BGL) and incorporate them into CNN training. Our system is computationally efficient at inference thanks to the bipartite structure. We also construct a new food benchmark dataset consisting of 37,885 food images collected from 6 restaurants and 975 menus in total.

CVPR 2016 | Exploit All the Layers: Fast and Accurate CNN Object Detector with Scale Dependent Pooling and Cascaded Rejection Classifiers
Fan Yang, Wongun Choi, Yuanqing Lin

We propose a fast and accurate CNN object detector that exploits features from all convolutional layers. Scale-dependent pooling selects the convolutional feature layer appropriate to the scale of each object proposal, improving accuracy across object sizes. Cascaded rejection classifiers then use intermediate convolutional features to prune unlikely proposals early, substantially accelerating detection.

CVPR 2016 | SVBRDF-Invariant Shape and Reflectance Estimation from Light-Field Cameras
Ting-Chun Wang, Manmohan Chandraker, Alexei Efros, Ravi Ramamoorthi

We derive a spatially-varying (SV)BRDF-invariant theory for recovering 3D shape and reflectance from light-field cameras. Our key theoretical insight is a novel analysis of diffuse plus single-lobe SVBRDFs under a light-field setup. We show that, although direct shape recovery is not possible, an equation relating depths and normals can still be derived. Using this equation, we then propose using a polynomial (quadratic) shape prior to resolve the shape ambiguity. Once the shape is estimated, we also recover the reflectance. 

CVPR 2016 | WarpNet: Weakly Supervised Matching for Single-view Reconstruction
Angjoo Kanazawa, Manmohan Chandraker, David W. Jacobs

Our WarpNet matches images of objects in fine-grained datasets without using part annotations. It aligns an object in one image with a different object in another, by exploiting a fine-grained dataset to create artificial data for training a Siamese network with an unsupervised-discriminative learning approach. The output of the network acts as a spatial prior that allows generalization at test time to match real images across variations in appearance, viewpoint and articulation. This allows single-view reconstruction with quality comparable to using human annotation. 

PDF | Supplementary
ICML 2016 | Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units
Wenling Shang, Kihyuk Sohn, Diogo Almeida, Honglak Lee

We show that the first few convolution layers of a deep CNN with ReLU activations capture both negative and positive phase information by learning pairs or groups of negatively correlated filters, which implies a redundancy among these filters. We propose a simple yet effective activation scheme, concatenated ReLU (CReLU), that eliminates this redundancy and achieves better reconstruction and regularization properties.
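
The activation itself is small enough to sketch directly (a minimal numpy version for illustration): CReLU concatenates ReLU applied to the positive and the negated pre-activations, so both phases survive and the paired negatively correlated filters become unnecessary.

```python
import numpy as np

def crelu(x, axis=-1):
    """Concatenated ReLU: concat(ReLU(x), ReLU(-x)) along `axis`.

    Doubles the channel dimension but preserves both positive and negative
    phase information, letting the layer use half as many learned filters.
    """
    return np.concatenate([np.maximum(x, 0), np.maximum(-x, 0)], axis=axis)

print(crelu(np.array([[1.5, -2.0, 0.5]])))  # [[1.5 0.  0.5 0.  2.  0. ]]
```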

WACV 2016 | Atomic Scenes for Scalable Traffic Scene Recognition in Monocular Videos
Chao-Yeh Chen, Wongun Choi, Manmohan Chandraker

We propose a novel framework for monocular traffic scene recognition that relies on a decomposition into high-order and atomic scenes. High-order scenes carry semantic meaning useful for ADAS applications, while atomic scenes are easy to learn and represent elemental behaviors based on the 3D localization of individual traffic participants. We propose a novel hierarchical model that captures co-occurrence and mutual-exclusion relationships while incorporating both low-level trajectory features and high-level scene features, with parameters learned using a structured support vector machine. We further propose efficient inference that exploits the structure of our model to obtain real-time rates.