We conduct research in computer vision and machine learning, with a focus on sustaining excellence in three main directions: (1) scene understanding; (2) visual recognition and representation learning; and (3) adaptation, fairness and privacy. Key applications of our research include visual surveillance and autonomous driving. We tackle fundamental problems in computer vision, such as object detection, semantic segmentation, face recognition, 3D reconstruction and behavior prediction. We develop and leverage breakthroughs in deep learning, particularly with a flavor of weak supervision, metric learning and domain adaptation.

Scene Understanding

We understand latent information in complex images to make better predictions. Our solutions enable interpretability, intuitive visualization and prediction of multimodal future outcomes by reasoning about positions, dynamics, semantics, interactions and intents in 3D scenes.

ECCV 2020 | Image Stitching and Rectification for Hand-Held Cameras
Bingbing Zhuang, Quoc-Huy Tran

We derive a new differential homography that can account for the scanline-varying camera poses in Rolling Shutter (RS) cameras, and demonstrate its application to carry out RS-aware image stitching and rectification at one stroke. Despite the high complexity of RS geometry, we focus in this paper on a special yet common input — two consecutive frames from a video stream, wherein the interframe motion is restricted from being arbitrarily large. 

ECCV 2020 | Learning Monocular Visual Odometry via Self-Supervised Long-Term Modeling
Yuliang Zou, Pan Ji, Quoc-Huy Tran, Jia-Bin Huang, Manmohan Chandraker

Monocular visual odometry (VO) suffers severely from error accumulation during frame-to-frame pose estimation. In this paper, we present a self-supervised learning method for VO with special consideration for consistency over longer sequences. To this end, we model the long-term dependency in pose prediction using a pose network that features a two-layer convolutional LSTM module. 

ECCV 2020 | Pseudo RGB-D for Self-Improving Monocular SLAM and Depth Prediction
Lokender Tiwari, Pan Ji, Quoc-Huy Tran, Bingbing Zhuang, Saket Anand, Manmohan Chandraker

Classical monocular Simultaneous Localization And Mapping (SLAM) and the recently emerging convolutional neural networks (CNNs) for monocular depth prediction represent two largely disjoint approaches towards building a 3D map of the surrounding environment. In this paper, we demonstrate that coupling the two, by leveraging the strengths of each, mitigates the other's shortcomings.

ECCV 2020 | SMART: Simultaneous Multi-Agent Recurrent Trajectory Prediction
Sriram N N, Buyu Liu, Francesco Pittaluga, Manmohan Chandraker

We propose advances that address two key challenges in future trajectory prediction: (i) multimodality in both training data and predictions and (ii) constant-time inference regardless of the number of agents. Existing trajectory predictions are fundamentally limited by a lack of diversity in training data, which is difficult to acquire with sufficient coverage of possible modes.

CVPR 2020 | Peek-a-boo: Occlusion Reasoning in Indoor Scenes with Plane Representations
Ziyu Jiang, Buyu Liu, Samuel Schulter, Zhangyang Wang, Manmohan Chandraker

We address the challenging task of occlusion-aware indoor 3D scene understanding. We represent scenes by a set of planes, where each one is defined by its normal, offset and two masks outlining (i) the extent of the visible part and (ii) the full region that consists of both visible and occluded parts of the plane. We infer these planes from a single input image with a novel neural network architecture. It consists of a two-branch category-specific module that aims to predict layout and objects of the scene separately so that different types of planes can be handled better. We also introduce a novel loss function based on plane warping that can leverage multiple views at training time for improved occlusion-aware reasoning. 

CVPR 2020 | Understanding Road Layout from Videos as a Whole
Buyu Liu, Bingbing Zhuang, Samuel Schulter, Pan Ji, Manmohan Chandraker

We address the problem of inferring the layout of complex road scenes from video sequences. To this end, we formulate it as a top-view road attributes prediction problem and our goal is to predict these attributes for each frame both accurately and consistently. In contrast to prior work, we exploit the following three novel aspects: leveraging camera motions in videos, including context cues and incorporating long-term video information. Specifically, we introduce a model that aims to enforce prediction consistency in videos. 

CVPR 2019 | A Parametric Top-View Representation of Complex Road Scenes
Ziyan Wang, Buyu Liu, Samuel Schulter, Manmohan Chandraker

We address the problem of inferring the layout of complex road scenes given a single camera as input. To achieve that, we first propose a novel parameterized model of road layouts in a top-view representation, which is not only intuitive for human visualization but also provides an interpretable interface for higher-level decision making. Moreover, the design of our top-view scene model allows for efficient sampling and thus generation of large-scale simulated data, which we leverage to train a deep neural network to infer our scene model's parameters. Finally, we design a Conditional Random Field (CRF) that enforces coherent predictions for a single frame and encourages temporal smoothness among video frames.

CVPR 2019 | Structure-And-Motion-Aware Rolling Shutter Correction
Bingbing Zhuang, Quoc-Huy Tran, Pan Ji, Loong Fah Cheong, Manmohan Chandraker

In this paper, we first make a theoretical contribution by proving that RS two-view geometry is degenerate in the case of pure translational camera motion. In view of the complex RS geometry, we then propose a Convolutional Neural Network-based method which learns the underlying geometry (camera motion and scene structure) from just a single RS image and performs RS image correction. We propose a geometrically meaningful way to synthesize large-scale training data and identify a geometric ambiguity that arises for training.

ICLR 2019 | Learning to Simulate
Nataniel Ruiz, Samuel Schulter, Manmohan Chandraker

Simulation can be a useful tool when obtaining and annotating training data is costly. However, optimal tuning of simulator parameters itself can be a laborious task. We implement a meta-learning algorithm in which a reinforcement learning agent, as the meta-learner, automatically adjusts the parameters of a non-differentiable simulator, thereby controlling the distribution of synthesized data in order to maximize the accuracy of a model trained on that data.

ICCV 2019 | GLoSH: Global-Local Spherical Harmonics for Intrinsic Image Decomposition
Hao Zhou, Xiang Yu, David Jacobs

Traditional intrinsic image decomposition focuses on decomposing images into reflectance and shading, leaving surface normals and lighting entangled in shading. In this work, we propose a Global-Local Spherical Harmonics (GLoSH) lighting model to improve the lighting component, and jointly predict reflectance and surface normals. The global SH models the holistic lighting while local SH accounts for the spatial variation of lighting. Also, a novel non-negative lighting constraint is proposed to encourage the estimated SH to be physically meaningful.
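As a rough illustration of how spherical harmonics (SH) lighting can be evaluated, the sketch below computes the standard second-order real SH basis (9 terms) at a unit surface normal and takes its dot product with a coefficient vector. It is only a sketch under stated assumptions: the function names (sh_basis, shade, is_nonnegative) are hypothetical, the Lambertian cosine-lobe factors are omitted, the non-negativity test is a simple sampled check rather than the constraint used in the paper, and the way global and local SH are combined in GLoSH is not modeled here.

```python
import math

def sh_basis(normal):
    """Second-order real SH basis (9 values) evaluated at a unit normal (x, y, z)."""
    x, y, z = normal
    return [
        0.282095,                        # band 0 (constant / ambient term)
        0.488603 * y,                    # band 1
        0.488603 * z,
        0.488603 * x,
        1.092548 * x * y,                # band 2
        1.092548 * y * z,
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z,
        0.546274 * (x * x - y * y),
    ]

def shade(normal, coeffs):
    """Evaluate the SH lighting function in the direction of the normal."""
    return sum(c * b for c, b in zip(coeffs, sh_basis(normal)))

def is_nonnegative(coeffs, steps=16):
    """Sampled check (an assumption, not the paper's constraint) that the
    lighting function is >= 0 over a latitude/longitude grid of directions."""
    for i in range(steps + 1):
        theta = math.pi * i / steps            # polar angle
        for j in range(2 * steps):
            phi = math.pi * j / steps          # azimuth
            d = (math.sin(theta) * math.cos(phi),
                 math.sin(theta) * math.sin(phi),
                 math.cos(theta))
            if shade(d, coeffs) < -1e-9:
                return False
    return True

# Hypothetical coefficient vectors for illustration only.
ambient = [1.0] + [0.0] * 8                    # constant lighting
directional = [0.0, 0.0, 1.0] + [0.0] * 6      # pure z-lobe, negative for z < 0
```

In this toy setting the purely ambient lighting passes the sampled check, while the pure first-order lobe does not, which hints at why a non-negativity constraint can rule out physically implausible SH estimates.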

PAMI 2019 | Deep Supervision with Intermediate Concepts
Chi Li, M. Zeeshan Zia, Quoc-Huy Tran, Xiang Yu, Gregory D. Hager, Manmohan Chandraker

We propose an approach for injecting prior domain structure into CNN training by supervising hidden layers with intermediate concepts. We formulate a probabilistic framework that predicts improved generalization through our deep supervision. This allows training only from synthetic CAD renderings where concept values can be extracted, while achieving generalization to real images. We obtain state-of-the-art performances on 2D and 3D keypoint localization, instance segmentation and image classification, outperforming alternative forms of supervision such as multi-task training. 

IROS 2019 | Degeneracy in Self-Calibration Revisited and a Deep Learning Solution for Uncalibrated SLAM
Bingbing Zhuang, Quoc-Huy Tran, Gim Hee Lee, Loong Fah Cheong, Manmohan Chandraker

We first revisit the geometric approach to radial distortion self-calibration, and provide a proof that explicitly shows the ambiguity between radial distortion and scene depth under forward camera motion. In view of such geometric degeneracy and the prevalence of forward motion in practice, we further propose a learning approach that trains a convolutional neural network on a large amount of synthetic data to estimate the camera parameters, and show its application to SLAM without knowing camera parameters a priori.

IROS 2019 | Learning 2D to 3D Lifting for Object Detection in 3D for Autonomous Vehicles
Siddharth Srivastava, Frederic Jurie, Gaurav Sharma

We address the problem of 3D object detection from 2D monocular images in autonomous driving scenarios. We lift the 2D images to 3D representations using learned neural networks and leverage existing networks working directly on 3D data to perform 3D object detection and localization. We show that, with a carefully designed training mechanism and automatically selected minimally noisy data, such a method is not only feasible, but also outperforms many methods working on actual 3D inputs acquired from physical sensors.

ECCV 2018 | Hierarchical Metric Learning and Matching for 2D and 3D Geometric Correspondences
Mohammed E. Fathy, Quoc-Huy Tran, M. Zeeshan Zia, Paul Vernaza, Manmohan Chandraker

While a metric loss applied to the deepest layer of a CNN is expected to yield ideal features, the growing receptive field and striding effects cause shallower features to be better at high precision matching. We leverage this insight along with hierarchical supervision to learn more effective descriptors for geometric matching. We evaluate for 2D and 3D geometric matching as well as optical flow, demonstrating state-of-the-art results and generalization across multiple datasets. 

ECCV 2018 | Learning to Look around Objects for Top-View Representations of Outdoor Scenes
Samuel Schulter, Menghua Zhai, Nathan Jacobs, Manmohan Chandraker

We propose a convolutional neural network that learns to predict occluded portions of the scene layout by looking around foreground objects like cars or pedestrians. But instead of hallucinating RGB values, we show that directly predicting the semantics and depths in the occluded areas enables a better transformation into the top-view. We further show that this initial top-view representation can be significantly enhanced by learning priors and rules about typical road layouts from simulated or, if available, map data. Crucially, training our model does not require costly or subjective human annotations for occluded areas or the top-view, but rather uses readily available annotations for standard semantic segmentation. 

ECCV 2018 | R2P2: A Reparameterized Pushforward Policy for Diverse, Precise Generative Path Forecasting
Nicholas Rhinehart, Kris M. Kitani, Paul Vernaza

We propose a method to forecast a vehicle’s ego-motion as a distribution over spatiotemporal paths, conditioned on features embedded in an overhead map. The method learns a policy inducing a distribution over simulated trajectories that is both “diverse” (produces most paths likely under the data) and “precise” (mostly produces paths likely under the data). We achieve this balance through minimization of a symmetrized cross-entropy between the distribution and demonstration data. 
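To make the idea of a symmetrized cross-entropy concrete, the following minimal sketch evaluates it on small discrete distributions over three candidate paths. It is an illustration only: the function names, the weight beta, the eps smoothing and the toy distributions are assumptions, and the method itself works with continuous distributions over trajectories rather than discrete tables.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i); eps avoids log(0)."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

def symmetrized_cross_entropy(data, model, beta=1.0):
    """H(data, model) penalizes a model that misses modes of the data;
    H(model, data) penalizes a model that puts mass where the data has none."""
    return cross_entropy(data, model) + beta * cross_entropy(model, data)

# Toy distributions over three candidate paths (hypothetical numbers).
data = [0.5, 0.5, 0.0]              # two equally likely paths, third never observed
balanced = [0.5, 0.5, 0.0]          # covers both modes and nothing else
mode_dropping = [0.98, 0.01, 0.01]  # nearly ignores the second mode
spread = [1 / 3, 1 / 3, 1 / 3]      # puts a third of its mass on the unobserved path

scores = {
    "balanced": symmetrized_cross_entropy(data, balanced),
    "mode_dropping": symmetrized_cross_entropy(data, mode_dropping),
    "spread": symmetrized_cross_entropy(data, spread),
}
```

In this toy example the balanced model receives the lowest score: the forward term is what penalizes the model that drops a mode, and the reverse term is what penalizes the model that spreads mass onto the unobserved path.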

CVPR 2018 | Fast and Accurate Online Video Object Segmentation via Tracking Parts
Jingchun Cheng, Yi-Hsuan Tsai, Wei-Chih Hung, Shengjin Wang, Ming-Hsuan Yang

We propose a fast and accurate video object segmentation algorithm that can start the segmentation process immediately upon receiving the images. We first utilize a part-based tracking method to deal with challenging factors such as large deformation, occlusion, and cluttered background. Second, we construct an efficient region-of-interest segmentation network to generate part masks, with a similarity-based scoring function to refine these object parts and generate final segmentation outputs.

CVPR 2018 | Learning to Adapt Structured Output Space for Semantic Segmentation
Yi-Hsuan Tsai, Wei-Chih Hung, Samuel Schulter, Kihyuk Sohn, Ming-Hsuan Yang, Manmohan Chandraker

We develop a semantic segmentation method for adapting source ground truth labels to the unseen target domain. To achieve this, we consider semantic segmentation as structured prediction with spatial similarities between the source and target domains, and then adopt multi-level adversarial learning in the output space. We show that our method can perform adaptation under various settings, including synthetic-to-real and cross-city scenarios.

ACCV 2018 | Unseen Object Segmentation in Videos via Transferable Representations
Yi-Wen Chen, Yi-Hsuan Tsai, Chu-Ya Yang, Yen-Yu Lin, Ming-Hsuan Yang

We exploit existing annotations in source images and transfer such visual information to segment videos with unseen object categories. Without using any annotations in the target video, we propose a method to jointly mine useful segments and learn feature representations that better adapt to the target frames. The entire process is decomposed into two tasks: 1) solving a submodular function for selecting object-like segments, and 2) learning a CNN model with a transferable module for adapting seen categories in the source domain to the unseen target video. We present an iterative update scheme between the two tasks to self-learn the final solution for object segmentation.

ICCV 2017 | SegFlow: Joint Learning for Video Object Segmentation and Optical Flow
Jingchun Cheng, Yi-Hsuan Tsai, Shengjin Wang, Ming-Hsuan Yang

We propose an end-to-end trainable network, SegFlow, for simultaneously predicting pixel-wise object segmentation and optical flow in videos. The proposed SegFlow has two branches where useful information of object segmentation and optical flow is propagated bidirectionally in a unified framework. The unified framework can be trained iteratively offline to learn a generic notion, or fine-tuned online for specific objects. 

ICCV 2017 | Scene Parsing with Global Context Embedding
Wei-Chih Hung, Yi-Hsuan Tsai, Xiaohui Shen, Zhe Lin, Kalyan Sunkavalli, Xin Lu, Ming-Hsuan Yang

We present a scene parsing method that utilizes global context information based on both the parametric and non-parametric models. Compared to previous methods that only exploit the local relationship between objects, we train a context network based on scene similarities to generate feature representations for global contexts. We show that the proposed method can eliminate false positives that are not compatible with the global context representations. 

CVPR 2017 | Learning random-walk label propagation for weakly-supervised semantic segmentation
Paul Vernaza, Manmohan Chandraker

Large-scale training for semantic segmentation is challenging due to the expense of obtaining training data. Given cheaply obtained sparse image labelings, we propagate the sparse labels to produce guessed dense labelings using random-walk hitting probabilities, which leads to a differentiable parameterization with uncertainty estimates that are incorporated into our loss. We show that our method can effectively learn semantic edges given no direct edge supervision. 
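The propagation step can be pictured with a toy random walk on a small graph: for each unlabeled node, the probability that a walk started there first hits a seed of a given label serves as a soft label. The sketch below solves for these hitting probabilities by simple iterative averaging. It is a hedged illustration only: the function name, the hand-set edge weights and the 5-node chain are assumptions, whereas the method described above predicts the weights from the image and differentiates through the propagation.

```python
def hitting_probabilities(weights, seeds, num_labels, iters=2000):
    """weights: {(i, j): w} undirected edge weights; seeds: {node: label}.
    Returns {node: [P(walk from node first hits a seed with label k)]}."""
    nodes = sorted({n for edge in weights for n in edge})
    nbrs = {n: [] for n in nodes}
    for (i, j), w in weights.items():
        nbrs[i].append((j, w))
        nbrs[j].append((i, w))

    # Seeds are absorbing states with one-hot labels; others start uniform.
    probs = {n: [1.0 / num_labels] * num_labels for n in nodes}
    for n, label in seeds.items():
        probs[n] = [1.0 if k == label else 0.0 for k in range(num_labels)]

    # Gauss-Seidel sweeps: each unlabeled node becomes the weighted
    # average of its neighbours, which is the hitting-probability equation.
    for _ in range(iters):
        for n in nodes:
            if n in seeds:
                continue
            total = sum(w for _, w in nbrs[n])
            probs[n] = [
                sum(w * probs[m][k] for m, w in nbrs[n]) / total
                for k in range(num_labels)
            ]
    return probs

# A 5-pixel chain with a weak edge between pixels 2 and 3 (a hypothetical
# semantic boundary); pixel 0 is labeled class 0 and pixel 4 class 1.
chain = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 0.1, (3, 4): 1.0}
probs = hitting_probabilities(chain, seeds={0: 0, 4: 1}, num_labels=2)
```

Because the weak edge is rarely crossed, pixels 1 and 2 end up mostly assigned to class 0 and pixel 3 to class 1, and the intermediate probabilities give a natural per-pixel uncertainty estimate.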

CVPR 2017 | DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents
Namhoon Lee, Wongun Choi, Paul Vernaza, Christopher B. Choy, Philip H. S. Torr, Manmohan Chandraker

We introduce a Deep Stochastic IOC RNN Encoder-decoder framework, DESIRE, for the task of future prediction of multiple interacting agents in dynamic scenes. It produces accurate future predictions by tackling multi-modality of futures while accounting for a rich set of both static and dynamic scene contexts. It generates a diverse set of hypothetical prediction samples, and then ranks and refines them through a deep IOC network.

CVPR 2017 | Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object Parsing
Chi Li, Zeeshan Zia, Quoc-Huy Tran, Xiang Yu, Gregory D. Hager, Manmohan Chandraker

We propose a deep CNN architecture to localize object semantic parts in 2D image and 3D space while inferring their visibility states, given a single RGB image. We exploit domain knowledge to regularize the network by deeply supervising its hidden layers, in order to sequentially infer a causal sequence of intermediate concepts. We render 3D object CAD models to generate large-scale synthetic data and simulate challenging occlusion configurations between objects. The utility of our deep supervision is demonstrated by state-of-the-art performances on real image benchmarks for 2D and 3D keypoint localization and instance segmentation. 

CVPR 2017 | Deep Network Flow for Multi-Object Tracking
Samuel Schulter, Paul Vernaza, Wongun Choi, Manmohan Chandraker

We demonstrate that it is possible to learn features for network-flow-based data association via backpropagation, by expressing the optimum of a smoothed network flow problem as a differentiable function of the pairwise association costs. We apply this approach to multi-object tracking with a network flow formulation. Our experiments demonstrate that we are able to successfully learn all cost functions for the association problem in an end-to-end fashion, which outperform hand-crafted costs in all settings. The integration and combination of various sources of inputs become easy and the cost functions can be learned entirely from data, alleviating tedious hand-designing of costs. 

NeurIPS 2016 | Universal Correspondence Network
Christopher B. Choy, JunYoung Gwak, Silvio Savarese, Manmohan Chandraker

We present deep metric learning to obtain a feature space that preserves geometric or semantic similarity. Our visual correspondences span across rigid motions to intra-class shape or appearance variations. Our fully convolutional architecture, along with a novel correspondence contrastive loss, allows faster training by effective reuse of computations, accurate gradient computation and linear time testing instead of quadratic time for typical patch similarity methods. We propose a convolutional spatial transformer to mimic patch normalization in traditional features like SIFT.
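A contrastive loss over point correspondences can be sketched as follows: features at matching locations are pulled together, while features at non-matching locations are pushed apart until they are at least a margin away. This is a minimal sketch of the standard contrastive form, assuming per-correspondence feature vectors are already extracted; the function name, the margin value and the toy features are hypothetical, and the exact formulation used in the paper may differ.

```python
import math

def correspondence_contrastive_loss(feats_a, feats_b, is_match, margin=1.0):
    """feats_a[i], feats_b[i]: feature vectors at the i-th pair of locations.
    is_match[i]: True for a correct correspondence, False for a negative pair."""
    total = 0.0
    for fa, fb, match in zip(feats_a, feats_b, is_match):
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(fa, fb)))
        if match:
            total += dist ** 2                       # pull positives together
        else:
            total += max(0.0, margin - dist) ** 2    # push negatives past the margin
    return total / (2 * len(is_match))

# Toy features: one positive pair at distance 1.0, one negative pair at 0.5.
feats_a = [[0.0, 0.0], [0.0, 0.0]]
feats_b = [[0.6, 0.8], [0.3, 0.4]]
loss = correspondence_contrastive_loss(feats_a, feats_b, [True, False])
```

Since each term depends only on features at the sampled locations, many correspondences in the same image pair can share one forward pass of a fully convolutional network, which is consistent with the computation reuse mentioned above.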

ECCV 2016 | Deep Deformation Network for Object Landmark Localization
Xiang Yu, Feng Zhou, Manmohan Chandraker

We propose a cascaded framework for localizing landmarks in non-rigid objects. The first stage initializes the shape as constrained to lie within a low-rank manifold, and the second stage estimates local deformations parameterized as thin-plate spline transformations. Since our framework does not incorporate either handcrafted features or part connectivity, it is easy to train and test, and generally applicable to various object types. 

CVPR 2016 | A Continuous Occlusion Model for Road Scene Understanding
Vikas Dhiman, Quoc-Huy Tran, Jason Corso, Manmohan Chandraker

We present a physically interpretable 3D model for handling occlusions with applications to road scene understanding. Given object detection and SFM point tracks, our unified model probabilistically assigns point tracks to objects and reasons about object detection scores and bounding boxes. It uniformly handles static and dynamic objects and thus outperforms motion segmentation for association problems. Furthermore, we demonstrate occlusion-aware 3D localization in road scenes.

CVPR 2016 | WarpNet: Weakly Supervised Matching for Single-view Reconstruction
Angjoo Kanazawa, Manmohan Chandraker, David W. Jacobs

Our WarpNet matches images of objects in fine-grained datasets without using part annotations. It aligns an object in one image with a different object in another, by exploiting a fine-grained dataset to create artificial data for training a Siamese network with an unsupervised-discriminative learning approach. The output of the network acts as a spatial prior that allows generalization at test time to match real images across variations in appearance, viewpoint and articulation. This allows single-view reconstruction with quality comparable to using human annotation. 
