Media Analytics | Manmohan Chandraker

Manmohan Chandraker

Department Head

Media Analytics

Projects

Embodied AI

Overview: We develop embodied agents for robotics applications that require exploration, navigation and transport in complex scenes. Our modular hierarchical transport policy builds a topological graph of the scene to perform exploration, then combines motion planning algorithms to reach point goals within explored locations with object navigation policies for moving towards semantic targets at unknown locations.

Exploiting Unlabeled Data with Vision and Language Models for Object Detection

Overview: We propose a simple but effective way to mine unlabeled images using recently proposed vision and language (V&L) models to generate pseudo labels for both known and novel categories, which suits both semi-supervised object detection (SSOD) and open-vocabulary detection (OVD).

Publications

iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning

December 2, 2025/The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)

Grounding large language models (LLMs) in domain-specific tasks like post-hoc dash-cam driving video analysis is challenging due to their general-purpose training and lack of structured inductive biases. As vision is often the sole modality available for such analysis (i.e., no LiDAR, GPS, etc.), existing

LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation

October 19, 2025/ICCV 2025

Evaluating autonomous vehicles with controllability enables scalable testing in counterfactual or structured settings, enhancing both efficiency and safety. We introduce LangTraj, a language-conditioned scene-diffusion model that simulates the joint behavior of all agents in traffic scenarios. By conditioning

DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning

October 19, 2025/ICCV 2025

Visual reasoning (VR), which is crucial in many fields for enabling human-like visual understanding, remains highly challenging. Recently, compositional visual reasoning approaches, which leverage the reasoning abilities of large language models (LLMs) with integrated tools to solve problems, have shown

AutoScape: Geometry-Consistent Long-Horizon Scene Generation

October 19, 2025/ICCV 2025

This paper proposes AutoScape, a long-horizon driving scene generation framework. At its core is a novel RGB-D diffusion model that iteratively generates sparse, geometrically consistent keyframes, serving as reliable anchors for the scene's appearance and geometry. To maintain long-range geometric consistency,

iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning

October 1, 2025/https://arxiv.org

Grounding large language models (LLMs) in domain-specific tasks like post-hoc dash-cam driving video analysis is challenging due to their general-purpose training and lack of structured inductive biases. As vision is often the sole modality available for such analysis (i.e., no LiDAR, GPS, etc.), existing

NEC Labs America Joins CS3 Advisory Board to Advance Smart Streetscapes

May 19, 2025

NEC Laboratories America has joined the Center for Smart Streetscapes (CS3) Advisory Board, a National Science Foundation–funded initiative advancing urban innovation through technology, data, and design. As a leader in AI, computer vision, and edge computing, NEC Labs America will collaborate with

Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation

April 24, 2025/ICLR 2025

A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions. With efficiency being a high priority for scaling such models, we observed that the state-of-the-art method Mask2Former uses >50% of its compute

ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models

January 17, 2025/arXiv

Spatio-temporal reasoning is essential in understanding real-world environments in various fields, e.g., autonomous driving and sports analytics. Recent advances have improved the spatial reasoning ability of Vision-Language Models (VLMs) by introducing large-scale data, but these models still struggle

Drive-1-to-3: Enriching Diffusion Priors for Novel View Synthesis of Real Vehicles

December 19, 2024/arXiv

The recent advent of large-scale 3D data, e.g., Objaverse, has led to impressive progress in training pose-conditioned diffusion models for novel view synthesis. However, due to the synthetic nature of such 3D data, their performance drops significantly when applied to real-world images. This paper consolidates

Safe-Sim: Safety-Critical Closed-Loop Traffic Simulation with Diffusion-Controllable Adversaries

September 29, 2024/The 18th European Conference on Computer Vision ECCV 2024

Evaluating the performance of autonomous vehicle planning algorithms necessitates simulating long-tail safety-critical traffic scenarios. However, traditional methods for generating such scenarios often fall short in terms of controllability and realism; they also neglect the dynamics of agent interactions.

OPENCAM: Lensless Optical Encryption Camera

September 5, 2024/IEEE Transactions on Computational Imaging

Lensless cameras multiplex the incoming light before it is recorded by the sensor. This ability to multiplex the incoming light has led to the development of ultra-thin, high-speed, and single-shot 3D imagers. Recently, there have been various attempts at demonstrating another useful aspect of lensless

Foundational Vision-LLM for AI Linkage and Orchestration

July 3, 2024/NEC Technical Journal, Special Issue on Revolutionizing Business Practices with Generative AI

We propose a vision-LLM framework for automating development and deployment of computer vision solutions for pre-defined or custom-defined tasks. A foundational layer is proposed with a code-LLM AI orchestrator self-trained with reinforcement learning to create Python code based on its understanding

Taming Self-Training for Open-Vocabulary Object Detection

June 17, 2024/CVPR2024

Recent studies have shown promising performance in open-vocabulary object detection (OVD) by utilizing pseudo labels (PLs) from pretrained vision and language models (VLMs). However, teacher-student self-training, a powerful and widely used paradigm to leverage PLs, is rarely explored for OVD.

Improving the Efficiency-Accuracy Trade-off of DETR-Style Models in Practice

June 17, 2024/The 7th Workshop on Efficient Deep Learning for Computer Vision at CVPR 2024

This report aims to provide a comprehensive view on the inference efficiency of DETR-style detection models. We report the effect of basic efficiency techniques and identify the factors that are easily applicable yet effectively improve the efficiency-accuracy trade-off. Specifically, we explore

LidaRF: Delving into Lidar for Neural Radiance Field on Street Scenes

June 17, 2024/CVPR2024

Photorealistic simulation plays a crucial role in applications such as autonomous driving, where advances in neural radiance fields (NeRFs) may allow better scalability through the automatic creation of digital 3D assets. However, reconstruction quality suffers on street scenes due to largely collinear

Instantaneous Perception of Moving Objects in 3D

June 17, 2024/CVPR2024

The perception of 3D motion of surrounding traffic participants is crucial for driving safety. While existing works primarily focus on general large motions, we contend that the instantaneous detection and quantification of subtle motions is equally important as they indicate the nuances in driving behavior

Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement

June 17, 2024/CVPR2024

Visual program synthesis is a promising approach to exploit the reasoning abilities of large language models for compositional computer vision tasks. Previous work has used few-shot prompting with frozen LLMs to synthesize visual programs. Training an LLM to write better visual programs is an attractive

AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving

June 17, 2024/CVPR2024

Autonomous vehicle (AV) systems rely on robust perception models as a cornerstone of safety assurance. However, objects encountered on the road exhibit a long-tailed distribution, with rare or unseen categories posing challenges to a deployed perception model. This necessitates an expensive process of

Generating Enhanced Negatives for Training Language-Based Object Detectors

June 16, 2024/CVPR2024

The recent progress in language-based open-vocabulary object detection can be largely attributed to finding better ways of leveraging large-scale data with free-form text annotations. Training such models with a discriminative objective function has proven successful, but requires good positive and negative

Long-HOT: A Modular Hierarchical Approach for Long-Horizon Object Transport

May 13, 2024/ICRA 24, PACIFICO Yokohama, Japan & CVPR2024 Seattle, WA

We aim to address key challenges in long-horizon embodied exploration and navigation by proposing a long-horizon object transport task called Long-HOT and a novel modular framework for temporally extended navigation. Agents in Long-HOT need to efficiently find and pick up target objects that are scattered

Efficient Transformer Encoders for Mask2Former-style Models

April 16, 2024/https://arxiv.org

Vision transformer based models bring significant improvements for image segmentation tasks. Although these architectures offer powerful capabilities irrespective of specific segmentation tasks, their use of computational resources can be taxing on deployed devices. One way to overcome this challenge

Improving Language-Based Object Detection by Explicit Generation of Negative Examples

December 21, 2023/https://arxiv.org

The recent progress in language-based object detection with an open vocabulary can be largely attributed to finding better ways of leveraging large-scale data with free-form text annotations. Training from image captions with grounded bounding boxes (ground truth or pseudo-labeled) enables the models

Exploring Question Decomposition for Zero-Shot VQA

December 11, 2023/NeurIPS 2023

Visual question answering (VQA) has traditionally been treated as a single-step task where each question receives the same amount of effort, unlike natural human question-answering strategies. We explore a question decomposition strategy for VQA to overcome this limitation. We probe the ability of recently

OpEnCam: Optical Encryption Camera

December 8, 2023/https://arxiv.org

Lensless cameras multiplex the incoming light before it is recorded by the sensor. This ability to multiplex the incoming light has led to the development of ultra-thin, high-speed, and single-shot 3D imagers. Recently, there have been various attempts at demonstrating another useful aspect of lensless

LLM-ASSIST: Enhancing Closed-Loop Planning with Language-Based Reasoning

December 8, 2023/https://arxiv.org

Although planning is a crucial component of the autonomous driving stack, researchers have yet to develop robust planning algorithms that are capable of safely handling the diverse range of possible driving scenarios. Learning-based planners suffer from overfitting and poor long-tail performance. On

Controllable Safety-Critical Closed-Loop Traffic Simulation via Guided Diffusion

December 8, 2023/https://arxiv.org

Evaluating the performance of autonomous vehicle planning algorithms necessitates simulating long-tail traffic scenarios. Traditional methods for generating safety-critical scenarios often fall short in realism and controllability. Furthermore, these techniques generally neglect the dynamics of agent

Efficient Controllable Multi-Task Architectures

October 2, 2023/ICCV 2023

We aim to train a multi-task model such that users can adjust the desired compute budget and relative importance of task performances after deployment, without retraining. This enables optimizing performance for dynamically varying user needs, without heavy computational overhead to train and save models

Domain Generalization Guided by Gradient Signal to Noise Ratio of Parameters

October 2, 2023/ICCV 2023

Overfitting to the source domain is a common issue in gradient-based training of deep neural networks. To compensate for the over-parameterized models, numerous regularization techniques have been introduced such as those based on dropout. While these methods achieve significant improvements on classical

Improving Pseudo Labels for Open-Vocabulary Object Detection

August 2, 2023/https://arxiv.org

Recent studies show promising performance in open-vocabulary object detection (OVD) using pseudo labels (PLs) from pretrained vision and language models (VLMs). However, PLs generated by VLMs are extremely noisy due to the gap between the pretraining objective of VLMs and OVD, which blocks further advances

Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!

June 18, 2023/CVPR 2023

Finetuning a large vision language model (VLM) on a target dataset after large scale pretraining is a dominant paradigm in visual question answering (VQA). Datasets for specialized tasks such as knowledge-based VQA or VQA in non natural-image domains are orders of magnitude smaller than those for general-purpose

NeurOCS: Neural NOCS Supervision for Monocular 3D Object Localization

June 18, 2023/CVPR 2023

Monocular 3D object localization in driving scenes is a crucial task, but challenging due to its ill-posed nature. Estimating 3D coordinates for each pixel on the object surface holds great potential as it provides dense 2D-3D geometric constraints for the underlying PnP problem. However, high-quality

Split to Learn: Gradient Split for Multi-Task Human Image Analysis

January 3, 2023/WACV23

This paper presents an approach to train a unified deep network that simultaneously solves multiple human-related tasks. A multi-task framework is favorable for sharing information across tasks under restricted computational resources. However, tasks not only share information but may also compete for

Single-Stream Multi-level Alignment for Vision-Language Pretraining

October 24, 2022/ECCV 2022

Self-supervised vision-language pretraining from pure images and text with a contrastive loss is effective, but ignores fine-grained alignment due to a dual-stream architecture that aligns image and text representations only on a global level. Earlier, supervised, non-contrastive methods were capable

Exploiting Unlabeled Data with Vision and Language Models for Object Detection

October 24, 2022/ECCV 2022

Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets. However, it is prohibitively costly to acquire annotations for thousands of categories at a large scale. We propose a novel method that leverages the rich semantics available

Learning Phase Mask for Privacy-Preserving Passive Depth Estimation

October 24, 2022/ECCV 2022

With over a billion sold each year, cameras are not only becoming ubiquitous, but are driving progress in a wide range of domains such as mixed reality, robotics, and more. However, severe concerns regarding the privacy implications of camera-based solutions currently limit the range of environments

Learning Semantic Segmentation from Multiple Datasets with Label Shifts

October 24, 2022/ECCV 2022

While it is desirable to train segmentation models on an aggregation of multiple datasets, a major challenge is that the label space of each dataset may be in conflict with one another. To tackle this challenge, we propose UniSeg, an effective and model-agnostic approach to automatically train segmentation

Weakly But Deeply Supervised Occlusion-Reasoned Parametric Road Layouts

June 19, 2022/CVPR'22

We propose an end-to-end network that takes a single perspective RGB image of a complex road scene as input, to produce occlusion-reasoned layouts in perspective space as well as a parametric bird’s-eye-view (BEV) space. In contrast to prior works that require dense supervision such as semantic labels

Controllable Dynamic Multi-Task Architectures

June 19, 2022/CVPR'22

Multi-task learning commonly encounters competition for resources among tasks, specifically when model capacity is limited. This challenge motivates models which allow control over the relative importance of tasks and total compute cost during inference time. In this work, we propose such a controllable

Learning to Learn across Diverse Data Biases in Deep Face Recognition

June 19, 2022/CVPR’22

Convolutional Neural Networks have achieved remarkable success in face recognition, in part due to the abundant availability of data. However, the data used for training CNNs is often imbalanced. Prior works largely focus on the long-tailed nature of face datasets in data volume per identity or focus

On Generalizing Beyond Domains in Cross-Domain Continual Learning

June 19, 2022/CVPR'22

Humans have the ability to accumulate knowledge of new tasks in varying conditions, but deep neural networks often suffer from catastrophic forgetting of previously learned knowledge after learning a new task. Many recent methods focus on preventing catastrophic forgetting under the assumption of train

Learning Cross-Modal Contrastive Features for Video Domain Adaptation

October 11, 2021/ICCV 2021, Virtual

Learning transferable and domain adaptive feature representations from videos is important for video-relevant tasks such as action recognition. Existing video domain adaptation methods mainly rely on adversarial feature alignment, which has been derived from the RGB image space. However, video data is

Fusing the Old with the New: Learning Relative Pose with Geometry-Guided Uncertainty

June 19, 2021/CVPR 2021, Virtual

Learning methods for relative camera pose estimation have been developed largely in isolation from classical geometric approaches. The question of how to integrate predictions from deep neural networks (DNNs) and solutions from geometric solvers, such as the 5-point algorithm [37], has as yet remained

Divide-and-Conquer for Lane-Aware Diverse Trajectory Prediction

June 19, 2021/CVPR 2021, Virtual

Trajectory prediction is a safety-critical tool for autonomous vehicles to plan and execute actions. Our work addresses two key challenges in trajectory prediction: learning multimodal outputs, and improving predictions by imposing constraints using driving knowledge. Recent methods have achieved strong

Cross-Domain Similarity Learning for Face Recognition in Unseen Domains

June 19, 2021/CVPR 2021, Virtual

Face recognition models trained under the assumption of identical training and test distributions often suffer from poor generalization when faced with unknown variations, such as a novel ethnicity or unpredictable individual make-ups during test time. In this paper, we introduce a novel cross-domain

Uncertainty Aware Physically Guided Proxy Tasks for Unseen Domain Face Anti-Spoofing

November 20, 2020/arXiv

Face anti-spoofing (FAS) seeks to discriminate genuine faces from fake ones arising from any type of spoofing attack. Due to the wide variety of attacks, it is implausible to obtain training data that spans all attack types. We propose to leverage physical cues to attain better generalization on unseen

Voting Based Approaches For Differentially Private Federated Learning

October 6, 2020/arXiv

Differentially Private Federated Learning (DPFL) is an emerging field with many applications. Gradient averaging-based DPFL methods require costly communication rounds and hardly work with large-capacity models due to the explicit dimension dependence in their added noise. In this work, inspired by knowledge

Adaptation Across Extreme Variations using Unlabeled Bridges

September 7, 2020/BMVC’20, Manchester, UK

We tackle an unsupervised domain adaptation problem for which the domain discrepancy between labeled source and unlabeled target domains is large, due to many factors of inter- and intra-domain variation. While deep domain adaptation methods have been realized by reducing the domain discrepancy, these

Pseudo RGB-D for Self-Improving Monocular SLAM and Depth Prediction

August 28, 2020/ECCV 2020 - The 16th European Conference on Computer Vision, Glasgow, UK

Classical monocular Simultaneous Localization And Mapping (SLAM) and the recently emerging convolutional neural networks (CNNs) for monocular depth prediction represent two largely disjoint approaches towards building a 3D map of the surrounding environment. In this paper, we demonstrate that the coupling

Improving Face Recognition by Clustering Unlabeled Faces in the Wild

August 23, 2020/ECCV 2020 - The 16th European Conference on Computer Vision, Glasgow, UK

While deep face recognition has benefited significantly from large-scale labeled data, current research is focused on leveraging unlabeled data to further boost performance, reducing the cost of human annotation. Prior work has mostly been in controlled settings, where the labeled and unlabeled data

Learning Monocular Visual Odometry via Self-Supervised Long-Term Modeling

August 23, 2020/ECCV 2020 - The 16th European Conference on Computer Vision, Glasgow, UK

Monocular visual odometry (VO) suffers severely from error accumulation during frame-to-frame pose estimation. In this paper, we present a self-supervised learning method for VO with special consideration for consistency over longer sequences. To this end, we model the long-term dependency in pose prediction

Object Detection with a Unified Label Space from Multiple Datasets

August 23, 2020/ECCV 2020 - The 16th European Conference on Computer Vision, Glasgow, UK

Given multiple datasets with different label spaces, the goal of this work is to train a single object detector predicting over the union of all the label spaces. The practical benefits of such an object detector are obvious and significant: application-relevant categories can be picked and merged from

SMART: Simultaneous Multi-Agent Recurrent Trajectory Prediction

August 23, 2020/ECCV 2020 - The 16th European Conference on Computer Vision, Glasgow, UK

We propose advances that address two key challenges in future trajectory prediction: (i) multimodality in both training data and predictions and (ii) constant time inference regardless of number of agents. Existing trajectory predictions are fundamentally limited by lack of diversity in training data,

Domain Adaptive Semantic Segmentation using Weak Labels

August 23, 2020/ECCV 2020 - The 16th European Conference on Computer Vision, Glasgow, UK

We propose a novel framework for domain adaptation in semantic segmentation with image-level weak labels in the target domain. The weak labels may be obtained based on a model prediction for unsupervised domain adaptation (UDA), or from a human oracle in a new weakly-supervised domain adaptation (WDA)

Improving Face Recognition by Clustering Unlabeled Faces in the Wild (arXiv)

July 10, 2020

While deep face recognition has benefited significantly from large scale labeled data, current research is focused on leveraging unlabeled data to further boost performance, reducing the cost of human annotation. Prior

Peek-a-boo: Occlusion Reasoning in Indoor Scenes with Plane Representations

June 16, 2020/CVPR 2020

We address the challenging task of occlusion-aware indoor 3D scene understanding. We represent scenes by a set of planes, where each one is defined by its normal, offset and two masks outlining (i) the extent of the visible part and (ii) the full region that consists of both visible and occluded parts

Private-kNN: Practical Differential Privacy for Computer Vision

June 16, 2020/CVPR 2020

With increasing ethical and legal concerns on privacy for deep models in visual recognition, differential privacy has emerged as a mechanism to disguise membership of sensitive data in training datasets. Recent methods like Private Aggregation of Teacher Ensembles (PATE) leverage a large ensemble of

Towards Universal Representation Learning for Deep Face Recognition

June 16, 2020/CVPR 2020

Recognizing wild faces is extremely hard as they appear with all kinds of variations. Traditional methods either train with specifically annotated variation data from target domains, or by introducing unlabeled target variation data to adapt from the training data. Instead, we propose a universal representation

Understanding Road Layout from Videos as a Whole

June 16, 2020/CVPR 2020

In this paper, we address the problem of inferring the layout of complex road scenes from video sequences. To this end, we formulate it as a top-view road attributes prediction problem and our goal is to predict these attributes for each frame both accurately and consistently. In contrast to prior work,

Active Adversarial Domain Adaptation

March 2, 2020/WACV 2020, Snowmass Village, CO USA

We propose an active learning approach for transferring representations across domains. Our approach, active adversarial domain adaptation (AADA), explores a duality between two related problems: adversarial domain alignment and importance sampling for adapting models across domains. The former uses

Unsupervised and Semi-Supervised Domain Adaptation for Action Recognition from Drones

March 2, 2020/WACV 2020, Snowmass Village, CO USA

We address the problem of human action classification in drone videos. Due to the high cost of capturing and labeling large-scale drone videos with diverse actions, we present unsupervised and semi-supervised domain adaptation approaches that leverage both the existing fully annotated action recognition

DAVID: Dual-Attentional Video Deblurring

March 2, 2020/WACV 2020, Snowmass Village, CO USA

Blind video deblurring restores sharp frames from a blurry sequence without any prior. It is a challenging task because the blur due to camera shake, object movement and defocusing is heterogeneous in both temporal and spatial dimensions. Traditional methods train on datasets synthesized with a single

Adversarial Learning of Privacy-Preserving and Task-Oriented Representations

February 7, 2020/AAAI 2020, New York, New York USA

Data privacy has emerged as an important issue as data-driven deep learning has been an essential component of modern machine learning systems. For instance, there could be a potential privacy risk of machine learning systems via the model inversion attack, whose goal is to reconstruct the input data

Degeneracy in Self-Calibration Revisited and a Deep Learning Solution for Uncalibrated SLAM

November 3, 2019/IROS 2019, The Venetian Macao, Macau, China

Self-calibration of camera intrinsics and radial distortion has a long history of research in the computer vision community. However, it remains rare to see real applications of such techniques to modern Simultaneous Localization And Mapping (SLAM) systems, especially in driving scenarios. In this paper,

Domain Adaptation for Structured Output via Discriminative Patch Representations

October 27, 2019/ICCV 2019 - International Conference on Computer Vision 2019, Seoul, Korea

Predicting structured outputs such as semantic segmentation relies on expensive per-pixel annotations to learn supervised models like convolutional neural networks. However, models trained on one data domain may not generalize well to other domains without annotations for model finetuning. To avoid the

Deep Supervision with Intermediate Concepts (IEEE)

August 1, 2019/IEEE Transactions on Pattern Analysis and Machine Intelligence

Recent data-driven approaches to scene interpretation predominantly pose inference as an end-to-end black-box mapping, commonly performed by a Convolutional Neural Network (CNN). However, decades of work on perceptual organization in both human

Pose-variant 3D Facial Attribute Generation

July 23, 2019/arXiv

We address the challenging problem of generating facial attributes using a single image in an unconstrained pose. In contrast to prior works that largely consider generation on 2D near-frontal images, we propose a GAN-based framework to generate attributes directly on a dense 3D representation given

A Dataset for High-Level 3D Scene Understanding of Complex Road Scenes in the Top-View

June 17, 2019/Proceedings of CVPR 2019 Workshop on 3D Scene Understanding for Vision, Graphics, and Robotics

We introduce a novel dataset for high-level 3D scene understanding of complex road scenes. Our annotations extend the existing datasets KITTI [5] and nuScenes [1] with semantically and geometrically meaningful attributes like the number of lanes or the existence of, and distance to, intersections, sidewalks

Learning Structure-And-Motion-Aware Rolling Shutter Correction

June 16, 2019/IEEE Computer Vision and Pattern Recognition (CVPR 2019)

An exact method of correcting the rolling shutter (RS) effect requires recovering the underlying geometry, i.e. the scene structures and the camera motions between scanlines or between views. However, the multiple-view geometry for RS cameras is much more complicated than its global shutter (GS) counterpart,

A Parametric Top-View Representation of Complex Road Scenes

June 16, 2019/IEEE Computer Vision and Pattern Recognition (CVPR 2019)

In this paper, we address the problem of inferring the layout of complex road scenes given a single camera as input. To achieve that, we first propose a novel parameterized model of road layouts in a top-view representation, which is not only intuitive for human visualization but also provides an interpretable

Feature Transfer Learning for Face Recognition with Under-Represented Data

June 16, 2019/IEEE Computer Vision and Pattern Recognition (CVPR 2019)

Despite the large volume of face recognition datasets, a significant portion of subjects have too few samples and are thus under-represented. Ignoring such a significant portion results in insufficient training data. Training with under-represented data leads to biased classifiers

Gotta Adapt Em All: Joint Pixel and Feature-Level Domain Adaptation for Recognition in the Wild

June 16, 2019/IEEE Computer Vision and Pattern Recognition (CVPR 2019)

Recent developments in deep domain adaptation have allowed knowledge transfer from a labeled source domain to an unlabeled target domain at the level of intermediate features or input pixels. We propose that advantages may be derived by combining them, in the form of different insights that lead to a

Learning To Simulate

May 6, 2019/Seventh International Conference on Learning Representations (ICLR 2019)

Simulation is a useful tool in situations where training data for machine learning models is costly to annotate or even hard to acquire. In this work, we propose a reinforcement learning-based method for automatically adjusting the parameters of any (non-differentiable) simulator, thereby controlling

Unsupervised Domain Adaptation for Distance Metric Learning

May 6, 2019/Seventh International Conference on Learning Representations (ICLR 2019)

Unsupervised domain adaptation is a promising avenue to enhance the performance of deep neural networks on a target domain, using labels only from a source domain. However, the two predominant methods, domain discrepancy reduction learning and semi-supervised learning, are not readily applicable when

Memory Warps for Long-Term Online Video Representations and Anticipation

January 8, 2019/Winter Conference on Applications of Computer Vision (WACV) 2019, Waikoloa Village, Hawaii USA

We propose a novel memory-based online video representation that is efficient, accurate and predictive. This is in contrast to prior works that often rely on computationally heavy 3D convolutions, ignore motion when aligning features over time, or operate in an off-line mode to utilize future frames.

Unsupervised Cross Domain Distance Metric Adaptation with Feature Transfer Network

October 11, 2018/BayLearn Symposium 2018, Menlo Park, CA USA

Unsupervised domain adaptation is an attractive avenue to enhance the performance of deep neural networks in a target domain, using labels only from a source domain. However, two predominant methods along this line, namely, domain divergence reduction learning and semi-supervised learning, are not readily

Learning to Look around Objects for Top-View Representations of Outdoor Scenes

September 8, 2018/European Conference on Computer Vision – ECCV 2018, Munich, Germany

Given a single RGB image of a complex outdoor road scene in the perspective view, we address the novel problem of estimating an occlusion-reasoned semantic scene layout in the top-view. This challenging problem not only requires an accurate understanding of both the 3D geometry and the semantics of the

Hierarchical Metric Learning and Matching for 2D and 3D Geometric Correspondences

September 8, 2018/European Conference on Computer Vision – ECCV 2018, Munich, Germany

Interest point descriptors have fueled progress on almost every problem in computer vision. Recent advances in deep neural networks have enabled task-specific learned descriptors that outperform hand-crafted descriptors on many problems. We demonstrate that commonly used metric learning approaches do

Learning to Adapt Structured Output Space for Semantic Segmentation

June 18, 2018/Conference on Computer Vision and Pattern Recognition (CVPR) 2018, Salt Lake City, UT USA

Convolutional neural network-based approaches for semantic segmentation rely on supervision with pixel-level ground truth, but may not generalize well to unseen image domains. As the labeling process is tedious and labor intensive, developing algorithms that can adapt source ground truth labels to the

Memory Warps for Learning Long-Term Online Video Representations

March 28, 2018/arXiv

This paper proposes a novel memory-based online video representation that is efficient, accurate and predictive. This is in contrast to prior works that often rely on computationally heavy 3D convolutions, ignore actual motion when aligning features over time, or operate in an off-line mode to utilize

Feature Transfer Learning for Deep Face Recognition with Long-Tail Data

March 23, 2018/arXiv

Real-world face recognition datasets exhibit long-tail characteristics, which results in biased classifiers in conventionally-trained deep neural networks, or insufficient data when long-tail classes are ignored. In this paper, we propose to handle long-tail classes in the training of a face recognition

SVBRDF-Invariant Shape and Reflectance Estimation from a Light-Field Camera

February 22, 2018/IEEE Transactions on Pattern Analysis and Machine Intelligence

Light-field cameras have recently emerged as a powerful tool for one-shot passive 3D shape capture. However, obtaining the shape of glossy objects like metals or plastics remains challenging, since standard Lambertian cues like photo-consistency cannot be easily applied. In this paper, we derive a spatially-varying

Joint Pixel and Feature-level Domain Adaptation in the Wild

February 5, 2018/arXiv

Recent developments in deep domain adaptation have allowed knowledge transfer from a labeled source domain to an unlabeled target domain at the level of intermediate features or input pixels. We propose that advantages may be derived by combining them, in the form of different insights that lead to a

Learning Random-Walk Label Propagation for Weakly-Supervised Semantic Segmentation

February 1, 2018/arXiv

Large-scale training for semantic segmentation is challenging due to the expense of obtaining training data for this task relative to other vision tasks. We propose a novel training approach to address this difficulty. Given cheaply-obtained sparse image labelings, we propagate the sparse labels to produce

WarpNet: Weakly Supervised Matching for Single-View Reconstruction

June 1, 2016/CVPR 2016

Our WarpNet matches images of objects in fine-grained datasets without using part annotations. It aligns an object in one image with a different object in another by exploiting a fine-grained dataset to create artificial data for training a Siamese network with an unsupervised discriminative learning