Media Analytics | Publications

Publications

Image-Specific Adaptation of Transformer Encoders for Compute-Efficient Segmentation

March 6, 2026/5th Workshop on Image/Video/Audio Quality Assessment in Computer Vision, VLM and Diffusion Model in conjunction with WACV 2026

Vision transformer-based models bring significant improvements for image segmentation tasks. Although these architectures offer powerful capabilities irrespective of specific segmentation tasks, their use of computational resources can be taxing on deployed devices. One way to overcome this challenge

HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles

February 23, 2026/arXiv

Controllable driving scene generation is critical for realistic and scalable autonomous driving simulation, yet existing approaches struggle to jointly achieve photorealism and precise control. We introduce HorizonForge, a unified framework that reconstructs scenes as editable Gaussian Splats and Meshes,

iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning

December 2, 2025/The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)

Grounding large language models (LLMs) in domain-specific tasks like post-hoc dash-cam driving video analysis is challenging due to their general-purpose training and lack of structured inductive biases. As vision is often the sole modality available for such analysis (i.e., no LiDAR, GPS, etc.), existing

AutoScape: Geometry-Consistent Long-Horizon Scene Generation

October 19, 2025/ICCV 2025

This paper proposes AutoScape, a long-horizon driving scene generation framework. At its core is a novel RGB-D diffusion model that iteratively generates sparse, geometrically consistent keyframes, serving as reliable anchors for the scene's appearance and geometry. To maintain long-range geometric consistency,

DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning

October 19, 2025/ICCV 2025

Visual reasoning (VR), which is crucial in many fields for enabling human-like visual understanding, remains highly challenging. Recently, compositional visual reasoning approaches, which leverage the reasoning abilities of large language models (LLMs) with integrated tools to solve problems, have shown

LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation

October 19, 2025/ICCV 2025

Evaluating autonomous vehicles with controllability enables scalable testing in counterfactual or structured settings, enhancing both efficiency and safety. We introduce LangTraj, a language-conditioned scene-diffusion model that simulates the joint behavior of all agents in traffic scenarios. By conditioning

Mapillary Vistas Validation for Fine-Grained Traffic Signs: A Benchmark Revealing Vision-Language Model Limitations

October 19, 2025/The 4th DataCV Workshop and Challenge at ICCV 2025

Obtaining high-quality fine-grained annotations for traffic signs is critical for accurate and safe decision-making in autonomous driving. Widely used datasets, such as Mapillary, often provide only coarse-grained labels without distinguishing semantically important types such as stop signs or speed

iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning

October 1, 2025/https://arxiv.org

Grounding large language models (LLMs) in domain-specific tasks like post-hoc dash-cam driving video analysis is challenging due to their general-purpose training and lack of structured inductive biases. As vision is often the sole modality available for such analysis (i.e., no LiDAR, GPS, etc.), existing

Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation

April 24, 2025/ICLR 2025

A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions. With efficiency being a high priority for scaling such models, we observed that the state-of-the-art method Mask2Former uses >50% of its compute

ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models

January 17, 2025/arXiv

Spatio-temporal reasoning is essential in understanding real-world environments in various fields, e.g., autonomous driving and sports analytics. Recent advances have improved the spatial reasoning ability of Vision-Language Models (VLMs) by introducing large-scale data, but these models still struggle

Drive-1-to-3: Enriching Diffusion Priors for Novel View Synthesis of Real Vehicles

December 19, 2024/arXiv

The recent advent of large-scale 3D data, e.g., Objaverse, has led to impressive progress in training pose-conditioned diffusion models for novel view synthesis. However, due to the synthetic nature of such 3D data, their performance drops significantly when applied to real-world images. This paper consolidates

Safe-Sim: Safety-Critical Closed-Loop Traffic Simulation with Diffusion-Controllable Adversaries

September 29, 2024/The 18th European Conference on Computer Vision ECCV 2024

Evaluating the performance of autonomous vehicle planning algorithms necessitates simulating long-tail safety-critical traffic scenarios. However, traditional methods for generating such scenarios often fall short in terms of controllability and realism; they also neglect the dynamics of agent interactions.

OPENCAM: Lensless Optical Encryption Camera

September 5, 2024/IEEE Transactions on Computational Imaging

Lensless cameras multiplex the incoming light before it is recorded by the sensor. This ability to multiplex the incoming light has led to the development of ultra-thin, high-speed, and single-shot 3D imagers. Recently, there have been various attempts at demonstrating another useful aspect of lensless

Foundational Vision-LLM for AI Linkage and Orchestration

July 3, 2024/NEC Technical Journal, Special Issue on Revolutionizing Business Practices with Generative AI

We propose a vision-LLM framework for automating development and deployment of computer vision solutions for pre-defined or custom-defined tasks. A foundational layer is proposed with a code-LLM AI orchestrator self-trained with reinforcement learning to create Python code based on its understanding

Taming Self-Training for Open-Vocabulary Object Detection

June 17, 2024/CVPR 2024

Recent studies have shown promising performance in open-vocabulary object detection (OVD) by utilizing pseudo labels (PLs) from pretrained vision and language models (VLMs). However, teacher-student self-training, a powerful and widely used paradigm to leverage PLs, is rarely explored for OVD.

AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving

June 17, 2024/CVPR 2024

Autonomous vehicle (AV) systems rely on robust perception models as a cornerstone of safety assurance. However, objects encountered on the road exhibit a long-tailed distribution, with rare or unseen categories posing challenges to a deployed perception model. This necessitates an expensive process of

Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement

June 17, 2024/CVPR 2024

Visual program synthesis is a promising approach to exploit the reasoning abilities of large language models for compositional computer vision tasks. Previous work has used few-shot prompting with frozen LLMs to synthesize visual programs. Training an LLM to write better visual programs is an attractive

Instantaneous Perception of Moving Objects in 3D

June 17, 2024/CVPR 2024

The perception of 3D motion of surrounding traffic participants is crucial for driving safety. While existing works primarily focus on general large motions, we contend that the instantaneous detection and quantification of subtle motions is equally important as they indicate the nuances in driving behavior

LidaRF: Delving into Lidar for Neural Radiance Field on Street Scenes

June 17, 2024/CVPR 2024

Photorealistic simulation plays a crucial role in applications such as autonomous driving, where advances in neural radiance fields (NeRFs) may allow better scalability through the automatic creation of digital 3D assets. However, reconstruction quality suffers on street scenes due to largely collinear

Improving the Efficiency-Accuracy Trade-off of DETR-Style Models in Practice

June 17, 2024/The 7th Workshop on Efficient Deep Learning for Computer Vision at CVPR 2024

This report aims to provide a comprehensive view on the inference efficiency of DETR-style detection models. We provide the effect of the basic efficiency techniques and identify the factors that are easily applicable yet effectively improve the efficiency-accuracy trade-off. Specifically, we explore

Generating Enhanced Negatives for Training Language-Based Object Detectors

June 16, 2024/CVPR 2024

The recent progress in language-based open-vocabulary object detection can be largely attributed to finding better ways of leveraging large-scale data with free-form text annotations. Training such models with a discriminative objective function has proven successful, but requires good positive and negative

Long-HOT: A Modular Hierarchical Approach for Long-Horizon Object Transport

May 13, 2024/ICRA 24, PACIFICO Yokohama, Japan & CVPR2024 Seattle, WA

We aim to address key challenges in long-horizon embodied exploration and navigation by proposing a long-horizon object transport task called Long-HOT and a novel modular framework for temporally extended navigation. Agents in Long-HOT need to efficiently find and pick up target objects that are scattered

Efficient Transformer Encoders for Mask2Former-style Models

April 16, 2024/https://arxiv.org

Vision transformer-based models bring significant improvements for image segmentation tasks. Although these architectures offer powerful capabilities irrespective of specific segmentation tasks, their use of computational resources can be taxing on deployed devices. One way to overcome this challenge

Improving Language-Based Object Detection by Explicit Generation of Negative Examples

December 21, 2023/https://arxiv.org

The recent progress in language-based object detection with an open vocabulary can be largely attributed to finding better ways of leveraging large-scale data with free-form text annotations. Training from image captions with grounded bounding boxes (ground truth or pseudo-labeled) enables the models

Exploring Question Decomposition for Zero-Shot VQA

December 11, 2023/NeurIPS 2023

Visual question answering (VQA) has traditionally been treated as a single-step task where each question receives the same amount of effort, unlike natural human question-answering strategies. We explore a question decomposition strategy for VQA to overcome this limitation. We probe the ability of recently

DP-Mix: Mixup-based Data Augmentation for Differentially Private Learning

December 10, 2023/NeurIPS 2023

Data augmentation techniques, such as image transformations and combinations, are highly effective at improving the generalization of computer vision models, especially when training data is limited. However, such techniques are fundamentally incompatible with differentially private learning approaches,

Controllable Safety-Critical Closed-Loop Traffic Simulation via Guided Diffusion

December 8, 2023/https://arxiv.org

Evaluating the performance of autonomous vehicle planning algorithms necessitates simulating long-tail traffic scenarios. Traditional methods for generating safety-critical scenarios often fall short in realism and controllability. Furthermore, these techniques generally neglect the dynamics of agent

LLM-ASSIST: Enhancing Closed-Loop Planning with Language-Based Reasoning

December 8, 2023/https://arxiv.org

Although planning is a crucial component of the autonomous driving stack, researchers have yet to develop robust planning algorithms that are capable of safely handling the diverse range of possible driving scenarios. Learning-based planners suffer from overfitting and poor long-tail performance. On

OpEnCam: Optical Encryption Camera

December 8, 2023/https://arxiv.org

Lensless cameras multiplex the incoming light before it is recorded by the sensor. This ability to multiplex the incoming light has led to the development of ultra-thin, high-speed, and single-shot 3D imagers. Recently, there have been various attempts at demonstrating another useful aspect of lensless

Domain Generalization Guided by Gradient Signal to Noise Ratio of Parameters

October 2, 2023/ICCV 2023

Overfitting to the source domain is a common issue in gradient-based training of deep neural networks. To compensate for the over-parameterized models, numerous regularization techniques have been introduced such as those based on dropout. While these methods achieve significant improvements on classical

Efficient Controllable Multi-Task Architectures

October 2, 2023/ICCV 2023

We aim to train a multi-task model such that users can adjust the desired compute budget and relative importance of task performances after deployment, without retraining. This enables optimizing performance for dynamically varying user needs, without heavy computational overhead to train and save models

LDP-Feat: Image Features with Local Differential Privacy

October 2, 2023/ICCV 2023

Modern computer vision services often require users to share raw feature descriptors with an untrusted server. This presents an inherent privacy risk, as raw descriptors may be used to recover the source images from which they were extracted. To address this issue, researchers recently proposed privatizing

OmniLabel: A Challenging Benchmark for Language-Based Object Detection

October 2, 2023/ICCV 2023

Language-based object detection is a promising direction towards building a natural interface to describe objects in images that goes far beyond plain category names. While recent methods show great progress in that direction, proper evaluation is lacking. With OmniLabel, we propose a novel task definition,

Improving Pseudo Labels for Open-Vocabulary Object Detection

August 2, 2023/https://arxiv.org

Recent studies show promising performance in open-vocabulary object detection (OVD) using pseudo labels (PLs) from pretrained vision and language models (VLMs). However, PLs generated by VLMs are extremely noisy due to the gap between the pretraining objective of VLMs and OVD, which blocks further advances

Confidence and Dispersity Speak: Characterizing Prediction Matrix for Unsupervised Accuracy Estimation

July 23, 2023

This work aims to assess how well a model performs under distribution shifts without using labels. While recent methods study prediction confidence, this work reports prediction dispersity is another

NeurOCS: Neural NOCS Supervision for Monocular 3D Object Localization

June 18, 2023/CVPR 2023

Monocular 3D object localization in driving scenes is a crucial task, but challenging due to its ill-posed nature. Estimating 3D coordinates for each pixel on the object surface holds great potential as it provides dense 2D-3D geometric constraints for the underlying PnP problem. However, high-quality

Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!

June 18, 2023/CVPR 2023

Finetuning a large vision-language model (VLM) on a target dataset after large-scale pretraining is a dominant paradigm in visual question answering (VQA). Datasets for specialized tasks such as knowledge-based VQA or VQA in non-natural-image domains are orders of magnitude smaller than those for general-purpose

Split to Learn: Gradient Split for Multi-Task Human Image Analysis

January 3, 2023/WACV23

This paper presents an approach to train a unified deep network that simultaneously solves multiple human-related tasks. A multi-task framework is favorable for sharing information across tasks under restricted computational resources. However, tasks not only share information but may also compete for

Exploiting Unlabeled Data with Vision and Language Models for Object Detection

October 24, 2022/ECCV 2022

Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets. However, it is prohibitively costly to acquire annotations for thousands of categories at a large scale. We propose a novel method that leverages the rich semantics available

Learning Phase Mask for Privacy-Preserving Passive Depth Estimation

October 24, 2022/ECCV 2022

With over a billion sold each year, cameras are not only becoming ubiquitous, but are driving progress in a wide range of domains such as mixed reality, robotics, and more. However, severe concerns regarding the privacy implications of camera-based solutions currently limit the range of environments

Learning Semantic Segmentation from Multiple Datasets with Label Shifts

October 24, 2022/ECCV 2022

While it is desirable to train segmentation models on an aggregation of multiple datasets, a major challenge is that the label space of each dataset may be in conflict with one another. To tackle this challenge, we propose UniSeg, an effective and model-agnostic approach to automatically train segmentation

Single-Stream Multi-level Alignment for Vision-Language Pretraining

October 24, 2022/ECCV 2022

Self-supervised vision-language pretraining from pure images and text with a contrastive loss is effective, but ignores fine-grained alignment due to a dual-stream architecture that aligns image and text representations only on a global level. Earlier, supervised, non-contrastive methods were capable

Controllable Dynamic Multi-Task Architectures

June 19, 2022/CVPR'22

Multi-task learning commonly encounters competition for resources among tasks, specifically when model capacity is limited. This challenge motivates models which allow control over the relative importance of tasks and total compute cost during inference time. In this work, we propose such a controllable

Learning to Learn across Diverse Data Biases in Deep Face Recognition

June 19, 2022/CVPR'22

Convolutional Neural Networks have achieved remarkable success in face recognition, in part due to the abundant availability of data. However, the data used for training CNNs is often imbalanced. Prior works largely focus on the long-tailed nature of face datasets in data volume per identity or focus

MM-TTA: Multi-Modal Test-Time Adaptation for 3D Semantic Segmentation

June 19, 2022/CVPR'22

Test-time adaptation approaches have recently emerged as a practical solution for handling domain shift without access to the source domain data. In this paper, we propose and explore a new multi-modal extension of test-time adaptation for 3D semantic segmentation. We find that directly applying existing

On Generalizing Beyond Domains in Cross-Domain Continual Learning

June 19, 2022/CVPR'22

Humans have the ability to accumulate knowledge of new tasks in varying conditions, but deep neural networks often suffer from catastrophic forgetting of previously learned knowledge after learning a new task. Many recent methods focus on preventing catastrophic forgetting under the assumption of train

Weakly But Deeply Supervised Occlusion-Reasoned Parametric Road Layouts

June 19, 2022/CVPR'22

We propose an end-to-end network that takes a single perspective RGB image of a complex road scene as input, to produce occlusion-reasoned layouts in perspective space as well as a parametric bird’s-eye-view (BEV) space. In contrast to prior works that require dense supervision such as semantic labels

Confidence and Dispersity Speak – Characterizing Prediction Matrix for Unsupervised Accuracy Estimation

February 2, 2022/arXiv

This work aims to assess how well a model performs under distribution shifts without using labels. While recent methods study prediction confidence, this work reports prediction dispersity is another informative cue. Confidence reflects whether the individual prediction is certain, dispersity indicates

Learning Cross-Modal Contrastive Features for Video Domain Adaptation

October 11, 2021/ICCV 2021, Virtual

Learning transferable and domain adaptive feature representations from videos is important for video-relevant tasks such as action recognition. Existing video domain adaptation methods mainly rely on adversarial feature alignment, which has been derived from the RGB image space. However, video data is

Cross-Domain Similarity Learning for Face Recognition in Unseen Domains

June 19, 2021/CVPR 2021, Virtual

Face recognition models trained under the assumption of identical training and test distributions often suffer from poor generalization when faced with unknown variations, such as a novel ethnicity or unpredictable individual make-ups during test time. In this paper, we introduce a novel cross-domain

Divide-and-Conquer for Lane-Aware Diverse Trajectory Prediction

June 19, 2021/CVPR 2021, Virtual

Trajectory prediction is a safety-critical tool for autonomous vehicles to plan and execute actions. Our work addresses two key challenges in trajectory prediction: learning multimodal outputs and better predictions by imposing constraints using driving knowledge. Recent methods have achieved strong

Fusing the Old with the New: Learning Relative Pose with Geometry-Guided Uncertainty

June 19, 2021/CVPR 2021, Virtual

Learning methods for relative camera pose estimation have been developed largely in isolation from classical geometric approaches. The question of how to integrate predictions from deep neural networks (DNNs) and solutions from geometric solvers, such as the 5-point algorithm [37], has as yet remained

Cross-Modality 3D Object Detection

January 5, 2021/WACV 2021, Virtual

In this paper, we focus on exploring the fusion of images and point clouds for 3D object detection in view of the complementary nature of the two modalities, i.e., images possess more semantic information while point clouds specialize in distance sensing. To this end, we present a novel two-stage multi-modal

Set Augmented Triplet Loss for Video Person Re-Identification

January 5, 2021/WACV 2021, Virtual

Modern video person re-identification (re-ID) machines are often trained using a metric learning approach, supervised by a triplet loss. The triplet loss used in video re-ID is usually based on so-called clip features, each aggregated from a few frame features. In this paper, we propose to model the

Channel Recurrent Attention Networks for Video Pedestrian Retrieval

November 30, 2020/ACCV 2020, Kyoto, Japan

Full attention, which generates an attention value per element of the input feature maps, has been successfully demonstrated to be beneficial in visual tasks. In this work, we propose a fully attentional network, termed channel recurrent attention network, for the task of video pedestrian retrieval.

Uncertainty Aware Physically Guided Proxy Tasks for Unseen Domain Face Anti-Spoofing

November 20, 2020/arXiv

Face anti-spoofing (FAS) seeks to discriminate genuine faces from fake ones arising from any type of spoofing attack. Due to the wide variety of attacks, it is implausible to obtain training data that spans all attack types. We propose to leverage physical cues to attain better generalization on unseen

Voting Based Approaches For Differentially Private Federated Learning

October 6, 2020/arXiv

Differentially Private Federated Learning (DPFL) is an emerging field with many applications. Gradient averaging-based DPFL methods require costly communication rounds and hardly work with large capacity models due to the explicit dimension dependence in its added noise. In this work, inspired by knowledge

Adaptation Across Extreme Variations using Unlabeled Bridges

September 7, 2020/BMVC’20, Manchester, UK

We tackle an unsupervised domain adaptation problem for which the domain discrepancy between labeled source and unlabeled target domains is large, due to many factors of inter- and intra-domain variation. While deep domain adaptation methods have been realized by reducing the domain discrepancy, these

Pseudo RGB-D for Self-Improving Monocular SLAM and Depth Prediction

August 28, 2020/ECCV 2020 - The 16th European Conference on Computer Vision, Glasgow, UK

Classical monocular Simultaneous Localization And Mapping (SLAM) and the recently emerging convolutional neural networks (CNNs) for monocular depth prediction represent two largely disjoint approaches towards building a 3D map of the surrounding environment. In this paper, we demonstrate that the coupling

Domain Adaptive Semantic Segmentation using Weak Labels

August 23, 2020/ECCV 2020 - The 16th European Conference on Computer Vision, Glasgow, UK

We propose a novel framework for domain adaptation in semantic segmentation with image-level weak labels in the target domain. The weak labels may be obtained based on a model prediction for unsupervised domain adaptation (UDA), or from a human oracle in a new weakly-supervised domain adaptation (WDA)

Image Stitching and Rectification for Hand-Held Cameras

August 23, 2020/ECCV 2020 - The 16th European Conference on Computer Vision, Glasgow, UK

In this paper, we derive a new differential homography that can account for the scanline-varying camera poses in Rolling Shutter (RS) cameras, and demonstrate its application to carry out RS-aware image stitching and rectification at one stroke. Despite the high complexity of RS geometry, we focus in

Improving Face Recognition by Clustering Unlabeled Faces in the Wild

August 23, 2020/ECCV 2020 - The 16th European Conference on Computer Vision, Glasgow, UK

While deep face recognition has benefited significantly from large-scale labeled data, current research is focused on leveraging unlabeled data to further boost performance, reducing the cost of human annotation. Prior work has mostly been in controlled settings, where the labeled and unlabeled data

Learning Monocular Visual Odometry via Self-Supervised Long-Term Modeling

August 23, 2020/ECCV 2020 - The 16th European Conference on Computer Vision, Glasgow, UK

Monocular visual odometry (VO) suffers severely from error accumulation during frame-to-frame pose estimation. In this paper, we present a self-supervised learning method for VO with special consideration for consistency over longer sequences. To this end, we model the long-term dependency in pose prediction

Learning to Optimize Domain Specific Normalization for Domain Generalization

August 23, 2020/ECCV 2020 - The 16th European Conference on Computer Vision, Glasgow, UK

We propose a simple but effective multi-source domain generalization technique based on deep neural networks by incorporating optimized normalization layers that are specific to individual domains. Our approach employs multiple normalization methods while learning separate affine parameters per domain.

Object Detection with a Unified Label Space from Multiple Datasets

August 23, 2020/ECCV 2020 - The 16th European Conference on Computer Vision, Glasgow, UK

Given multiple datasets with different label spaces, the goal of this work is to train a single object detector predicting over the union of all the label spaces. The practical benefits of such an object detector are obvious and significant: application-relevant categories can be picked and merged from

Shuffle and Attend: Video Domain Adaptation

August 23, 2020/ECCV 2020 - The 16th European Conference on Computer Vision, Glasgow, UK

We address the problem of domain adaptation in videos for the task of human action recognition. Inspired by image-based domain adaptation, we can perform video adaptation by aligning the features of frames or clips of source and target videos. However, equally aligning all clips is sub-optimal as not

SMART: Simultaneous Multi-Agent Recurrent Trajectory Prediction

August 23, 2020/ECCV 2020 - The 16th European Conference on Computer Vision, Glasgow, UK

We propose advances that address two key challenges in future trajectory prediction: (i) multimodality in both training data and predictions and (ii) constant time inference regardless of number of agents. Existing trajectory predictions are fundamentally limited by lack of diversity in training data,

Improving Face Recognition by Clustering Unlabeled Faces in the Wild (arXiv)

July 10, 2020

While deep face recognition has benefited significantly from large-scale labeled data, current research is focused on leveraging unlabeled data to further boost performance, reducing the cost of human annotation. Prior

Peek-a-boo: Occlusion Reasoning in Indoor Scenes with Plane Representations

June 16, 2020/CVPR 2020

We address the challenging task of occlusion-aware indoor 3D scene understanding. We represent scenes by a set of planes, where each one is defined by its normal, offset and two masks outlining (i) the extent of the visible part and (ii) the full region that consists of both visible and occluded parts

Private-kNN: Practical Differential Privacy for Computer Vision

June 16, 2020/CVPR 2020

With increasing ethical and legal concerns on privacy for deep models in visual recognition, differential privacy has emerged as a mechanism to disguise membership of sensitive data in training datasets. Recent methods like Private Aggregation of Teacher Ensembles (PATE) leverage a large ensemble of

Towards Universal Representation Learning for Deep Face Recognition

June 16, 2020/CVPR 2020

Recognizing wild faces is extremely hard as they appear with all kinds of variations. Traditional methods either train with specifically annotated variation data from target domains, or by introducing unlabeled target variation data to adapt from the training data. Instead, we propose a universal representation

Understanding Road Layout from Videos as a Whole

June 16, 2020/CVPR 2020

In this paper, we address the problem of inferring the layout of complex road scenes from video sequences. To this end, we formulate it as a top-view road attributes prediction problem and our goal is to predict these attributes for each frame both accurately and consistently. In contrast to prior work,

Active Adversarial Domain Adaptation

March 2, 2020/WACV 2020, Snowmass Village, CO USA

We propose an active learning approach for transferring representations across domains. Our approach, active adversarial domain adaptation (AADA), explores a duality between two related problems: adversarial domain alignment and importance sampling for adapting models across domains. The former uses

Coordinated Joint Multimodal Embeddings for Generalized Audio-Visual Zero-shot Classification and Retrieval of Videos

March 2, 2020/WACV 2020, Snowmass Village, CO USA

We present an audio-visual multimodal approach for the task of zero-shot learning (ZSL) for classification and retrieval of videos. ZSL has been studied extensively in the recent past but has primarily been limited to visual modality and to images. We demonstrate that both audio and visual modalities

DAVID: Dual-Attentional Video Deblurring

March 2, 2020/WACV 2020, Snowmass Village, CO USA

Blind video deblurring restores sharp frames from a blurry sequence without any prior. It is a challenging task because the blur due to camera shake, object movement and defocusing is heterogeneous in both temporal and spatial dimensions. Traditional methods train on datasets synthesized with a single

Unsupervised and Semi-Supervised Domain Adaptation for Action Recognition from Drones

March 2, 2020/WACV 2020, Snowmass Village, CO USA

We address the problem of human action classification in drone videos. Due to the high cost of capturing and labeling large-scale drone videos with diverse actions, we present unsupervised and semi-supervised domain adaptation approaches that leverage both the existing fully annotated action recognition

Video Person Re-Identification using Learned Clip Similarity Aggregation

March 2, 2020/WACV 2020, Snowmass Village, CO USA

We address the challenging task of video-based person re-identification. Recent works have shown that splitting the video sequences into clips and then aggregating clip-based similarity is appropriate for the task. We show that using a learned clip similarity aggregation function allows filtering out

Adversarial Learning of Privacy-Preserving and Task-Oriented Representations

February 7, 2020/AAAI 2020, New York, New York USA

Data privacy has emerged as an important issue as data-driven deep learning has been an essential component of modern machine learning systems. For instance, there could be a potential privacy risk of machine learning systems via the model inversion attack, whose goal is to reconstruct the input data

Degeneracy in Self-Calibration Revisited and a Deep Learning Solution for Uncalibrated SLAM

November 3, 2019/IROS 2019, The Venetian Macao, Macau, China

Self-calibration of camera intrinsics and radial distortion has a long history of research in the computer vision community. However, it remains rare to see real applications of such techniques to modern Simultaneous Localization And Mapping (SLAM) systems, especially in driving scenarios. In this paper,

Learning 2D to 3D Lifting for Object Detection in 3D for Autonomous Vehicles

November 3, 2019/IROS 2019, The Venetian Macao, Macau, China

We address the problem of 3D object detection from 2D monocular images in autonomous driving scenarios. We propose to lift the 2D images to 3D representations using learned neural networks and leverage existing networks working directly on 3D data to perform 3D object detection and localization. We show

Domain Adaptation for Structured Output via Discriminative Patch Representations

October 27, 2019/ICCV 2019 - International Conference on Computer Vision 2019, Seoul, Korea

Predicting structured outputs such as semantic segmentation relies on expensive per-pixel annotations to learn supervised models like convolutional neural networks. However, models trained on one data domain may not generalize well to other domains without annotations for model finetuning. To avoid the

GLoSH: Global-Local Spherical Harmonics for Intrinsic Image Decomposition

October 27, 2019/ICCV 2019 - International Conference on Computer Vision 2019, Seoul, Korea

Traditional intrinsic image decomposition focuses on decomposing images into reflectance and shading, leaving surface normals and lighting entangled in shading. In this work, we propose a Global-Local Spherical Harmonics (GLoSH) lighting model to improve the lighting component, and jointly predict reflectance

Deep Supervision with Intermediate Concepts (IEEE)

August 1, 2019/IEEE Transactions on Pattern Analysis and Machine Intelligence

Recent data-driven approaches to scene interpretation predominantly pose inference as an end-to-end black-box mapping, commonly performed by a Convolutional Neural Network (CNN). However, decades of work on perceptual organization in both human

Pose-variant 3D Facial Attribute Generation

July 23, 2019/arXiv

We address the challenging problem of generating facial attributes using a single image in an unconstrained pose. In contrast to prior works that largely consider generation on 2D near-frontal images, we propose a GAN-based framework to generate attributes directly on a dense 3D representation given

A Dataset for High-Level 3D Scene Understanding of Complex Road Scenes in the Top-View

June 17, 2019/Proceedings of CVPR 2019 Workshop on 3D Scene Understanding for Vision, Graphics, and Robotics

We introduce a novel dataset for high-level 3D scene understanding of complex road scenes. Our annotations extend the existing datasets KITTI [5] and nuScenes [1] with semantically and geometrically meaningful attributes like the number of lanes or the existence of, and distance to, intersections, sidewalks

A Parametric Top-View Representation of Complex Road Scenes

June 16, 2019/IEEE Computer Vision and Pattern Recognition (CVPR 2019)

In this paper, we address the problem of inferring the layout of complex road scenes given a single camera as input. To achieve that, we first propose a novel parameterized model of road layouts in a top-view representation, which is not only intuitive for human visualization but also provides an interpretable

Feature Transfer Learning for Face Recognition with Under-Represented Data

June 16, 2019/IEEE Computer Vision and Pattern Recognition (CVPR 2019)

Despite the large volume of face recognition datasets, a significant portion of subjects have too few samples and are thus under-represented. Ignoring such a significant portion results in insufficient training data. Training with under-represented data leads to biased classifiers

Gotta Adapt Em All: Joint Pixel and Feature-Level Domain Adaptation for Recognition in the Wild

June 16, 2019/IEEE Computer Vision and Pattern Recognition (CVPR 2019)

Recent developments in deep domain adaptation have allowed knowledge transfer from a labeled source domain to an unlabeled target domain at the level of intermediate features or input pixels. We propose that advantages may be derived by combining them, in the form of different insights that lead to a

Learning Structure-And-Motion-Aware Rolling Shutter Correction

June 16, 2019/IEEE Computer Vision and Pattern Recognition (CVPR 2019)

An exact method of correcting the rolling shutter (RS) effect requires recovering the underlying geometry, i.e. the scene structures and the camera motions between scanlines or between views. However, the multiple-view geometry for RS cameras is much more complicated than its global shutter (GS) counterpart,

Neural Collaborative Subspace Clustering

June 9, 2019/International Conference on Machine Learning, ICML 2019, Long Beach, CA USA

We introduce the Neural Collaborative Subspace Clustering, a neural model that discovers clusters of data points drawn from a union of low-dimensional subspaces. In contrast to previous attempts, our model runs without the aid of spectral clustering. This makes our algorithm one of the kinds that can

Unsupervised Domain Adaptation for Distance Metric Learning

May 6, 2019/Seventh International Conference on Learning Representations (ICLR 2019)

Unsupervised domain adaptation is a promising avenue to enhance the performance of deep neural networks on a target domain, using labels only from a source domain. However, the two predominant methods, domain discrepancy reduction learning and semi-supervised learning, are not readily applicable when

Learning To Simulate

May 6, 2019/Seventh International Conference on Learning Representations (ICLR 2019)

Simulation is a useful tool in situations where training data for machine learning models is costly to annotate or even hard to acquire. In this work, we propose a reinforcement learning-based method for automatically adjusting the parameters of any (non-differentiable) simulator, thereby controlling

Attentive Conditional Channel-Recurrent Autoencoding for Attribute-Conditioned Face Synthesis

January 8, 2019/Winter Conference on Applications of Computer Vision (WACV) 2019, Waikoloa Village, Hawaii USA

Attribute-conditioned face synthesis has many potential use cases, such as to aid the identification of a suspect or a missing person. Building on top of a conditional version of VAE-GAN, we augment the pathways connecting the latent space with channel-recurrent architecture, in order to provide not

Memory Warps for Long-Term Online Video Representations and Anticipation

January 8, 2019/Winter Conference on Applications of Computer Vision (WACV) 2019, Waikoloa Village, Hawaii USA

We propose a novel memory-based online video representation that is efficient, accurate and predictive. This is in contrast to prior works that often rely on computationally heavy 3D convolutions, ignore motion when aligning features over time, or operate in an off-line mode to utilize future frames.

Scalable Deep k-Subspace Clustering

December 2, 2018/ACCV 2018, Perth, Australia

Subspace clustering algorithms are notorious for their scalability issues because building and processing large affinity matrices are demanding. In this paper, we introduce a method that simultaneously learns an embedding space along subspaces within it to minimize a notion of reconstruction error, thus

Unseen Object Segmentation in Videos via Transferable Representations

December 2, 2018/ACCV 2018

In order to learn object segmentation models in videos, conventional methods require a large amount of pixel-wise ground truth annotations. However, collecting such supervised data is time-consuming and labor-intensive. In this paper, we exploit existing annotations in source images and transfer such

Learning Gibbs-Regularized Pushforward Density Estimators with a Symmetric KL Objective

October 11, 2018/BayLearn Symposium 2018, Menlo Park, CA USA

We claim that there is currently no satisfactory way to regularize a generative adversarial network (GAN): neither the generator nor discriminator is particularly amenable to the imposition of inductive biases derived from domain knowledge. A generator is effectively a causal model of generation: one

Unsupervised Cross Domain Distance Metric Adaptation with Feature Transfer Network

October 11, 2018/BayLearn Symposium 2018, Menlo Park, CA USA

Unsupervised domain adaptation is an attractive avenue to enhance the performance of deep neural networks in a target domain, using labels only from a source domain. However, two predominant methods along this line, namely, domain divergence reduction learning and semi-supervised learning, are not readily

Hierarchical Metric Learning and Matching for 2D and 3D Geometric Correspondences

September 8, 2018/European Conference on Computer Vision - ECCV 2018, Munich, Germany

Interest point descriptors have fueled progress on almost every problem in computer vision. Recent advances in deep neural networks have enabled task-specific learned descriptors that outperform hand-crafted descriptors on many problems. We demonstrate that commonly used metric learning approaches do

R2P2: A Reparameterized Pushforward Policy for Diverse, Precise Generative Path Forecasting

September 8, 2018/European Conference on Computer Vision - ECCV 2018, Munich, Germany

We propose a method to forecast a vehicle’s ego-motion as a distribution over spatiotemporal paths, conditioned on features (e.g., from LIDAR and images) embedded in an overhead map. The method learns a policy inducing a distribution over simulated trajectories that is both diverse (produces most paths

Learning to Look around Objects for Top-View Representations of Outdoor Scenes

September 8, 2018/European Conference on Computer Vision - ECCV 2018, Munich, Germany

Given a single RGB image of a complex outdoor road scene in the perspective view, we address the novel problem of estimating an occlusion-reasoned semantic scene layout in the top-view. This challenging problem not only requires an accurate understanding of both the 3D geometry and the semantics of the

Zero-Shot Object Detection

September 8, 2018/European Conference on Computer Vision - ECCV 2018, Munich, Germany

We introduce and tackle the problem of zero-shot object detection (ZSD), which aims to detect object classes which are not observed during training. We work with a challenging set of object classes, not restricting ourselves to similar and/or fine-grained categories as in prior works on zero-shot classification.

Fast and Accurate Online Video Object Segmentation via Tracking Parts

June 18, 2018/Conference on Computer Vision and Pattern Recognition (CVPR) 2018, Salt Lake City, UT USA

Online video object segmentation is a challenging task as it entails processing the image sequence in a timely and accurate manner. To segment a target object through the video, numerous CNN-based methods have been developed by heavily finetuning on the object mask in the first frame, which is time-consuming for

Learning to Adapt Structured Output Space for Semantic Segmentation

June 18, 2018/Conference on Computer Vision and Pattern Recognition (CVPR) 2018, Salt Lake City, UT USA

Convolutional neural network-based approaches for semantic segmentation rely on supervision with pixel-level ground truth, but may not generalize well to unseen image domains. As the labeling process is tedious and labor intensive, developing algorithms that can adapt source ground truth labels to the

Memory Warps for Learning Long-Term Online Video Representations

March 28, 2018/arXiv

This paper proposes a novel memory-based online video representation that is efficient, accurate and predictive. This is in contrast to prior works that often rely on computationally heavy 3D convolutions, ignore actual motion when aligning features over time, or operate in an off-line mode to utilize

Feature Transfer Learning for Deep Face Recognition with Long-Tail Data

March 23, 2018/arXiv

Real-world face recognition datasets exhibit long-tail characteristics, which results in biased classifiers in conventionally-trained deep neural networks, or insufficient data when long-tail classes are ignored. In this paper, we propose to handle long-tail classes in the training of a face recognition

Channel-Recurrent Autoencoding for Image Modeling

March 14, 2018/WACV 2018, Lake Tahoe, Nevada USA

Despite recent successes in synthesizing faces and bedrooms, existing generative models struggle to capture more complex image types (Figure 1), potentially due to the oversimplification of their latent space constructions. To tackle this issue, building on Variational Autoencoders (VAEs), we integrate

SVBRDF-Invariant Shape and Reflectance Estimation from a Light-Field Camera

February 22, 2018/IEEE Transactions on Pattern Analysis and Machine Intelligence

Light-field cameras have recently emerged as a powerful tool for one-shot passive 3D shape capture. However, obtaining the shape of glossy objects like metals or plastics remains challenging, since standard Lambertian cues like photo-consistency cannot be easily applied. In this paper, we derive a spatially-varying

Joint Pixel and Feature-level Domain Adaptation in the Wild

February 5, 2018/arXiv

Recent developments in deep domain adaptation have allowed knowledge transfer from a labeled source domain to an unlabeled target domain at the level of intermediate features or input pixels. We propose that advantages may be derived by combining them, in the form of different insights that lead to a

Learning random-walk label propagation for weakly-supervised semantic segmentation

February 1, 2018/arXiv

Large-scale training for semantic segmentation is challenging due to the expense of obtaining training data for this task relative to other vision tasks. We propose a novel training approach to address this difficulty. Given cheaply-obtained sparse image labelings, we propagate the sparse labels to produce

A 4D Light-Field Dataset & CNN Architectures for Material Recognition

August 24, 2016/ECCV 2016

We introduce a new light-field dataset of materials and take advantage of the recent success of deep learning to perform material recognition on the 4D light field. Our dataset contains 12 material categories, each with 100 images taken with a Lytro Illum, from which we extract about 30,000 patches in

A Continuous Occlusion Model for Road Scene Understanding

June 27, 2016/2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

We present a physically interpretable 3D model for handling occlusions with applications to road scene understanding. Given object detection and SFM point tracks, our unified model probabilistically assigns point tracks to objects and reasons about object detection scores and bounding boxes. It uniformly

WarpNet: Weakly Supervised Matching for Single-View Reconstruction

June 1, 2016/CVPR 2016

Our WarpNet matches images of objects in fine-grained datasets without using part annotations. It aligns an object in one image with a different object in another by exploiting a fine-grained dataset to create artificial data for training a Siamese network with an unsupervised discriminative learning

Atomic Scenes for Scalable Traffic Scene Recognition in Monocular Videos

March 7, 2016/CVPR 2016

We propose a novel framework for monocular traffic scene recognition, relying on a decomposition into high-order and atomic scenes to meet those challenges. High-order scenes carry semantic meaning useful for AWS applications, while atomic scenes are easy to learn and represent elemental behaviors based

Attribute2Image: Conditional Image Generation From Visual Attributes

December 2, 2015/ECCV 2016, The 14th European Conference on Computer Vision (2016)

This paper investigates a novel problem of generating images from visual attributes. We model the image as a composite of foreground and background and develop a layered generative model with disentangled latent variables that can be learned end-to-end using a variational auto-encoder. We experiment