Media Analytics

Read our publications from our Media Analytics team who are overcoming fundamental challenges in computer vision and are addressing critical needs in mobility, security, safety and socially relevant AI. Our team solves fundamental challenges in computer vision, with a focus on understanding and interaction in 3D scenes, representation learning in visual and multimodal data, learning across domains and tasks, as well as responsible AI. Our technological breakthroughs contribute to socially-relevant solutions that address key enterprise needs in mobility, safety and smart spaces.

Posts

Feature Transfer Learning for Deep Face Recognition with Long-Tail Data

Real-world face recognition datasets exhibit long-tail characteristics, which results in biased classifiers in conventionally-trained deep neural networks, or insufficient data when long-tail classes are ignored. In this paper, we propose to handle long-tail classes in the training of a face recognition engine by augmenting their feature space under a center-based feature transfer framework. A Gaussian prior is assumed across all the head (regular) classes and the variance from regular classes are transferred to the long-tail class representation. This encourages the long-tail distribution to be closer to the regular distribution, while enriching and balancing the limited training data. Further, an alternating training regimen is proposed to simultaneously achieve less biased decision boundaries and a more discriminative feature representation. We conduct empirical studies that mimic long-tail datasets by limiting the number of samples and the proportion of long-tail classes on the MS-Celeb-1M dataset. We compare our method with baselines not designed to handle long-tail classes and also with state-of-the-art methods on face recognition benchmarks. State-of-the-art results on LFW, IJB-A and MS-Celeb-1M datasets demonstrate the effectiveness of our feature transfer approach and training strategy. Finally, our feature transfer allows smooth visual interpolation, which demonstrates disentanglement to preserve identity of a class while augmenting its feature space with non-identity variations.

Channel-Recurrent Autoencoding for Image Modeling

Despite recent successes in synthesizing faces and bedrooms, existing generative models struggle to capture more complex image types (Figure 1), potentially due to the oversimplification of their latent space constructions. To tackle this issue, building on Variational Autoencoders (VAEs), we integrate recurrent connections across channels to both inference and generation steps, allowing the high-level features to be captured in global-to-local, coarse-to-fine manners. Combined with adversarial loss, our channel-recurrent VAE-GAN (crVAE-GAN) outperforms VAE-GAN in generating a diverse spectrum of high resolution images while maintaining the same level of computational efficacy. Our model produces interpretable and expressive latent representations to benefit downstream tasks such as image completion. Moreover, we propose two novel regularizations, namely the KL objective weighting scheme over time steps and mutual information maximization between transformed latent variables and the outputs, to enhance the training.

SVBRDF-Invariant Shape and Reflectance Estimation from a Light-Field Camera

Light-field cameras have recently emerged as a powerful tool for one-shot passive 3D shape capture. However, obtaining the shape of glossy objects like metals or plastics remains challenging, since standard Lambertian cues like photo-consistency cannot be easily applied. In this paper, we derive a spatially-varying (SV)BRDF-invariant theory for recovering 3D shape and reflectance from light-field cameras. Our key theoretical insight is a novel analysis of diffuse plus single-lobe SVBRDFs under a light-field setup. We show that, although direct shape recovery is not possible, an equation relating depths and normals can still be derived. Using this equation, we then propose using a polynomial (quadratic) shape prior to resolve the shape ambiguity. Once shape is estimated, we also recover the reflectance. We present extensive synthetic data on the entire MERL BRDF dataset, as well as a number of real examples to validate the theory, where we simultaneously recover shape and BRDFs from a single image taken with a Lytro Illum camera.

Joint Pixel and Feature-level Domain Adaptation in the Wild

Recent developments in deep domain adaptation have allowed knowledge transfer from a labeled source domain to an unlabeled target domain at the level of intermediate features or input pixels. We propose that advantages may be derived by combining them, in the form of different insights that lead to a novel design and complementary properties that result in better performance. At the feature level, inspired by insights from semi-supervised learning in a domain adversarial neural network, we propose a novel regularization in the form of domain adversarial entropy minimization. Next, we posit that insights from computer vision are more amenable to injection at the pixel level and specifically address the key challenge of adaptation across different semantic levels. In particular, we use 3D geometry and image synthetization based on a generalized appearance flow to preserve identity across higher-level pose transformations, while using an attribute-conditioned CycleGAN to translate a single source into multiple target images that differ in lower-level properties such as lighting. We validate on a novel problem of car recognition in unlabeled surveillance images using labeled images from the web, handling explicitly specified, nameable factors of variation through pixel-level and implicit, unspecified factors through feature-level adaptation. Extensive experiments achieve state-of-the-art results, demonstrating the effectiveness of complementing feature and pixel-level information via our proposed domain adaptation method.

Learning random-walk label propagation for weakly-supervised semantic segmentation

Large-scale training for semantic segmentation is challenging due to the expense of obtaining training data for this task relative to other vision tasks. We propose a novel training approach to address this difficulty. Given cheaply-obtained sparse image labelings, we propagate the sparse labels to produce guessed dense labelings. A standard CNN-based segmentation network is trained to mimic these labelings. The label-propagation process is defined via random-walk hitting probabilities, which leads to a differentiable parameterization with uncertainty estimates that are incorporated into our loss. We show that by learning the label-propagator jointly with the segmentation predictor, we are able to effectively learn semantic edges given no direct edge supervision. Experiments also show that training a segmentation network in this way outperforms the naive approach.

A 4D Light-Field Dataset & CNN Architectures for Material Recognition

We introduce a new light-field dataset of materials and take advantage of the recent success of deep learning to perform material recognition on the 4D light field. Our dataset contains 12 material categories, each with 100 images taken with a Lytro Illum, from which we extract about 30,000 patches in total. To the best of our knowledge, this is the first mid-size dataset for light-field images.

A Continuous Occlusion Model for Road Scene Understanding

We present a physically interpretable 3D model for handling occlusions with applications to road scene understanding. Given object detection and SFM point tracks, our unified model probabilistically assigns point tracks to objects and reasons about object detection scores and bounding boxes. It uniformly handles static and dynamic objects, thus outperforming motion segmentation for association problems. It also demonstrates occlusion-aware 3D localization in road scenes.

WarpNet: Weakly Supervised Matching for Single-View Reconstruction

Our WarpNet matches images of objects in fine-grained datasets without using part annotations. It aligns an object in one image with a different object in another by exploiting a fine-grained dataset to create artificial data for training a Siamese network with an unsupervised discriminative learning approach. The output of the network acts as a spatial prior that allows generalization at test time to match real images across variations in appearance, viewpoint and articulation. This allows single-view reconstruction with quality comparable to using human annotation.

Atomic Scenes for Scalable Traffic Scene Recognition in Monocular Videos

We propose a novel framework for monocular traffic scene recognition, relying on a decomposition into high-order and atomic scenes to meet those challenges. High-order scenes carry semantic meaning useful for AWS applications, while atomic scenes are easy to learn and represent elemental behaviors based on 3D localization of individual traffic participants. We propose a novel hierarchical model that captures co-occurrence and mutual-exclusion relationships while incorporating both low-level trajectory features and high-level scene features, with parameters learned using a structured support vector machine. We propose efficient inference that exploits the structure of our model to obtain real-time rates.

Attribute2Image: Conditional Image Generation From Visual Attributes

This paper investigates a novel problem of generating images from visual attributes. We model the image as a composite of foreground and background and develop a layered generative model with disentangled latent variables that can be learned end-to-end using a variational auto-encoder. We experiment with natural images of faces and birds and demonstrate that the proposed models are capable of generating realistic and diverse samples with disentangled latent representations. We use a general energy minimization algorithm for posterior inference of latent variables given novel images. Therefore, the learned generative models show excellent quantitative and visual results in the tasks of attribute-conditioned image reconstruction and completion.