The University of Maryland, College Park, founded in 1856, is a distinguished public research university. It is the largest university in Maryland and the Washington metropolitan area, and is located nine miles from downtown Washington, D.C. It is a public research university offering over 200 degree-granting programs across its eleven schools and colleges, attracting top students and renowned faculty from around the world. It is known for its innovative and challenging academic programs, strong research opportunities, and a vibrant multicultural community. NEC Labs America collaborates with the University of Maryland, College Park, on research spanning edge AI, distributed systems, machine learning interpretability, learning with limited labels, domain adaptation, and continual learning. Our research improves model generalization in dynamic conditions. We jointly study robust inference, communication-efficient federated learning, and privacy-aware analytics for large-scale deployments. Please read about our latest news and collaborative publications with the University of Maryland, College Park.

Posts

GLoSH: Global-Local Spherical Harmonics for Intrinsic Image Decomposition

Traditional intrinsic image decomposition focuses on decomposing images into reflectance and shading, leaving surfaces normals and lighting entangled in shading. In this work, we propose a Global-Local Spherical Harmonics (GLoSH) lighting model to improve the lighting component, and jointly predict reflectance and surface normals. The global SH models the holistic lighting while local SH account for the spatial variation of lighting. Also, a novel non-negative lighting constraint is proposed to encourage the estimated SH to be physically meaningful. To seamlessly reflect the GLoSH model, we design a coarse-to-fine network structure. The coarse network predicts global SH, reflectance and normals, and the fine network predicts their local residuals. Lacking labels for reflectance and lighting, we apply synthetic data for model pre-training and fine-tune the model with real data in a self-supervised way. Compared to the state-of-the-art methods only targeting normals or reflectance and shading, our method recovers all components and achieves consistently better results on three real datasets, IIW, SAW and NYUv2.

Conditional GAN with Discriminative Filter Generation for Text-to-Video Synthesis

Developing conditional generative models for text-to-video synthesis is an extremely challenging yet an important topic of research in machine learning. In this work, we address this problem by introducing Text-Filter conditioning Generative Adversarial Network (TFGAN), a conditional GAN model with a novel multi-scale text-conditioning scheme that improves text-video associations. By combining the proposed conditioning scheme with a deep GAN architecture, TFGAN generates high quality videos from text on challenging real-world video datasets. In addition, we construct a synthetic dataset of text-conditioned moving shapes to systematically evaluate our conditioning scheme. Extensive experiments demonstrate that TFGAN significantly outperforms existing approaches, and can also generate videos of novel categories not seen during training.

Zero-Shot Object Detection

We introduce and tackle the problem of zero-shot object detection (ZSD), which aims to detect object classes which are not observed during training. We work with a challenging set of object classes, not restricting ourselves to similar and/or fine-grained categories as in prior works on zero-shot classification. We present a principled approach by first adapting visual-semantic embeddings for ZSD. We then discuss the problems associated with selecting a background class and motivate two background-aware approaches for learning robust detectors. One of these models uses a fixed background class and the other is based on iterative latent assignments. We also outline the challenge associated with using a limited number of training classes and propose a solution based on dense sampling of the semantic label space using auxiliary data with a large number of categories. We propose novel splits of two standard detection datasets – MSCOCO and VisualGenome, and present extensive empirical results in both the traditional and generalized zero-shot settings to highlight the benefits of the proposed methods. We provide useful insights into the algorithm and conclude by posing some open questions to encourage further research.

WarpNet: Weakly Supervised Matching for Single-View Reconstruction

Our WarpNet matches images of objects in fine-grained datasets without using part annotations. It aligns an object in one image with a different object in another by exploiting a fine-grained dataset to create artificial data for training a Siamese network with an unsupervised discriminative learning approach. The output of the network acts as a spatial prior that allows generalization at test time to match real images across variations in appearance, viewpoint and articulation. This allows single-view reconstruction with quality comparable to using human annotation.