Deep Supervision with Intermediate Concepts

Chi Li¹, M. Zeeshan Zia², Quoc-Huy Tran³, Xiang Yu³, Gregory D. Hager¹, Manmohan Chandraker³,⁴
¹Johns Hopkins University; ²Microsoft; ³NEC Labs America; ⁴University of California, San Diego
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018
(Top) A concept hierarchy with three concepts {y1, y2, y3} on a 2D input space. Dashed arrows indicate the finer decomposition within the previous concept in the hierarchy. Each color represents one individual class defined by the concept. (Bottom) Deep supervision with the three concepts {y1, y2, y3}.


Recent data-driven approaches to scene interpretation predominantly pose inference as an end-to-end black-box mapping, commonly performed by a Convolutional Neural Network (CNN). However, decades of work on perceptual organization in both human and machine vision suggest that there are often intermediate representations that are intrinsic to an inference task and that provide essential structure to improve generalization. In this work, we explore an approach for injecting prior domain structure into neural network training by supervising hidden layers of a CNN with intermediate concepts that are normally not observed in practice. We formulate a probabilistic framework that formalizes these notions and predicts improved generalization via this deep supervision method. One advantage of this approach is that we are able to train only on synthetic CAD renderings of cluttered scenes, where concept values can be extracted, but apply the results to real images. Our implementation achieves state-of-the-art performance in 2D/3D keypoint localization and image classification on real image benchmarks including KITTI, PASCAL VOC, PASCAL3D+, IKEA, and CIFAR100. We provide additional evidence that our approach outperforms alternative forms of supervision, such as multi-task networks.
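As a rough illustration of what supervising hidden layers with intermediate concepts can look like, the sketch below attaches one prediction head to each stage of a toy network and adds up the per-concept losses with weights. It is not the authors' implementation: the name `deep_supervision_loss`, the squared-error loss, the weights, and the toy layers are all assumptions made for this example, whereas the actual method trains a CNN on synthetic renderings.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error; a stand-in for whatever loss each concept uses."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def deep_supervision_loss(
    x: Vector,
    layers: Sequence[Callable[[Vector], Vector]],
    heads: Sequence[Callable[[Vector], Vector]],
    targets: Sequence[Vector],
    weights: Sequence[float],
) -> Tuple[float, List[float]]:
    """Weighted sum of concept losses, one concept per hidden stage."""
    total, per_concept = 0.0, []
    h = x
    for layer, head, target, w in zip(layers, heads, targets, weights):
        h = layer(h)                           # hidden activation at this depth
        loss = squared_error(head(h), target)  # supervise it with its concept
        per_concept.append(loss)
        total += w * loss
    return total, per_concept


# Toy two-stage "network" (hypothetical): stage 1 is supervised with an
# intermediate concept y1, stage 2 with the final concept y2.
layers = [lambda v: [a + 1.0 for a in v], lambda v: [2.0 * a for a in v]]
heads = [lambda h: list(h), lambda h: [sum(h)]]
targets = [[2.0, 5.0], [8.0]]
weights = [0.5, 1.0]

total, per_concept = deep_supervision_loss([1.0, 2.0], layers, heads, targets, weights)
print(per_concept, total)
```

In a real training loop the same structure would apply to CNN feature maps, with the gradient of the summed loss flowing back through every supervised stage. The ordering of concepts by depth mirrors the concept hierarchy described in the figure caption.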


Deep Supervision with Intermediate Concepts
Chi Li, M. Zeeshan Zia, Quoc-Huy Tran, Xiang Yu, Gregory D. Hager, Manmohan Chandraker
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018

Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object Parsing
Chi Li, M. Zeeshan Zia, Quoc-Huy Tran, Xiang Yu, Gregory D. Hager, Manmohan Chandraker
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017

Image Classification Results on CIFAR100

Classification error of different methods on CIFAR100. The first four are previous methods; pre-act ResNet-1001 is the current state of the art. The remaining four are the results of our method (DISCO) and its variants.

Keypoint Localization Results on KITTI-3D

PCK [alpha=0.1] accuracies (%) of different methods for 2D and 3D keypoint localization on the KITTI-3D dataset. The last column reports angular error in degrees. WN-gt-yaw uses the ground-truth pose of the test car. Bold numbers indicate the best results on ground-truth object bounding boxes. The last row presents the accuracies of our method (DISCO) on detection results from RCNN.
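For readers unfamiliar with the metric, the sketch below shows one common way to compute PCK (Percentage of Correct Keypoints). This page does not spell out the normalization, so the convention used here is an assumption: a keypoint counts as correct when its distance to the ground truth is at most alpha times the larger side of the object bounding box. The function name `pck` and the sample coordinates are hypothetical.

```python
import math
from typing import Sequence


def pck(
    pred: Sequence[Sequence[float]],
    gt: Sequence[Sequence[float]],
    bbox_size: Sequence[float],
    alpha: float = 0.1,
) -> float:
    """Fraction of keypoints within alpha * max(bbox side) of the ground truth.

    Works for 2D or 3D points as long as pred and gt use the same dimension.
    The max-side normalization is an assumed convention, not taken from this page.
    """
    threshold = alpha * max(bbox_size)
    correct = sum(1 for p, g in zip(pred, gt) if math.dist(p, g) <= threshold)
    return correct / len(gt)


# Hypothetical example: three keypoints inside a 100 x 80 pixel box.
pred = [(10.0, 10.0), (50.0, 50.0), (90.0, 20.0)]
gt = [(12.0, 10.0), (50.0, 70.0), (90.0, 25.0)]
print(pck(pred, gt, bbox_size=(100.0, 80.0), alpha=0.1))
```

Sweeping `alpha` over a range of values and plotting the result gives a PCK curve.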

Keypoint Localization Results on PASCAL VOC

PCK [alpha=0.1] accuracies (%) of different methods for 2D keypoint localization on the car category of PASCAL VOC. Bold numbers indicate the best results.

Object Segmentation Results on PASCAL3D+

Object segmentation accuracies (%) of different methods on PASCAL3D+. Best results are shown in bold.

Qualitative Results on KITTI-3D and PASCAL VOC

Visualization of 2D/3D prediction, visibility inference and instance segmentation on KITTI-3D (left) and PASCAL VOC (right). The last row shows failure cases. Circles and lines represent keypoints and their connections. Red and green indicate the left and right sides of a car; orange lines connect the two sides. Dashed lines connect keypoints if one of them is inferred to be occluded. Light blue masks show segmentation results.

Keypoint Localization Results on IKEA

3D PCK curves of our method (DISCO) and 3D-INN on the sofa (a), chair (b) and bed (c) classes of the IKEA dataset. In each plot, the x-axis is the PCK threshold alpha and the y-axis is the accuracy.

Qualitative Results on IKEA

Qualitative comparison between 3D-INN and our method (DISCO) for 3D structure prediction on the IKEA dataset.


Part of this work was done during Chi Li’s internship at NEC Labs America. We acknowledge the support of NSF under grants IIS-127228 and IIS-1637949. We also thank Rene Vidal, Alan L. Yuille, Austin Reiter and Chong You for helpful discussions.