We conduct research in computer vision and machine learning, focusing on achieving and sustaining excellence in two main directions: fine-grained image recognition and 3D reconstruction. Our research aims at technological breakthroughs in instance-level fine-grained recognition and in visual 3D scene understanding for autonomous driving.
Fine-grained Image Recognition
We believe image recognition that effectively bridges the physical and information worlds must provide fine-grained information. A much wider range of real-world applications becomes available for technology that recognizes not just a chair in a cellphone image, but also its specific model, which can be cross-referenced with an online catalog. The key research challenges we address in fine-grained recognition are the large number of visually similar categories and the paucity of large-scale fine-grained image data.
Our deep learning architecture for fine-grained image recognition learns hierarchically refined features from different levels of class granularity. This enables well-regularized deep learning that yields excellent results with limited training data (fine-grained labels are often expensive to obtain). Our current effort is focused on building very deep neural networks.
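The regularization idea can be sketched as a shared feature representation feeding two classifier heads, one per granularity level, with the coarse head acting as a regularizer when fine labels are scarce. This is a minimal numpy sketch of that idea only; the loss weighting `lam`, the toy dimensions and all names are illustrative assumptions, not the group's actual architecture.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def hierarchical_loss(features, W_coarse, W_fine, y_coarse, y_fine, lam=0.5):
    """Joint loss over two granularity levels sharing one feature vector.

    The coarse head (e.g. "chair") regularizes the fine head
    (e.g. a specific chair model) when fine labels are limited."""
    loss_c = cross_entropy(softmax(features @ W_coarse), y_coarse)
    loss_f = cross_entropy(softmax(features @ W_fine), y_fine)
    return loss_f + lam * loss_c

# Toy data: 8 samples with 16-d shared features, 3 coarse / 10 fine classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))
W_c = rng.normal(size=(16, 3))
W_f = rng.normal(size=(16, 10))
y_c = rng.integers(0, 3, size=8)
y_f = rng.integers(0, 10, size=8)
loss = hierarchical_loss(X, W_c, W_f, y_c, y_f)
```

In practice both heads would be trained jointly by backpropagation; the sketch only shows how the two granularity levels combine into one objective.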
Metric learning, a popular approach to handling large within-class variation, embeds high-dimensional data into a low-dimensional manifold. Subsequent nearest-neighbor classification can then deal with a large number of classes, up to millions in some applications. Our metric learning consistently outperforms linear SVM by 3-5% in classification accuracy. To handle truly large-scale data (millions of images, feature dimensions in the hundreds of thousands), we have developed a random projection based approach that solves the optimization in its dual, which makes metric learning as fast as linear SVM classifiers.
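The embed-then-classify pipeline can be sketched as follows: project features through a learned linear map and classify by the nearest neighbor under the induced (Mahalanobis-style) distance. The identity projection in the toy usage is purely illustrative; in practice the projection is optimized, e.g. in the dual with random projections for scale, as described above.

```python
import numpy as np

def embed(X, L):
    """Project high-dimensional features into a low-dimensional metric space."""
    return X @ L.T

def nn_classify(query, gallery, labels, L):
    """1-nearest-neighbor classification under the learned metric
    d(x, y) = ||L x - L y||^2, i.e. a Mahalanobis distance with M = L^T L."""
    q = embed(query, L)
    g = embed(gallery, L)
    d = ((g - q) ** 2).sum(axis=1)
    return labels[np.argmin(d)]

# Toy usage with an identity projection (illustration only).
gallery = np.array([[0.0, 0.0], [10.0, 10.0]])
labels = np.array([0, 1])
pred = nn_classify(np.array([1.0, 1.0]), gallery, labels, np.eye(2))  # class 0
```

Because classification reduces to a nearest-neighbor lookup in the embedded space, adding a new class only requires adding gallery points, which is what makes very large class counts tractable.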
|Unsupervised Feature Learning
While supervised feature learning methods (such as deep neural networks) have recently set new benchmarks in image classification, unsupervised feature learning is a useful complement. In the past, our group has developed popular unsupervised feature learning approaches such as locality-constrained linear coding (LLC). We continue to expand this line of research, aiming at a comprehensive picture of how feature learning enables fine-grained image recognition.
Boosting is a key supervised feature learning direction that we pursue to learn salient regions for fine-grained image recognition. Building on cascaded boosting, we have developed the Regionlets approach to object detection, a key component of our object-centric feature learning for fine-grained image recognition.
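The boosting mechanism underlying such cascaded approaches can be illustrated with a tiny AdaBoost over 1-D threshold stumps: each round reweights the training samples so the next weak learner focuses on what the previous ones got wrong. This is a generic textbook sketch of boosting, not the Regionlets algorithm itself; all names and the toy data are assumptions.

```python
import math

def adaboost_stumps(X, y, rounds=3):
    """Minimal AdaBoost on 1-D threshold stumps, labels y in {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        best = None
        for thr in sorted(set(X)):
            for sign in (1, -1):
                pred = [sign if x >= thr else -sign for x in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, thr, sign, pred)
        err, thr, sign, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # clamp for the log below
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, sign))
        # Upweight misclassified samples, then renormalize.
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, pred)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(a * (s if x >= t else -s) for a, t, s in ensemble)
    return 1 if score >= 0 else -1
```

In region-based variants, each weak learner would score an image region instead of a raw scalar, so boosting implicitly selects the salient regions.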
|Object-Centric Feature Learning with Object Detection
Detection is a crucial precursor to applications such as face recognition. By eliminating background clutter, detection enables fine-grained image recognition to capture subtle differences between visually similar classes. Our research addresses key challenges in generic object detection and demonstrates its benefits for image classification.
Near-duplicate Image Retrieval
Near-duplicate image retrieval allows, for instance, capturing an image of a building and finding images of the same building in a photo album. Our technology relies on matching local features such as SIFT. Our research in this area focuses on optimizing the construction of a vocabulary tree for better feature matching, for instance by incorporating global semantic information, and on developing re-ranking approaches for verification.
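Once a vocabulary tree is built (typically by hierarchical k-means over local descriptors), each descriptor is quantized into a visual word by greedily descending the tree, picking the nearest child centroid at every level. A minimal sketch of that lookup, with a hand-built toy tree standing in for a trained one:

```python
import numpy as np

def quantize(desc, tree):
    """Descend a vocabulary tree: at each level pick the nearest child
    centroid. The path of chosen indices identifies the visual word."""
    path = []
    node = tree
    while node["children"]:
        dists = [np.linalg.norm(desc - c["centroid"]) for c in node["children"]]
        i = int(np.argmin(dists))
        path.append(i)
        node = node["children"][i]
    return tuple(path)

# Hand-built two-level toy tree (a real tree comes from hierarchical k-means).
root = {"centroid": None, "children": [
    {"centroid": np.array([0.0, 0.0]), "children": [
        {"centroid": np.array([-1.0, -1.0]), "children": []},
        {"centroid": np.array([1.0, 1.0]), "children": []},
    ]},
    {"centroid": np.array([10.0, 10.0]), "children": [
        {"centroid": np.array([8.0, 8.0]), "children": []},
        {"centroid": np.array([12.0, 12.0]), "children": []},
    ]},
]}
word = quantize(np.array([9.0, 9.0]), root)
```

Because each level compares against only a handful of centroids, lookup cost is logarithmic in the vocabulary size, which is what makes very large vocabularies practical for retrieval.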
3D Reconstruction and Scene Understanding
Humans rely on vision to drive, so computer vision should play an essential role in sensing for autonomous driving. To that end, our group is developing visual 3D scene understanding technology. Simply put, visual 3D scene understanding takes video as input, detects road objects (pedestrians, cars, cyclists, etc.), and computes their 3D coordinates relative to the ego-vehicle. Such technology would clearly be essential for sensing the surrounding environment in autonomous driving, if (a big "if") it can be made sufficiently accurate and robust.
|Real-Time Structure from Motion
Structure from motion (SFM) estimates camera motion and the 3D positions of salient points from a video sequence. We have developed state-of-the-art stereo and monocular SFM systems that operate at over 30 frames per second. Our monocular system effectively solves the problem of scale drift in driving scenarios, achieving top performance on the challenging KITTI dataset (2.5% translation error), and forms the backbone of our collaborations with leading automakers. Our current directions fuse sensory, map and scene information to achieve long-term robust operation that meets manufacturer specifications.
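One standard way to resolve the monocular scale ambiguity in driving scenarios, sketched below, is to exploit the known camera mounting height above the road: if SFM places the ground plane at some distance in its arbitrary units, the ratio of the true height to that distance rescales every translation to metric units. This is a textbook illustration of the principle, not the group's actual system; the camera height value is an assumption.

```python
def correct_scale(translation, est_ground_dist, camera_height=1.5):
    """Resolve monocular scale ambiguity with a known camera height.

    If SFM estimates the ground plane at est_ground_dist (in SFM units)
    but the camera is physically camera_height meters above the road,
    every estimated translation must be rescaled by their ratio."""
    s = camera_height / est_ground_dist
    return [s * t for t in translation]

# SFM says the ground is 0.5 units below the camera, which is really
# mounted 1.5 m above the road, so translations scale by 3x.
metric_t = correct_scale([1.0, 0.0, 2.0], 0.5, camera_height=1.5)
```

Re-estimating this scale continuously from ground-plane observations, rather than once at startup, is what counters scale drift over long sequences.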
|Object Detection in Videos
In contrast to object detection in a single image, detecting objects in videos is more challenging due to motion blur, color distortion and the need for real-time processing. We rely on our Regionlets technology, which classifies proposal bounding boxes and then relocalizes them. The framework proves very effective at detecting objects in road scenes, such as pedestrians, cars and cyclists, achieving top performance on the KITTI dataset for each of those categories. We are also investigating how to complement Regionlets with neural network methods.
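Proposal-then-classify pipelines of this kind typically end with non-maximum suppression: overlapping boxes compete, and only the highest-scoring one per object survives. A minimal sketch of that final step (a generic illustration, not the Regionlets relocalization itself):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring
    proposals, discarding any box that overlaps a kept one too much."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < thresh for j in keep):
            keep.append(i)
    return keep

# Two near-duplicate car proposals plus one distant pedestrian proposal.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = nms(boxes, [0.9, 0.8, 0.7])
```

The overlap threshold trades off suppressing duplicates against merging genuinely adjacent objects, which matters in crowded road scenes.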
|Multiple Target Tracking
Object tracking is crucial for integrating semantic object information over time and predicting future locations. We work in a tracking-by-detection paradigm, where the key challenge in road scenes is to accurately associate detection outputs in crowded scenarios, while achieving accurate motion estimation and enhancing detection accuracy. We achieve top performance on the KITTI benchmark for both the car and pedestrian categories.
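The association step at the heart of tracking-by-detection can be sketched as greedy frame-to-frame matching on box overlap: each existing track claims the unclaimed detection it overlaps most, above a threshold. Real systems use stronger costs (appearance, motion models) and optimal assignment; this is a deliberately minimal illustration with assumed names.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

def associate(tracks, detections, thresh=0.3):
    """Greedily match each track (id -> box) to its best-overlapping
    unused detection in the new frame; unmatched tracks are left out."""
    matches, used = {}, set()
    for t_id, t_box in tracks.items():
        best, best_iou = None, thresh
        for d_id, d_box in enumerate(detections):
            if d_id in used:
                continue
            o = iou(t_box, d_box)
            if o > best_iou:
                best, best_iou = d_id, o
        if best is not None:
            matches[t_id] = best
            used.add(best)
    return matches
```

Tracks that go unmatched for several frames would be terminated, and unmatched detections spawn new tracks; that bookkeeping is omitted here.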
|3D Scene Understanding
We combine the outputs of object detection with 3D information from SFM within a cognitive loop to accurately localize the position and orientation of each traffic participant, relative to a global map. Our current real-time framework localizes near objects within 7% error and distant objects within 12% error. Our current work integrates information from SFM, object tracking and video segmentation to produce localizations that are consistent with both scene elements and other traffic participants.
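A simple geometric building block for such localization, shown below, is back-projecting a detection box's ground-contact point to 3D under a flat-ground assumption: the ray through the bottom-center pixel intersects the road plane at a recoverable depth. This textbook approximation is only an illustration of the geometry involved, not the group's cognitive-loop framework; the intrinsics and camera height in the usage are assumed values.

```python
def localize_on_ground(u, v, fx, fy, cx, cy, camera_height=1.5):
    """Back-project pixel (u, v), assumed to touch a flat ground plane
    camera_height meters below the camera, to camera coordinates.

    With y pointing down, the viewing ray (x, y, 1) scaled by depth Z
    meets the plane y = camera_height, which fixes Z."""
    yr = (v - cy) / fy  # ray slope toward the ground
    if yr <= 0:
        raise ValueError("point is at or above the horizon; no ground intersection")
    Z = camera_height / yr          # depth where the ray meets the ground
    X = (u - cx) / fx * Z           # lateral offset
    return X, camera_height, Z

# Assumed intrinsics (focal length 700 px, principal point 640, 360)
# and a box whose bottom-center pixel is (640, 430).
X, Y, Z = localize_on_ground(640, 430, 700, 700, 640, 360)
```

The error of this approximation grows with distance, since a one-pixel error in the box bottom maps to an ever larger depth error, which is consistent with near objects being localized more accurately than distant ones.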
|BRDF and Illumination-Invariant Dense Reconstruction
Traditional methods for 3D reconstruction assume diffuse reflectance and simple lighting conditions. We develop novel reconstruction frameworks that demonstrate shape recovery is possible even with complex reflectance and illumination. Our current research extends traditional shape recovery approaches such as optical flow, multiview stereo and photometric stereo to handle such real-world challenges.
|Dense Reconstruction with Semantic Priors
Traditional multiview stereo (MVS) faces challenges with textureless or glossy objects, or with wide camera baselines. We incorporate semantic information from learned class-level shape priors and object detection to obtain dense reconstructions with up to 30% less error than traditional MVS. Our ongoing work considers semantic MVS for indoor scenes, exploiting scene-level priors such as geometric layouts.