Posts

Pseudo RGB-D for Self-Improving Monocular SLAM and Depth Prediction

Pseudo RGB-D for Self-Improving Monocular SLAM and Depth Prediction Classical monocular Simultaneous Localization And Mapping (SLAM) and the recently emerging convolutional neural networks (CNNs) for monocular depth prediction represent two largely disjoint approaches towards building a 3D map of the surrounding environment. In this paper, we demonstrate that coupling these two by leveraging the strengths of each mitigates the other's shortcomings. Specifically, we propose a joint narrow- and wide-baseline based self-improving framework, where on the one hand the CNN-predicted depth is leveraged to perform pseudo RGB-D feature-based SLAM, leading to better accuracy and robustness than the monocular RGB SLAM baseline. On the other hand, the bundle-adjusted 3D scene structures and camera poses from the more principled geometric SLAM are injected back into the depth network through novel wide-baseline losses proposed for improving the depth prediction network, which then continues to contribute towards better pose and 3D structure estimation in the next iteration. We emphasize that our framework only requires unlabeled monocular videos in both training and inference stages, and yet is able to outperform state-of-the-art self-supervised monocular and stereo depth prediction networks (e.g., Monodepth2) and the feature-based monocular SLAM system (i.e., ORB-SLAM). Extensive experiments on KITTI and TUM RGB-D datasets verify the superiority of our self-improving geometry-CNN framework.
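
As a rough illustration of how SLAM outputs can supervise the depth network in such a loop, the sketch below aligns the scale of CNN-predicted depth to sparse, bundle-adjusted map points from SLAM and penalizes the remaining disagreement. This is not the paper's actual wide-baseline loss; the function and tensor names are hypothetical.

```python
import torch

def sparse_slam_depth_loss(pred_depth, map_uv, map_depth):
    """Illustrative consistency loss between a CNN depth map and SLAM map points.

    pred_depth : (H, W) depth predicted by the monocular depth network.
    map_uv     : (N, 2) integer pixel coordinates (u, v) of triangulated map points.
    map_depth  : (N,)   depths of those points after bundle adjustment.
    """
    sampled = pred_depth[map_uv[:, 1], map_uv[:, 0]]  # depth at the map-point pixels
    # Monocular depth and monocular SLAM are each scale-ambiguous, so align the
    # two with a median-ratio scale before comparing.
    scale = torch.median(map_depth) / torch.median(sampled).clamp(min=1e-6)
    return torch.abs(sampled * scale - map_depth).mean()

# Toy usage with random data.
pred_depth = torch.rand(192, 640, requires_grad=True) + 0.1
map_uv = torch.randint(0, 192, (500, 2))
map_uv[:, 0] = torch.randint(0, 640, (500,))  # u in [0, 640), v in [0, 192)
map_depth = torch.rand(500) * 50.0 + 1.0
loss = sparse_slam_depth_loss(pred_depth, map_uv, map_depth)
loss.backward()
```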

Image Stitching and Rectification for Hand-Held Cameras

Image Stitching and Rectification for Hand-Held Cameras In this paper, we derive a new differential homography that can account for the scanline-varying camera poses in Rolling Shutter (RS) cameras, and demonstrate its application to carry out RS-aware image stitching and rectification at one stroke. Despite the high complexity of RS geometry, we focus in this paper on a special yet common input: two consecutive frames from a video stream, wherein the inter-frame motion is restricted from being arbitrarily large. This allows us to adopt a simpler differential motion model, leading to a straightforward and practical minimal solver. To deal with non-planar scenes and camera parallax in stitching, we further propose an RS-aware spatially-varying homography field following the As-Projective-As-Possible (APAP) principle. We show superior performance over state-of-the-art methods in both RS image stitching and rectification, especially for images captured by hand-held shaking cameras.
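
For context, a spatially-varying homography field in the APAP style is typically computed with a moving (weighted) DLT, where correspondences near each mesh-cell center receive larger weights. The sketch below shows only that standard global-shutter APAP weighting; the RS-aware, scanline-dependent variant proposed in the paper is not reproduced here, and the parameter values and function names are illustrative assumptions.

```python
import numpy as np

def weighted_dlt_homography(src, dst, weights):
    """Homography from 2D correspondences via a weighted Direct Linear Transform.
    (Hartley normalization is omitted here for brevity.)"""
    rows = []
    for (x, y), (u, v), w in zip(src, dst, weights):
        rows.append(w * np.array([-x, -y, -1, 0, 0, 0, u * x, u * y, u]))
        rows.append(w * np.array([0, 0, 0, -x, -y, -1, v * x, v * y, v]))
    _, _, vt = np.linalg.svd(np.asarray(rows))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]

def apap_local_homography(src, dst, cell_center, sigma=12.0, gamma=0.025):
    """Local homography for one mesh cell: Gaussian weights by distance from the
    cell center, floored at gamma so distant points still constrain the fit."""
    d2 = np.sum((src - cell_center) ** 2, axis=1)
    w = np.maximum(np.exp(-d2 / (2.0 * sigma ** 2)), gamma)
    return weighted_dlt_homography(src, dst, w)

# Toy usage: one mesh-cell center with its own local homography.
rng = np.random.default_rng(0)
src = rng.uniform(0, 640, size=(200, 2))
dst = src + rng.normal(0, 2.0, size=(200, 2)) + np.array([30.0, 5.0])
H_local = apap_local_homography(src, dst, cell_center=np.array([320.0, 240.0]))
```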

Learning Monocular Visual Odometry via Self-Supervised Long-Term Modeling

Learning Monocular Visual Odometry via Self-Supervised Long-Term Modeling Monocular visual odometry (VO) suffers severely from error accumulation during frame-to-frame pose estimation. In this paper, we present a self-supervised learning method for VO with special consideration for consistency over longer sequences. To this end, we model the long-term dependency in pose prediction using a pose network that features a two-layer convolutional LSTM module. We train the networks with purely self-supervised losses, including a cycle consistency loss that mimics the loop closure module in geometric VO. Inspired by prior geometric systems, we allow the networks to see beyond a small temporal window during training, through a novel loss that incorporates temporally distant (e.g., $O(100)$) frames. Given GPU memory constraints, we propose a stage-wise training mechanism, where the first stage operates in a local time window and the second stage refines the poses with a “global” loss given the first-stage features. We demonstrate competitive results on several standard VO datasets, including KITTI and TUM RGB-D.
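
As a rough illustration of the cycle-consistency idea (not the paper's exact formulation; the shapes and names below are assumptions), composing a predicted forward relative pose with the corresponding backward prediction should recover the identity transform, and the deviation can be penalized directly:

```python
import torch

def pose_cycle_consistency_loss(fwd_poses, bwd_poses):
    """Illustrative cycle loss on relative camera poses.

    fwd_poses : (B, T, 4, 4) homogeneous transforms T_{t -> t+1} from the pose network.
    bwd_poses : (B, T, 4, 4) homogeneous transforms T_{t+1 -> t} for the same pairs.
    Composing the two should give the identity if the predictions are consistent.
    """
    cycle = bwd_poses @ fwd_poses
    eye = torch.eye(4, device=fwd_poses.device, dtype=fwd_poses.dtype).expand_as(cycle)
    return torch.abs(cycle - eye).mean()

# Toy usage with near-identity poses.
b, t = 2, 8
fwd = torch.eye(4).repeat(b, t, 1, 1)
fwd[..., :3, 3] += 0.01 * torch.randn(b, t, 3)
bwd = torch.linalg.inv(fwd) + 0.001 * torch.randn(b, t, 4, 4)
loss = pose_cycle_consistency_loss(fwd, bwd)
```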
