Visual Odometry (VO) SLAM combines visual odometry with Simultaneous Localization and Mapping (SLAM) into an integrated system that provides accurate, continuous motion estimates for navigation and mapping applications.

Visual odometry (VO) is a technique used in robotics and computer vision to estimate the motion of a vehicle or camera by analyzing changes in visual input over time. It relies on computer vision algorithms to track features or keypoints across consecutive images and compute the relative camera motion between frames. By continuously updating the camera's pose (position and orientation), visual odometry provides an ongoing estimate of the camera's or vehicle's trajectory.
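The core of VO is chaining the per-frame relative motions into a global pose. A minimal sketch of this accumulation, assuming a planar 2D pose `(x, y, theta)` and a hypothetical `compose` helper (real systems work on SE(3) and obtain the deltas from feature matching):

```python
import math

def compose(pose, delta):
    """Chain a relative motion delta = (dx, dy, dtheta), expressed in the
    current camera frame, onto a global pose = (x, y, theta)."""
    x, y, th = pose
    dx, dy, dth = delta
    # Rotate the relative translation into the global frame, then add it.
    gx = x + dx * math.cos(th) - dy * math.sin(th)
    gy = y + dx * math.sin(th) + dy * math.cos(th)
    return (gx, gy, th + dth)

# Per-frame relative motions, e.g. estimated from tracked features.
deltas = [(1.0, 0.0, math.pi / 2)] * 4  # drive a 1 m square
pose = (0.0, 0.0, 0.0)
for d in deltas:
    pose = compose(pose, d)
# After four 90-degree turns the camera returns (numerically) to the origin.
```

Note that any error in a single `delta` is carried forward by every later composition, which is exactly the error-accumulation problem the abstract below addresses.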

SLAM (Simultaneous Localization and Mapping) is a broader concept that involves the real-time construction or updating of a map of an environment while simultaneously determining the location of the mapping device (such as a robot or a camera) within that environment. SLAM integrates sensor measurements, such as visual information, with the estimated motion of the device to build a map and localize the device within it.
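The map/localization coupling can be illustrated with a toy 2D example, where landmark observations made in the local camera frame are transformed into the global frame using the current pose estimate. All names here are illustrative, not from any particular SLAM library:

```python
import math

def observe_to_map(pose, observation):
    """Convert a landmark observation (range, bearing) taken in the camera
    frame into a global (x, y) map point, given pose = (x, y, theta)."""
    x, y, th = pose
    r, b = observation
    return (x + r * math.cos(th + b), y + r * math.sin(th + b))

landmark_map = []
pose = (0.0, 0.0, 0.0)
# The device sees a landmark 2 m straight ahead, moves 1 m forward,
# then sees the same landmark 1 m ahead.
landmark_map.append(observe_to_map(pose, (2.0, 0.0)))
pose = (1.0, 0.0, 0.0)  # odometry update
landmark_map.append(observe_to_map(pose, (1.0, 0.0)))
# Both observations place the landmark at the same global position; this
# agreement is what lets SLAM reuse the map to constrain (localize) the pose.
```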


Learning Monocular Visual Odometry via Self-Supervised Long-Term Modeling

Monocular visual odometry (VO) suffers severely from error accumulation during frame-to-frame pose estimation. In this paper, we present a self-supervised learning method for VO with special consideration for consistency over longer sequences. To this end, we model the long-term dependency in pose prediction using a pose network that features a two-layer convolutional LSTM module. We train the networks with purely self-supervised losses, including a cycle consistency loss that mimics the loop closure module in geometric VO. Inspired by prior geometric systems, we allow the networks to see beyond a small temporal window during training, through a novel loss that incorporates temporally distant (e.g., $O(100)$) frames. Given GPU memory constraints, we propose a stage-wise training mechanism, where the first stage operates in a local time window and the second stage refines the poses with a "global" loss given the first stage features. We demonstrate competitive results on several standard VO datasets, including KITTI and TUM RGB-D.
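The cycle consistency idea can be sketched numerically: composing the predicted frame-to-frame motions around a loop should yield the identity transform, and any residual is penalized. This is a minimal 2D toy version of the concept (the paper's actual loss operates on network-predicted SE(3) poses); `compose` and `cycle_consistency_loss` are illustrative helpers, not the authors' implementation:

```python
import math

def compose(p, d):
    """Chain relative motion d = (dx, dy, dtheta) onto pose p = (x, y, theta)."""
    x, y, th = p
    dx, dy, dth = d
    return (x + dx * math.cos(th) - dy * math.sin(th),
            y + dx * math.sin(th) + dy * math.cos(th),
            th + dth)

def cycle_consistency_loss(relative_poses):
    """Compose predicted motions around a closed loop; if the predictions
    are consistent, the composition equals the identity pose."""
    p = (0.0, 0.0, 0.0)
    for d in relative_poses:
        p = compose(p, d)
    x, y, th = p
    th = math.atan2(math.sin(th), math.cos(th))  # wrap angle to (-pi, pi]
    return x * x + y * y + th * th  # squared deviation from identity

# A perfectly consistent square loop has (numerically) zero loss;
# perturbing one predicted step makes the loss positive.
square = [(1.0, 0.0, math.pi / 2)] * 4
loss_consistent = cycle_consistency_loss(square)
noisy = square[:3] + [(1.1, 0.0, math.pi / 2)]
loss_noisy = cycle_consistency_loss(noisy)
```

Minimizing such a loss over long loops gives the network a training signal analogous to loop closure in geometric VO, without requiring ground-truth poses.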