We present a real-time, accurate, large-scale monocular visual odometry system for real-world autonomous outdoor driving applications. The key contributions of our work are a series of architectural innovations that address the challenge of robust multithreading, even for scenes with large motions and rapidly changing imagery. Our design is extensible to three or more parallel CPU threads. The system uses 3D-2D correspondences for robust pose estimation across all threads, followed by local bundle adjustment in the primary thread. In contrast to prior work, epipolar search operates in parallel in the other threads to generate new 3D points at every frame. This significantly boosts robustness and accuracy, since only extensively validated 3D points with long tracks are inserted at keyframes. Fast-moving vehicles also necessitate immediate global bundle adjustment, which is triggered by our novel keyframe design in parallel with pose estimation in a thread-safe architecture. To handle inevitable tracking failures, a recovery method is provided. Scale drift is corrected only occasionally, using a novel mechanism that detects (rather than assumes) local planarity of the road by combining information from triangulated 3D points and the inter-image planar homography. Our system is optimized to output pose within 50 ms in the worst case, while average-case operation exceeds 30 fps. Evaluations are presented on the challenging KITTI dataset for autonomous driving, where we achieve better rotation and translation accuracy than other state-of-the-art systems.
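To illustrate the idea behind scale correction, the following is a minimal sketch, not the paper's method: it assumes the camera's true mounting height above a locally planar road is known (a common setup for monocular driving datasets), estimates the camera height in the SFM's arbitrary scale from triangulated road points, and derives a scale factor. The function name and the median-based height estimate are illustrative assumptions.

```python
import statistics

def correct_scale(ground_points, true_cam_height):
    """Rescale monocular SFM output using a known camera mounting height.

    ground_points: (x, y, z) triangulated road points in the camera frame,
    with the y axis pointing down, so road points have positive y.
    true_cam_height: actual camera mounting height in metres.
    Returns the scale factor to apply to translations and 3D points.
    """
    # Robust estimate of camera height in the SFM's arbitrary scale:
    # the median y-coordinate of triangulated ground points.
    est_cam_height = statistics.median(p[1] for p in ground_points)
    return true_cam_height / est_cam_height

# Example: scale has drifted so the road appears ~0.85 units below the
# camera, while the camera is actually mounted 1.7 m above the road.
pts = [(0.0, 0.84, 5.0), (1.0, 0.85, 6.0), (-1.0, 0.86, 7.0)]
s = correct_scale(pts, 1.7)
print(round(s, 2))  # → 2.0
```

In the actual system, such a correction would be applied only when planarity of the road is detected, as the abstract describes, rather than assumed at every frame.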
We propose novel multithreaded architectures for steady-state, keyframe, and recovery operations that maintain a stable set of 3D points with long, extensively validated tracks. This is crucial for autonomous driving, where scene points rapidly disappear from the field of view due to fast motion.
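A hypothetical sketch of the steady-state thread layout described above: a secondary thread extends candidate point tracks frame by frame (standing in for epipolar search) and promotes only points validated over a minimum number of frames, to be inserted at the next keyframe. The queue names, frame representation, and the threshold `MIN_TRACK_LEN` are illustrative assumptions, not details from the paper.

```python
import queue
import threading

MIN_TRACK_LEN = 5             # frames a candidate must survive (assumed value)

frames = queue.Queue()        # incoming frames for the epipolar-search thread
validated = queue.Queue()     # long-track points ready for keyframe insertion
track_lengths = {}            # candidate point id -> frames tracked so far

def epipolar_worker():
    # Only this thread mutates track_lengths; cross-thread handoff goes
    # through thread-safe queues, so no explicit locks are needed here.
    while True:
        frame = frames.get()
        if frame is None:                # shutdown sentinel
            break
        for pid in frame["candidates"]:  # candidate ids matched in this frame
            track_lengths[pid] = track_lengths.get(pid, 0) + 1
            if track_lengths[pid] == MIN_TRACK_LEN:
                validated.put(pid)       # promote at the next keyframe

worker = threading.Thread(target=epipolar_worker)
worker.start()
for i in range(6):                       # point 7 is matched in 6 frames,
    frames.put({"candidates": [7]})      # point 9 in only 3
    if i < 3:
        frames.put({"candidates": [9]})
frames.put(None)
worker.join()

promoted = []
while not validated.empty():
    promoted.append(validated.get())
print(promoted)  # → [7]
```

Only the long-lived track (id 7) is promoted; the short-lived one (id 9) never reaches the validated set, mirroring the idea that only extensively validated points are inserted at keyframes.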
We demonstrate our initial performance on the KITTI and Hague datasets. Note that this performance has been superseded by our CVPR 2014 paper. Compare the GPS output overlaid on a map with the trajectory obtained using our purely vision-based monocular SFM.
Our system is real-time, producing output at 33 frames per second on average. In the worst case, output is produced within 50 milliseconds, which satisfies the constraints of autonomous driving. Timings for individual frames from two sequences of the KITTI dataset are shown here.
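One common way to meet such a hard per-frame deadline, sketched below purely for illustration (this is not the paper's scheduler), is to always run the mandatory pose estimate and skip optional refinement whenever the remaining budget looks too tight. The budget split and function names are assumptions.

```python
import time

FRAME_BUDGET_S = 0.050  # 50 ms worst-case latency, from the text above

def process_frame(estimate_pose, refine_pose):
    """Always compute a pose; run optional refinement only when
    comfortably within the per-frame budget (illustrative policy)."""
    start = time.monotonic()
    pose = estimate_pose()
    # Refine only if less than half the budget has been spent so far.
    if time.monotonic() - start < FRAME_BUDGET_S / 2:
        pose = refine_pose(pose)
    return pose

# Fast frame: refinement runs.
fast = process_frame(lambda: "coarse", lambda p: p + "+refined")
print(fast)  # → coarse+refined

# Slow frame (simulated 40 ms pose step): refinement is skipped.
slow = process_frame(lambda: (time.sleep(0.04), "coarse")[1],
                     lambda p: p + "+refined")
print(slow)  # → coarse
```

A pose is therefore available every frame, while expensive refinement degrades gracefully under load instead of blowing the deadline.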
Our real-time monocular SFM is comparable in accuracy to state-of-the-art stereo systems and significantly outperforms other monocular systems. A few example sequences from the KITTI benchmark are shown here.
Last updated May 31, 2014.