Self-Supervised Learning is a learning paradigm where a model is trained to predict certain aspects of its input data without explicit external labels. The model generates its own supervision signal, often by creating auxiliary tasks based on the input data.

Posts

Deep Video Codec Control for Vision Models

Standardized lossy video coding is at the core of almost all real-world video processing pipelines. Rate control is used to enable standard codecs to adapt to different network bandwidth conditions or storage constraints. However standard video codecs (e.g. H.264) and their rate control modules aim to minimize video distortion w.r.t. human quality assessment. We demonstrate empirically that standard-coded videos vastly deteriorate the performance of deep vision models. To overcome the deterioration of vision performance this paper presents the first end-to-end learnable deep video codec control that considers both bandwidth constraints and downstream deep vision performance while adhering to existing standardization. We demonstrate that our approach better preserves downstream deep vision performance than traditional standard video coding.

Pseudo RGB-D for Self-Improving Monocular SLAM and Depth Prediction

Classical monocular Simultaneous Localization And Mapping (SLAM) and the recently emerging convolutional neural networks (CNNs) for monocular depth prediction represent two largely disjoint approaches towards building a 3D map of the surrounding environment. In this paper, we demonstrate that the coupling of these two by leveraging the strengths of each mitigates the other’s shortcomings. Specifically, we propose a joint narrow and wide baseline based self-improving framework, where on the one hand the CNN-predicted depth is leveraged to perform $ extit(Unknown sysvar: (pseudo))$ RGB-D feature-based SLAM, leading to better accuracy and robustness than the monocular RGB SLAM baseline. On the other hand, the bundle-adjusted 3D scene structures and camera poses from the more principled geometric SLAM are injected back into the depth network through novel wide baseline losses proposed for improving the depth prediction network, which then continues to contribute towards better pose and 3D structure estimation in the next iteration. We emphasize that our framework only requires $ extit(Unknown sysvar: ( unlabeled monocular))$ videos in both training and inference stages, and yet is able to outperform state-of-the-art self-supervised $ extit(Unknown sysvar: (monocular))$ and $ extit(Unknown sysvar: (stereo))$ depth prediction networks (e.g, Monodepth2) and feature based monocular SLAM system (i.e, ORB-SLAM). Extensive experiments on KITTI and TUM RGB-D datasets verify the superiority of our self-improving geometry-CNN framework.