Memory Warps for Long-Term Online Video Representations and Anticipation

We propose a novel memory-based online video representation that is efficient, accurate, and predictive. This is in contrast to prior works that often rely on computationally heavy 3D convolutions, ignore motion when aligning features over time, or operate in an offline mode to utilize future frames. In particular, our memory (i) holds the feature representation, (ii) is spatially warped over time to compensate for observer and scene motion, (iii) can carry long-term information, and (iv) enables predicting feature representations in future frames. By exploring a variant that operates at multiple temporal scales, we efficiently learn across even longer time horizons. We apply our online framework to object detection in videos, obtaining a 2.3x speed-up while losing only 0.9% mAP on the ImageNet-VID dataset compared to prior works that even use future frames. Finally, we demonstrate the predictive property of our representation in two novel detection setups, where features are propagated over time to (i) significantly enhance a real-time detector by more than 10% mAP in a multi-threaded online setup and (ii) anticipate objects in future frames.
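
As a concrete illustration of the warping step, the sketch below bilinearly warps a memory feature map with a per-pixel flow field and blends it with the features of the incoming frame. All names, shapes, and the blending weight are our own illustrative assumptions, not details from the paper.

    import torch
    import torch.nn.functional as F

    def warp_memory(memory, flow):
        # memory: (N, C, H, W) feature tensor carried over time
        # flow:   (N, 2, H, W) per-pixel displacement, e.g. from a small flow network
        n, _, h, w = memory.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        base = torch.stack((xs, ys)).float().unsqueeze(0)   # (1, 2, H, W) pixel grid
        coords = base + flow                                # where each output pixel samples from
        # normalize coordinates to [-1, 1], as expected by grid_sample
        coords[:, 0] = 2.0 * coords[:, 0] / (w - 1) - 1.0
        coords[:, 1] = 2.0 * coords[:, 1] / (h - 1) - 1.0
        grid = coords.permute(0, 2, 3, 1)                   # (N, H, W, 2)
        return F.grid_sample(memory, grid, mode="bilinear", align_corners=True)

    # Blend the warped memory with the new frame's features (alpha is a made-up weight).
    mem = torch.randn(1, 256, 32, 32)
    new_feat = torch.randn(1, 256, 32, 32)
    flow = torch.zeros(1, 2, 32, 32)    # zero flow -> identity warp
    alpha = 0.5
    mem = alpha * warp_memory(mem, flow) + (1 - alpha) * new_feat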

Attentive Conditional Channel-Recurrent Autoencoding for Attribute-Conditioned Face Synthesis

Attribute-conditioned face synthesis has many potential use cases, such as aiding the identification of a suspect or a missing person. Building on a conditional version of VAE-GAN, we augment the pathways connecting the latent space with a channel-recurrent architecture, in order to provide not only improved generation quality but also interpretable high-level features. In particular, to better achieve the latter, we further propose an attention mechanism over each attribute to indicate the specific latent subset responsible for its modulation. Thanks to the latent semantics formed via the channel recurrence, we envision a tool that takes the desired attributes as inputs and then performs a two-stage, general-to-specific generation of diverse and realistic faces. Lastly, we incorporate the progressive-growth training scheme into the inference, generation, and discriminator networks of our models to facilitate higher-resolution outputs. Evaluations are performed through both qualitative visual examination and quantitative metrics, namely inception scores, human preferences, and attribute classification accuracy.
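
To make the channel-recurrent idea concrete, here is a minimal sketch in which an LSTM emits the latent code block-by-block over channel groups, so earlier blocks can condition later ones, and a per-attribute attention gates the blocks. Dimensions and the gating form are illustrative assumptions, not the paper's exact design.

    import torch
    import torch.nn as nn

    class ChannelRecurrentLatent(nn.Module):
        # The latent code is split into n_blocks channel groups that an LSTM
        # emits sequentially; attr_attn holds attention logits over blocks,
        # one row per attribute (a simplified stand-in for the paper's mechanism).
        def __init__(self, block_dim=32, n_blocks=8, n_attrs=40):
            super().__init__()
            self.lstm = nn.LSTM(block_dim, block_dim, batch_first=True)
            self.attr_attn = nn.Parameter(torch.zeros(n_attrs, n_blocks))

        def forward(self, eps, attrs):
            # eps: (N, n_blocks, block_dim) noise; attrs: (N, n_attrs) in {0, 1}
            blocks, _ = self.lstm(eps)                      # recurrent latent blocks
            weights = torch.softmax(self.attr_attn, dim=1)  # (n_attrs, n_blocks)
            gate = attrs @ weights                          # (N, n_blocks)
            z = blocks * gate.unsqueeze(-1)                 # modulate blocks per attribute
            return z.flatten(1)                             # final latent code

    z = ChannelRecurrentLatent()(torch.randn(4, 8, 32),
                                 torch.randint(0, 2, (4, 40)).float())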

41.5-Tb/s Transmission Over 549 km of Field Deployed Fiber Using Throughput Optimized Probabilistic-Shaped 144QAM

We demonstrate high spectral efficiency transmission over 549 km of field-deployed single-mode fiber using probabilistic-shaped 144QAM. We achieved 41.5 Tb/s over the C-band at a spectral efficiency of 9.02 b/s/Hz using 32-Gbaud channels at a channel spacing of 33.33 GHz, and 38.1 Tb/s at a spectral efficiency of 8.28 b/s/Hz using 48-Gbaud channels at a channel spacing of 50 GHz. To the best of our knowledge, these are the highest total capacities and spectral efficiencies reported in a metro field environment using C-band only. In high spectral efficiency transmission, it is necessary to optimize back-to-back performance in order to maximize the link loss margin. Our results are enabled by the joint optimization of constellation shaping and coding overhead to minimize the gap to Shannon’s capacity, transmitter- and receiver-side digital backpropagation, signal clipping optimization, and I/Q imbalance compensation.
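
As a back-of-the-envelope check, the reported spectral efficiencies follow from the net per-channel rate divided by the channel spacing; the channel counts below are inferred from the reported totals and are not stated in the abstract.

    # Sanity check of the reported spectral efficiencies.
    # Channel counts (138 and 92) are inferred from total capacity / SE / spacing.
    def spectral_efficiency(net_rate_gbps, spacing_ghz):
        # Net per-channel rate divided by channel spacing, in b/s/Hz.
        return net_rate_gbps / spacing_ghz

    # 32-Gbaud configuration: 41.5 Tb/s over ~138 channels at 33.33 GHz spacing.
    per_ch = 41.5e3 / 138                       # ~300.7 Gb/s per channel
    print(spectral_efficiency(per_ch, 33.33))   # ~9.02 b/s/Hz

    # 48-Gbaud configuration: 38.1 Tb/s over ~92 channels at 50 GHz spacing.
    per_ch = 38.1e3 / 92                        # ~414 Gb/s per channel
    print(spectral_efficiency(per_ch, 50.0))    # ~8.28 b/s/Hz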

Battery Degradation Temporal Modeling Using LSTM Networks

Accurate modeling of battery capacity degradation is important for both battery manufacturers and energy management systems. In this paper, we develop a battery degradation model using deep learning. The model is trained with real data collected from battery storage solutions installed and operated for behind-the-meter customers. In the dataset, battery operation data are recorded at a fine time scale (every five minutes), while battery capacity is measured only once every six months. To improve training performance, we apply two preprocessing techniques, namely subsampling and feature extraction on the operation data, and we interpolate between capacity measurements at times for which battery operation features are available. We integrate both cyclic and calendar aging processes in a unified framework by extracting the corresponding features from the operation data. The proposed model uses LSTM units followed by a fully connected network to process weekly battery operation features and predict capacity degradation. The experimental results show that our method accurately predicts capacity fading and significantly outperforms baseline models, including persistence and autoregressive (AR) models.
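
A minimal sketch of such an LSTM-plus-fully-connected architecture follows; the feature dimension, hidden sizes, and sequence length are illustrative assumptions rather than the paper's values.

    import torch
    import torch.nn as nn

    class DegradationLSTM(nn.Module):
        # LSTM over weekly operation features, followed by a small
        # fully connected head that regresses remaining capacity.
        def __init__(self, n_features=16, hidden=64):
            super().__init__()
            self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
            self.head = nn.Sequential(
                nn.Linear(hidden, 32), nn.ReLU(), nn.Linear(32, 1)
            )

        def forward(self, x):
            # x: (batch, weeks, n_features) weekly cyclic + calendar aging features
            out, _ = self.lstm(x)
            return self.head(out[:, -1])   # predicted capacity (e.g. fraction of nominal)

    model = DegradationLSTM()
    weekly_feats = torch.randn(8, 26, 16)   # e.g. 26 weeks of features
    capacity = model(weekly_feats)          # (8, 1)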

Conditioning Neural Networks: A Case Study of Electrical Load Forecasting

Machine learning tasks typically involve minimizing a loss function that measures the distance between the model output and the ground truth. In some applications, in addition to the usual loss function, the output must also satisfy certain requirements for further processing. We call such requirements model conditioning. We investigate cases where the conditioner is not differentiable or cannot be expressed in closed form and, hence, cannot be directly included in the loss function of the machine learning model. We propose to replace the conditioner with a learned dummy model which is applied to the output of the main model. The entire model, composed of the main and dummy models, is trained end-to-end. Throughout training, the dummy model learns to approximate the conditioner and thus forces the main model to generate outputs that satisfy the specified requirements. We demonstrate our approach on a use case of demand-charge-aware electricity load forecasting. We show that jointly minimizing the error in the forecast load and its demand charge threshold yields significant improvements over existing load forecasting methods.
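
The training scheme can be sketched as follows, assuming a toy stand-in conditioner (here, the peak of the forecast as a demand-charge proxy): the dummy network learns to imitate the conditioner on detached outputs, while the main model receives gradients through the dummy. All architectures and data are illustrative.

    import torch
    import torch.nn as nn

    # Hypothetical non-differentiable conditioner: the demand charge threshold
    # is taken here to be the peak of the load profile (a stand-in example).
    def conditioner(y):
        return y.max(dim=1, keepdim=True).values

    main = nn.Sequential(nn.Linear(24, 64), nn.ReLU(), nn.Linear(64, 24))
    dummy = nn.Sequential(nn.Linear(24, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.Adam(list(main.parameters()) + list(dummy.parameters()), lr=1e-3)

    x = torch.randn(32, 24)   # past load window (toy data)
    y = torch.randn(32, 24)   # ground-truth future load

    for step in range(100):
        y_hat = main(x)
        # (1) teach the dummy to imitate the conditioner on current outputs
        dummy_loss = (dummy(y_hat.detach()) - conditioner(y_hat.detach())).pow(2).mean()
        # (2) train the main model on forecast error plus the conditioner
        #     requirement, with gradients flowing through the dummy
        cond_loss = (dummy(y_hat) - conditioner(y)).pow(2).mean()
        loss = (y_hat - y).pow(2).mean() + dummy_loss + cond_loss
        opt.zero_grad()
        loss.backward()
        opt.step()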

Visual Entailment Task for Visually-Grounded Language Learning

We introduce a new inference task, Visual Entailment (VE), which differs from traditional Textual Entailment (TE) in that the premise is defined by an image rather than a natural language sentence. We propose a novel dataset, SNLI-VE, for the VE task, built from the Stanford Natural Language Inference (SNLI) corpus and Flickr30K. We introduce a differentiable architecture called the Explainable Visual Entailment model (EVE) to tackle the VE problem. EVE and several other state-of-the-art visual question answering (VQA) models are evaluated on the SNLI-VE dataset, facilitating grounded language understanding and providing insights into how modern VQA-based models perform.
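
For intuition, a toy VE baseline (not the paper's EVE model) can simply fuse image-premise features with a hypothesis encoding and predict one of the three entailment classes; all dimensions here are made up.

    import torch
    import torch.nn as nn

    class VEClassifier(nn.Module):
        # Concatenate image-premise features with a hypothesis embedding and
        # classify into entailment / neutral / contradiction.
        def __init__(self, img_dim=2048, txt_dim=300, hidden=512):
            super().__init__()
            self.fuse = nn.Sequential(
                nn.Linear(img_dim + txt_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 3)   # three entailment classes
            )

        def forward(self, img_feat, hyp_emb):
            return self.fuse(torch.cat([img_feat, hyp_emb], dim=1))

    logits = VEClassifier()(torch.randn(4, 2048), torch.randn(4, 300))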

SkyRAN: A Self-Organizing LTE RAN in the Sky

We envision a flexible, dynamic airborne LTE infrastructure built upon Unmanned Aerial Vehicles (UAVs) that will provide on-demand, on-time network access, anywhere. In this paper, we design, implement, and evaluate SkyRAN, a self-organizing UAV-based LTE RAN (Radio Access Network) that is a key component of this UAV LTE infrastructure network. SkyRAN determines the UAV's operating position in 3D airspace so as to optimize connectivity to all the UEs on the ground. It realizes this by overcoming various challenges in constructing and maintaining radio environment maps to UEs, which guide the UAV's position in real time. SkyRAN is designed to be scalable, in that it can be quickly deployed to provide efficient connectivity even over a large area, and adaptive, in that it reacts to changes in the terrain and UE mobility to maximize LTE coverage performance while minimizing operating overhead. We implement SkyRAN on a DJI Matrice 600 Pro drone and evaluate it over a 90,000 m^2 operating area. Our testbed results indicate that SkyRAN can place the UAV in the optimal location with about 30 seconds of measurement flight. On average, SkyRAN achieves a throughput of 0.9-0.95x of the optimal, which is about 1.5-2x that of other popular baseline schemes.
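
The placement problem can be illustrated with a toy search over candidate 3D hover positions; the free-space path-loss model below is a stand-in for the paper's measured radio environment maps, and all constants and positions are made up.

    import itertools
    import math

    ues = [(10.0, 20.0), (250.0, 40.0), (120.0, 280.0)]   # UE ground positions (m)

    def path_loss_db(uav, ue, f_mhz=2600.0):
        # Free-space path loss over the slant range, including UAV altitude.
        d = math.dist(uav[:2], ue)
        d3 = math.hypot(d, uav[2])
        return 20 * math.log10(d3) + 20 * math.log10(f_mhz) - 27.55

    def min_rsrp_dbm(uav, tx_dbm=30.0):
        # Connectivity is limited by the worst-served UE.
        return min(tx_dbm - path_loss_db(uav, ue) for ue in ues)

    candidates = itertools.product(range(0, 301, 50),    # x (m)
                                   range(0, 301, 50),    # y (m)
                                   range(40, 121, 20))   # altitude (m)
    best = max(candidates, key=min_rsrp_dbm)
    print("best UAV position:", best,
          "worst-UE RSRP:", round(min_rsrp_dbm(best), 1), "dBm")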

Optimal Transport Classifier: Defending Against Adversarial Attacks by Regularized Deep Embedding

Recent studies have demonstrated the vulnerability of deep convolutional neural networks to adversarial examples. Inspired by the observations that the intrinsic dimension of image data is much smaller than its pixel-space dimension and that the vulnerability of neural networks grows with the input dimension, we propose to embed high-dimensional input images into a low-dimensional space before performing classification. However, arbitrarily projecting the input images to a low-dimensional space without regularization will not improve the robustness of deep neural networks. Leveraging optimal transport theory, we propose a new framework, the Optimal Transport Classifier (OT-Classifier), and derive an objective that minimizes the discrepancy between the distribution of the true label and the distribution of the OT-Classifier output. Experimental results on several benchmark datasets show that our proposed framework achieves state-of-the-art performance against strong adversarial attack methods.
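
One illustrative way to penalize the transport cost between the output distribution and the label distribution is an entropy-regularized Sinkhorn term added to the usual classification loss; the sketch below is our own generic construction, not necessarily the paper's exact objective.

    import torch
    import torch.nn as nn

    def sinkhorn_distance(x, y, eps=0.1, iters=50):
        # Entropy-regularized OT cost between two point clouds, uniform weights.
        cost = torch.cdist(x, y, p=2) ** 2            # pairwise squared distances
        n, m = cost.shape
        K = torch.exp(-cost / eps)
        a, b = torch.full((n,), 1.0 / n), torch.full((m,), 1.0 / m)
        u, v = a.clone(), b.clone()
        for _ in range(iters):                        # Sinkhorn scaling iterations
            u = a / (K @ v)
            v = b / (K.T @ u)
        plan = torch.diag(u) @ K @ torch.diag(v)
        return (plan * cost).sum()

    # Low-dimensional embedding + classifier (dimensions are illustrative).
    embed = nn.Sequential(nn.Flatten(), nn.Linear(784, 16))
    clf = nn.Linear(16, 10)

    x = torch.randn(64, 1, 28, 28)
    y = torch.randint(0, 10, (64,))
    z = embed(x)
    probs = clf(z).softmax(dim=1)
    onehot = nn.functional.one_hot(y, 10).float()
    # Cross-entropy plus an OT penalty pulling outputs toward the label distribution.
    loss = nn.functional.cross_entropy(clf(z), y) + 0.1 * sinkhorn_distance(probs, onehot)
    loss.backward()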

Unseen Object Segmentation in Videos via Transferable Representations

To learn object segmentation models in videos, conventional methods require a large amount of pixel-wise ground-truth annotations. However, collecting such supervised data is time-consuming and labor-intensive. In this paper, we exploit existing annotations in source images and transfer this visual information to segment videos with unseen object categories. Without using any annotations in the target video, we propose a method to jointly mine useful segments and learn feature representations that better adapt to the target frames. The entire process is decomposed into two tasks: (1) maximizing a submodular function to select object-like segments, and (2) learning a CNN model with a transferable module that adapts seen categories in the source domain to the unseen target video. We present an iterative update scheme between the two tasks to self-learn the final solution for object segmentation. Experimental results on numerous benchmark datasets show that the proposed method performs favorably against state-of-the-art algorithms.
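
The segment-selection step can be illustrated with greedy maximization of a facility-location-style submodular objective; the scoring and similarity terms below are plausible stand-ins, not the paper's exact formulation.

    import numpy as np

    def greedy_facility_location(scores, sim, k):
        # scores: (n,) object-likeness score of each candidate segment
        # sim:    (n, n) pairwise segment similarity
        # k:      number of segments to select
        n = len(scores)
        selected, covered = [], np.zeros(n)
        for _ in range(k):
            best_gain, best_j = -np.inf, None
            for j in range(n):
                if j in selected:
                    continue
                # Marginal gain: improved coverage weighted by object-likeness.
                gain = np.sum(scores * np.maximum(sim[:, j] - covered, 0.0))
                if gain > best_gain:
                    best_gain, best_j = gain, j
            selected.append(best_j)
            covered = np.maximum(covered, sim[:, best_j])
        return selected

    rng = np.random.default_rng(0)
    n = 50
    sim = rng.random((n, n)); sim = (sim + sim.T) / 2
    print(greedy_facility_location(rng.random(n), sim, k=5))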

Scalable Deep k-Subspace Clustering

Subspace clustering algorithms are notorious for their scalability issues because building and processing large affinity matrices is computationally demanding. In this paper, we introduce a method that simultaneously learns an embedding space along with subspaces within it, so as to minimize a notion of reconstruction error, thus addressing the problem of subspace clustering in an end-to-end learning paradigm. To achieve our goal, we propose a scheme to update subspaces within a deep neural network, which in turn frees us from the need for an affinity matrix to perform clustering. Unlike previous attempts, our method easily scales up to large datasets, making it unique in the context of unsupervised learning with deep architectures. Our experiments show that our method significantly improves clustering accuracy while enjoying a smaller memory footprint.
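
For intuition, one assignment/update round of classical k-subspace clustering in an embedding space looks as follows; the paper performs the subspace updates inside a deep network, whereas this sketch uses an explicit SVD on fixed embeddings.

    import numpy as np

    def k_subspace_step(z, bases):
        # z:     (n, d) embedded points
        # bases: list of (d, r) orthonormal subspace bases
        # Assign each point to the subspace with the smallest reconstruction residual.
        residuals = np.stack([
            np.linalg.norm(z - (z @ U) @ U.T, axis=1) for U in bases
        ])                                    # (k, n)
        labels = residuals.argmin(axis=0)
        # Update each basis from the top-r right singular vectors of its cluster.
        new_bases = []
        for j, U in enumerate(bases):
            pts = z[labels == j]
            if len(pts) < U.shape[1]:
                new_bases.append(U)           # keep the old basis for tiny clusters
                continue
            _, _, vt = np.linalg.svd(pts, full_matrices=False)
            new_bases.append(vt[: U.shape[1]].T)
        return labels, new_bases

    rng = np.random.default_rng(0)
    z = rng.normal(size=(200, 10))
    bases = [np.linalg.qr(rng.normal(size=(10, 2)))[0] for _ in range(3)]
    for _ in range(10):
        labels, bases = k_subspace_step(z, bases)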