Disentangled Recurrent Wasserstein Auto-Encoder Learning disentangled representations leads to interpretable models and facilitates data generation with style transfer, which has been extensively studied on static data such as images in an unsupervised learning framework. However, only a few works have explored unsupervised disentangled sequential representation learning due to the challenges of generating sequential data. In this paper, we propose the recurrent Wasserstein Autoencoder (R-WAE), a new framework for generative modeling of sequential data. R-WAE disentangles the representation of an input sequence into static and dynamic factors (i.e., time-invariant and time-varying parts). Our theoretical analysis shows that R-WAE minimizes an upper bound of a penalized form of the Wasserstein distance between the model distribution and the sequential data distribution, and simultaneously maximizes the mutual information between the input data and each of the disentangled latent factors. This is superior to the (recurrent) VAE, which does not explicitly enforce mutual information maximization between input data and disentangled latent representations. When the number of actions in sequential data is available as weak supervision information, R-WAE is extended to learn a categorical latent representation of actions to improve its disentanglement. Experiments on a variety of datasets show that our models outperform other baselines with the same settings in terms of disentanglement and unconditional video generation, both quantitatively and qualitatively.
Ranking-based Convolutional Neural Network Models for Peptide-MHC Binding Prediction T-cell receptors can recognize foreign peptides bound to major histocompatibility complex (MHC) class-I proteins, and thus trigger the adaptive immune response. Therefore, identifying peptides that can bind to MHC class-I molecules plays a vital role in the design of peptide vaccines. Many computational methods, for example, the state-of-the-art allele-specific method MHCflurry, have been developed to predict the binding affinities between peptides and MHC molecules. In this manuscript, we develop two allele-specific Convolutional Neural Network-based methods named ConvM and SpConvM to tackle the binding prediction problem. Specifically, we formulate the problem as optimizing the rankings of peptide-MHC bindings via ranking-based learning objectives. Such optimization is more robust and tolerant of measurement inaccuracy in binding affinities, and therefore enables more accurate prioritization of binding peptides. In addition, we develop a new position encoding method in ConvM and SpConvM to better identify the most important amino acids for the binding events. We conduct a comprehensive set of experiments using the latest Immune Epitope Database (IEDB) datasets. Our experimental results demonstrate that our models significantly outperform the state-of-the-art methods, including MHCflurry, with an average percentage improvement of 6.70% on AUC and 17.10% on ROC5 across 128 alleles.
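To illustrate what a ranking-based objective looks like in general, here is a minimal sketch of a pairwise hinge ranking loss. It is not the objective used by ConvM or SpConvM, which the abstract does not specify. The function name, the `margin` parameter, and the convention that a larger label means stronger binding are all assumptions made for this example.

```python
def pairwise_hinge_loss(scores, affinities, margin=1.0):
    """Generic pairwise ranking loss (illustrative, not the paper's objective).

    scores:     model outputs, one per peptide
    affinities: binding labels; ASSUMPTION: larger means stronger binding
    For every pair where peptide i binds more strongly than peptide j,
    the loss penalizes score[i] not exceeding score[j] by at least `margin`.
    """
    total, pairs = 0.0, 0
    n = len(scores)
    for i in range(n):
        for j in range(n):
            if affinities[i] > affinities[j]:
                total += max(0.0, margin - (scores[i] - scores[j]))
                pairs += 1
    # Average over ordered pairs; zero if no pair has distinct labels.
    return total / pairs if pairs else 0.0


# Scores that respect the label order with enough margin incur no loss.
good = pairwise_hinge_loss([3.0, 2.0, 1.0], [3, 2, 1])
# Scores in the reverse order are penalized on every pair.
bad = pairwise_hinge_loss([1.0, 2.0, 3.0], [3, 2, 1])
```

Because the loss depends only on score differences within pairs, it is insensitive to the absolute scale of the measured affinities, which is the kind of tolerance to measurement noise the abstract refers to.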
Overcoming Poor Word Embeddings with Word Definitions Modern natural language understanding models depend on pretrained subword embeddings, but applications may need to reason about words that were never or rarely seen during pretraining. We show that examples that depend critically on a rarer word are more challenging for natural language inference models. Then we explore how a model could learn to use definitions, provided in natural text, to overcome this handicap. Our model’s understanding of a definition is usually weaker than a well-modeled word embedding, but it recovers most of the performance gap from using a completely untrained word.
A Multi-Scale Conditional Deep Model for Tumor Cell Ratio Counting We propose a method to accurately obtain the ratio of tumor cells over an entire histological slide. We use deep fully convolutional neural network models trained to detect and classify cells on images of H&E-stained tissue sections. Pathologists’ labels consisting of exhaustive nuclei locations and tumor regions were used to train the model in a supervised fashion. We show that combining two models, each working at a different magnification, allows the system to capture both cell-level details and surrounding context to enable successful detection and classification of cells as either tumor-cell or normal-cell. Indeed, by conditioning the classification of a single cell on multi-scale context information, our models mimic the process used by pathologists, who assess cell neoplasticity and tumor extent at different microscope magnifications. The ratio of tumor cells can then be readily obtained by counting the number of cells in each class. To analyze an entire slide, we split it into multiple tiles that can be processed in parallel. The overall tumor cell ratio can then be aggregated. We perform experiments on a dataset of 100 slides with lung tumor specimens from both resection and tissue micro-array (TMA). We train fully-convolutional models using heavy data augmentation and batch normalization. On an unseen test set, we obtain an average mean absolute error on predicting the tumor cell ratio of less than 6%, which is significantly better than the human average of 20% and is key in properly selecting tissue samples for recent genetic panel tests geared at prescribing targeted cancer drugs. We perform ablation studies to show the importance of training two models at different magnifications and to justify the choice of some parameters, such as the size of the receptive field.
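The tile-wise aggregation described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `TileCounts` type and `tumor_cell_ratio` function are hypothetical names, and the example assumes the slide-level ratio is the pooled tumor count divided by the pooled count of all classified cells, rather than an average of per-tile ratios.

```python
from dataclasses import dataclass


@dataclass
class TileCounts:
    """Per-tile cell counts, as a detector/classifier might report them."""
    tumor: int
    normal: int


def tumor_cell_ratio(tiles):
    """Pool counts over all tiles, then take tumor / (tumor + normal).

    ASSUMPTION: counts are pooled before dividing, so tiles with many
    cells weigh more than sparse tiles. Returns None for an empty slide.
    """
    tumor = sum(t.tumor for t in tiles)
    normal = sum(t.normal for t in tiles)
    total = tumor + normal
    if total == 0:
        return None
    return tumor / total


# Tiles can be processed independently (e.g. in parallel) and merged later.
tiles = [TileCounts(30, 70), TileCounts(0, 0), TileCounts(10, 90)]
ratio = tumor_cell_ratio(tiles)  # pooled: 40 tumor cells out of 200
```

Pooling counts makes empty tiles (background, no nuclei) harmless, whereas averaging per-tile ratios would require special handling for tiles with zero cells.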
Improving neural network robustness through neighborhood preserving layers One major source of vulnerability of neural nets in classification tasks is from overparameterized fully connected layers near the end of the network. In this paper, we propose a new neighborhood preserving layer which can replace these fully connected layers to improve the network robustness. Networks including these neighborhood preserving layers can be trained efficiently. We theoretically prove that our proposed layers are more robust against distortion because they effectively control the magnitude of gradients. Finally, we empirically show that networks with our proposed layers are more robust against state-of-the-art gradient descent-based attacks, such as a PGD attack on the benchmark image classification datasets MNIST and CIFAR10.
Prediction of Early Recurrence of Hepatocellular Carcinoma after Resection using Digital Pathology Images Assessed by Machine Learning Hepatocellular carcinoma (HCC) is a representative primary liver cancer caused by long-term and repetitive liver injury. Surgical resection is generally selected as the radical cure treatment. Because the early recurrence of HCC after resection is associated with low overall survival, the prediction of recurrence after resection is clinically important. However, the pathological characteristics of the early recurrence of HCC have not yet been elucidated. We attempted to predict the early recurrence of HCC after resection based on digital pathologic images of hematoxylin and eosin-stained specimens and machine learning using a support vector machine (SVM). A total of 158 HCC patients meeting the Milan criteria who underwent surgical resection were included in this study. The patients were categorized into three groups: Group I, patients with HCC recurrence within 1 year after resection (16 for training and 23 for testing); Group II, patients with HCC recurrence between 1 and 2 years after resection (22 and 28); and Group III, patients with no HCC recurrence within 4 years after resection (31 and 38). The SVM-based prediction method separated the three groups with 89.9% (80/89) accuracy. Prediction of Group I was consistent for all cases, while Group II was predicted to be Group III in one case, and Group III was predicted to be Group II in 8 cases. Digital pathology combined with machine learning could be used for highly accurate prediction of HCC recurrence after surgical resection, especially early recurrence. Currently, in most cases after HCC resection, regular blood tests and diagnostic imaging are used for follow-up observation; however, digital pathology coupled with machine learning offers potential as a method for objective postoperative follow-up observation.
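As a quick arithmetic check, the reported 80/89 accuracy can be reproduced from the test-set counts given above. The confusion matrix below is reconstructed from the abstract, under the assumption that every cell not explicitly mentioned (for example, Group II or III predicted as Group I) is zero.

```python
# Rows: true group, columns: predicted group, in the order (I, II, III).
# ASSUMPTION: cells the abstract does not mention are zero.
confusion = [
    [23, 0, 0],   # Group I: all 23 test cases predicted as Group I
    [0, 27, 1],   # Group II: 1 of 28 predicted as Group III
    [0, 8, 30],   # Group III: 8 of 38 predicted as Group II
]


def accuracy(matrix):
    """Return (correct, total, correct / total) for a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct, total, correct / total


correct, total, acc = accuracy(confusion)
print(f"{correct}/{total} = {acc:.1%}")  # prints "80/89 = 89.9%"
```

The row sums (23, 28, 38) match the stated test-set sizes, and the nine off-diagonal cases account for the gap between 89 and 80.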
Model-Based Autoencoders for Imputing Discrete single-cell RNA-seq Data Deep neural networks have been widely applied for missing data imputation. However, most existing studies have focused on imputing continuous data, while discrete data imputation is under-explored. Discrete data is common in the real world, especially in research areas of bioinformatics, genetics, and biochemistry. In particular, large amounts of recent genomic data are discrete count data generated from single-cell RNA sequencing (scRNA-seq) technology. Most scRNA-seq studies produce a discrete matrix with prevailing ‘false’ zero count observations (missing values). To make downstream analyses more effective, imputation, which recovers the missing values, is often conducted as the first step in pre-processing scRNA-seq data. In this paper, we propose a novel Zero-Inflated Negative Binomial (ZINB) model-based autoencoder for imputing discrete scRNA-seq data. The novelties of our method are twofold. First, in addition to optimizing the ZINB likelihood, we propose to explicitly model the dropout events that cause missing values by using the Gumbel-Softmax distribution. Second, the zero-inflated reconstruction is further optimized with respect to the raw count matrix. Extensive experiments on simulated datasets demonstrate that the zero-inflated reconstruction significantly improves imputation accuracy. Real data experiments show that the proposed imputation can enhance the separation of different cell types and improve the accuracy of differential expression analysis.
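For readers unfamiliar with the ZINB likelihood mentioned above, the following sketch evaluates its log-probability for a single count. It assumes the common mean/dispersion parameterization (mean `mu`, dispersion `theta`, zero-inflation probability `pi`); the abstract does not state which parameterization the authors use, and the function names are illustrative. The Gumbel-Softmax dropout modeling is not covered here.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-pmf of a negative binomial with mean mu > 0 and dispersion theta > 0."""
    return (math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
            + theta * math.log(theta / (theta + mu))
            + x * math.log(mu / (theta + mu)))


def zinb_log_pmf(x, mu, theta, pi):
    """Log-pmf of a zero-inflated negative binomial, with 0 <= pi < 1.

    With probability pi the count is a structural ("dropout") zero;
    otherwise it is drawn from NB(mu, theta).
    """
    if x == 0:
        # A zero can come from dropout or from the NB component itself.
        return math.log(pi + (1.0 - pi) * math.exp(nb_log_pmf(0, mu, theta)))
    return math.log(1.0 - pi) + nb_log_pmf(x, mu, theta)


# The negative log-likelihood of a count vector is the sum over entries.
counts = [0, 0, 3, 1, 0, 7]
nll = -sum(zinb_log_pmf(c, mu=2.0, theta=1.5, pi=0.3) for c in counts)
```

Setting `pi = 0` recovers the plain negative binomial, which is a convenient sanity check when fitting.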
Tripping through time: Efficient Localization of Activities in Videos Localizing moments in untrimmed videos via language queries is a new and interesting task that requires the ability to accurately ground language into video. Previous works have approached this task by processing the entire video, often more than once, to localize relevant activities. In real-world applications of this approach, such as video surveillance, efficiency is a key system requirement. In this paper, we present TripNet, an end-to-end system that uses a gated attention architecture to model fine-grained textual and visual representations in order to align text and video content. Furthermore, TripNet uses reinforcement learning to efficiently localize relevant activity clips in long videos by learning how to intelligently skip around the video. It extracts visual features for only a few frames to perform activity classification. In our evaluation on the Charades-STA, ActivityNet Captions, and TACoS datasets, we find that TripNet achieves high accuracy and saves processing time by only looking at 32-41% of the entire video.
Improving Disentangled Text Representation Learning with Information Theoretical Guidance Learning disentangled representations of natural language is essential for many NLP tasks, such as conditional text generation, style transfer, and personalized dialogue systems. Similar problems have been studied extensively for other forms of data, such as images and videos. However, the discrete nature of natural language makes the disentangling of textual representations more challenging (e.g., the manipulation over the data space cannot be easily achieved). Inspired by information theory, we propose a novel method that effectively manifests disentangled representations of text, without any supervision on semantics. A new mutual information upper bound is derived and leveraged to measure dependence between style and content. By minimizing this upper bound, the proposed method induces style and content embeddings into two independent low-dimensional spaces. Experiments on both conditional text generation and text-style transfer demonstrate the high quality of our disentangled representation in terms of content and style preservation.
S3VAE: Self-Supervised Sequential VAE for Representation Disentanglement and Data Generation We propose a sequential variational autoencoder to learn disentangled representations of sequential data (e.g., video and audio) under self-supervision. Specifically, we exploit the benefits of some readily accessible supervision signals from the input data itself or from off-the-shelf functional models, and accordingly design auxiliary tasks for our model to utilize these signals. With the supervision of the signals, our model can easily disentangle the representation of an input sequence into static factors and dynamic factors (i.e., time-invariant and time-varying parts). Comprehensive experiments across video and audio verify the effectiveness of our model on representation disentanglement and generation of sequential data, and demonstrate that our model with self-supervision performs comparably to, if not better than, the fully-supervised model with ground truth labels, and outperforms state-of-the-art unsupervised models by a large margin.