Amazon is one of the world’s largest technology companies, transforming commerce, logistics, cloud computing, and AI. With a mission centered on customer obsession and innovation, Amazon continues to redefine global consumer experiences. NEC Labs America collaborates with Amazon on large-scale machine learning, recommendation systems, and multimodal modeling. Our joint publications focus on personalized AI and scalable inference for consumer-facing platforms. Please read about our latest news and collaborative publications with Amazon.

Posts

TSLA: Unified Time Series and Language Model

Real-world time series data often require analysis or interpretation from domain experts. Some tasks, like time series question answering, involve both time series and natural language questions, posing challenges for single-modality language models to understand their interaction. To this end, we present TSLA (Time Series Language Model), a framework designed to enhance the language model with the understanding of time series data for multi-modality tasks. TSLA comprises three key components. (1) Time Series Tokenizer learns how to represent time series data into discrete tokens, making it more manageable for language models. (2) Joint (Pre-)Training on task-agnostic time series and text data integrates time series tokens and text tokens to model the interplay between time series and language concepts. (3) Multi-task Instruction Tuning fine-tunes the pretrained TSLA for various downstream tasks relevant to user interests. For evaluation, we applied TSLA to time series data from human motions on four tasks: time series captioning, time series question answering, text-based time series synthesis, and text-based time series continuation. The results demonstrate TSLA’s effectiveness in handling multiple time series analysis tasks, pointing the way for future research endeavors.

Attribute-Centric Compositional Text-to-Image Generation

Despite the recent impressive breakthroughs in text-to-image generation, generative models have difficulty in capturing thedata distribution of underrepresented attribute compositions while over-memorizing overrepresented attribute compositions,which raises public concerns about their robustness and fairness. To tackle this challenge, we propose ACTIG, an attributecentriccompositional text-to-image generation framework. We present an attribute-centric feature augmentation and a novelimage-free training scheme, which greatly improves model’s ability to generate images with underrepresented attributes.Wefurther propose an attribute-centric contrastive loss to avoid overfitting to overrepresented attribute compositions.We validateour framework on the CelebA-HQ and CUB datasets. Extensive experiments show that the compositional generalization ofACTIG is outstanding, and our framework outperforms previous works in terms of image quality and text-image consistency

Dynamic Causal Discovery in Imitation Learning

Imitation learning, which learns agent policy by mimicking expert demonstration, has shown promising results in many applications such as medical treatment regimes and self-driving vehicles. However, it remains a difficult task to interpret control policies learned by the agent. Difficulties mainly come from two aspects: 1) agents in imitation learning are usually implemented as deep neural networks, which are black-box models and lack interpretability; 2) the latent causal mechanism behind agents’ decisions may vary along the trajectory, rather than staying static throughout time steps. To increase transparency and offer better interpretability of the neural agent, we propose to expose its captured knowledge in the form of a directed acyclic causal graph, with nodes being action and state variables and edges denoting the causal relations behind predictions. Furthermore, we design this causal discovery process to be state-dependent, enabling it to model the dynamics in latent causal graphs. Concretely, we conduct causal discovery from the perspective of Granger causality and propose a self-explainable imitation learning framework, CAIL. The proposed framework is composed of three parts: a dynamic causal discovery module, a causality encoding module, and a prediction module, and is trained in an end-to-end manner. After the model is learned, we can obtain causal relations among states and action variables behind its decisions, exposing policies learned by it. Experimental results on both synthetic and real-world datasets demonstrate the effectiveness of the proposed CAIL in learning the dynamic causal graphs for understanding the decision-making of imitation learning meanwhilemaintaining high prediction accuracy.

Retrieval, Analogy, and Composition: A framework for Compositional Generalization in Image Captioning

Image captioning systems are expected to have the ability to combine individual concepts when describing scenes with concept combinations that are not observed during training. In spite of significant progress in image captioning with the help of the autoregressive generation framework, current approaches fail to generalize well to novel concept combinations. We propose a new framework that revolves around probing several similar image caption training instances (retrieval), performing analogical reasoning over relevant entities in retrieved prototypes (analogy), and enhancing the generation process with reasoning outcomes (composition). Our method augments the generation model by referring to the neighboring instances in the training set to produce novel concept combinations in generated captions. We perform experiments on the widely used image captioning benchmarks. The proposed models achieve substantial improvement over the compared baselines on both composition-related evaluation metrics and conventional image captioning metrics.

Disentangled Recurrent Wasserstein Auto-Encoder

Learning disentangled representations leads to interpretable models and facilitates data generation with style transfer, which has been extensively studied on static data such as images in an unsupervised learning framework. However, only a few works have explored unsupervised disentangled sequential representation learning due to challenges of generating sequential data. In this paper, we propose recurrent Wasserstein Autoencoder (R-WAE), a new framework for generative modeling of sequential data. R-WAE disentangles the representation of an input sequence into static and dynamic factors (i.e., time-invariant and time-varying parts). Our theoretical analysis shows that, R-WAE minimizes an upper bound of a penalized form of the Wasserstein distance between model distribution and sequential data distribution, and simultaneously maximizes the mutual information between input data and different disentangled latent factors, respectively. This is superior to (recurrent) VAE which does not explicitly enforce mutual information maximization between input data and disentangled latent representations. When the number of actions in sequential data is available as weak supervision information, R-WAE is extended to learn a categorical latent representation of actions to improve its disentanglement. Experiments on a variety of datasets show that our models outperform other baselines with the same settings in terms of disentanglement and unconditional video generation both quantitatively and qualitatively.

Behavior-based Community Detection: Application to Host Assessment in Enterprise Information Networks

Behavior-based Community Detection: Application to Host Assessment in Enterprise Information Networks Community detection in complex networks is a fundamental problem that attracts much attention across various disciplines. Previous studies have been mostly focusing on external connections between nodes (i.e., topology structure) in the network whereas largely ignoring internal intricacies (i.e., local behavior) of each node. A pair of nodes without any interaction can still share similar internal behaviors. For example, in an enterprise information network, compromised computers controlled by the same intruder often demonstrate similar abnormal behaviors even if they do not connect with each other. In this paper, we study the problem of community detection in enterprise information networks, where large-scale internal events and external events coexist on each host. The discovered host communities, capturing behavioral affinity, can benefit many comparative analysis tasks such as host anomaly assessment. In particular, we propose a novel community detection framework to identify behavior-based host communities in enterprise information networks, purely based on large-scale heterogeneous event data. We continue proposing an efficient method for assessing host’s anomaly level by leveraging the detected host communities. Experimental results on enterprise networks demonstrate the effectiveness of our model.