Data Science and System Security

Read our publications from our Data Science & System Security researchers who aim to build novel big-data solutions and service platforms to simplify complex systems management. We develop new information technology that supports innovative applications, from big data analytics to the Internet of Things. Our experimental and theoretical research includes many data science and systems research domains including time series mining, deep learning, NLP and large language models, graph mining, signal processing, and cloud computing.

Posts

Towards Robust Graph Neural Networks via Adversarial Contrastive Learning

Graph Neural Network (GNN), as a powerful representation learning model on graph data, attracts much attention across various disciplines. However, recent studies show that GNN is vulnerable to adversarial attacks. How to make GNN more robust? What are the key vulnerabilities in GNN? How to address the vulnerabilities and defend GNN against the adversarial attacks? Adversarial training has shown to be effective in improving the robustness of traditional Deep Neural Networks (DNNs). However, existing adversarial training works mainly focus on the image data, which consists of continuous features, while the features and structures of graph data are often discrete. Moreover, rather than assuming each sample is independent and identically distributed as in DNN, GNN leverages the contextual information across the graph (e.g., neighborhoods of a node). Thus, existing adversarial training techniques cannot be directly applied to defend GNN. In this paper, we propose ContrastNet, an effective adversarial defense framework for GNN. In particular, we propose an adversarial contrastive learning method to train the GNN over the adversarial space. To further improve the robustness of GNN, we investigate the latent vulnerabilities in every component of a GNN encoder and propose corresponding refining strategies. Extensive experiments on three public datasets demonstrate the effectiveness of ContrastNet in improving the robustness of popular GNN variants, such as Graph Convolutional Network and GraphSage, under various types of adversarial attacks.

DeepGAR: Deep Graph Learning for Analogical Reasoning

Analogical reasoning is the process of discovering and mapping correspondences from a target subject to a base subject. As the most well-known computational method of analogical reasoning, Structure-Mapping Theory (SMT) abstracts both target and base subjects into relational graphs and forms the cognitive process of analogical reasoning by finding a corresponding subgraph (i.e., correspondence) in the target graph that is aligned with the base graph. However, incorporating deep learning for SMT is still under-explored due to several obstacles: 1) the combinatorial complexity of searching for the correspondence in the target graph, 2) the correspondence mining is restricted by various cognitive theory-driven constraints. To address both challenges, we propose a novel framework for Analogical Reasoning (DeepGAR) that identifies the correspondence between source and target domains by assuring cognitive theory-driven constraints. Specifically, we design a geometric constraint embedding space to induce subgraph relation from node embeddings for efficient subgraph search. Furthermore, we develop novel learning and optimization strategies that could end-to-end identify correspondences that are strictly consistent with constraints driven by the cognitive theory. Extensive experiments are conducted on synthetic and real-world datasets to demonstrate the effectiveness of the proposed DeepGAR over existing methods. The code and data are available at: https://github.com/triplej0079/DeepGAR.

Personalized Federated Learning via Heterogeneous Modular Networks

Personalized Federated Learning (PFL) which collaboratively trains a federated model while considering local clients under privacy constraints has attracted much attention. Despite its popularity, it has been observed that existing PFL approaches result in sub-optimal solutions when the joint distribution among local clients diverges. To address this issue, we present Federated Modular Network (FedMN), a novel PFL approach that adaptively selects sub-modules from a module pool to assemble heterogeneous neural architectures for different clients. FedMN adopts a light-weighted routing hypernetwork to model the joint distribution on each client and produce the personalized selection of the module blocks for each client. To reduce the communication burden in existing FL, we develop an efficient way to interact between the clients and the server. We conduct extensive experiments on the real-world test beds and the results show both effectiveness and efficiency of the proposed FedMN over the baselines.

Multi-Faceted Knowledge-Driven Pre-training for Product Representation Learning

As a key component of e-commerce computing, product representation learning (PRL) provides benefits for a variety of applications, including product matching, search, and categorization. The existing PRL approaches have poor language understanding ability due to their inability to capture contextualized semantics. In addition, the learned representations by existing methods are not easily transferable to new products. Inspired by the recent advance of pre-trained language models (PLMs), we make the attempt to adapt PLMs for PRL to mitigate the above issues. In this article, we develop KINDLE, a Knowledge-drIven pre-trainiNg framework for proDuct representation LEarning, which can preserve the contextual semantics and multi-faceted product knowledge robustly and flexibly. Specifically, we first extend traditional one-stage pre-training to a two-stage pre-training framework and exploit a deliberate knowledge encoder to ensure a smooth knowledge fusion into PLM. In addition, we propose a multi-objective heterogeneous embedding method to represent thousands of knowledge elements. This helps KINDLE calibrate knowledge noise and sparsity automatically by replacing isolated classes as training targets in knowledge acquisition tasks. Furthermore, an input-aware gating network is proposed to select the most relevant knowledge for different downstream tasks. Finally, extensive experiments have demonstrated the advantages of KINDLE over the state-of-the-art baselines across three downstream tasks.

Explainable Anomaly Detection System for Categorical Sensor Data in Internet of Things

Internet of things (IoT) applications deploy massive number of sensors to monitor the system and environment. Anomaly detection on streaming sensor data is an important task for IoT maintenance and operation. However, there are two major challenges for anomaly detection in real IoT applications: (1) many sensors report categorical values rather than numerical readings, (2) the end users may not understand the detection results, they require additional knowledge and explanations to make decision and take action. Unfortunately, most existing solutions cannot satisfy such requirements. To bridge the gap, we design and develop an eXplainable Anomaly Detection System (XADS) for categorical sensor data. XADS trains models from historical normal data and conducts online monitoring. XADS detects the anomalies in an explainable way: the system not only reports anomalies’ time periods, types, and detailed information, but also provides explanations on why they are abnormal, and what the normal data look like. Such information significantly helps the decision making for users. Moreover, XADS requires limited parameter setting in advance, yields high accuracy on detection results and comes with a user-friendly interface, making it an efficient and effective tool to monitor a wide variety of IoT applications.

Multi-source Inductive Knowledge Graph Transfer

Multi-source Inductive Knowledge Graph Transfer Large-scale information systems, such as knowledge graphs (KGs), enterprise system networks, often exhibit dynamic and complex activities. Recent research has shown that formalizing these information systems as graphs can effectively characterize the entities (nodes) and their relationships (edges). Transferring knowledge from existing well-curated source graphs can help construct the target graph of newly-deployed systems faster and better which no doubt will benefit downstream tasks such as link prediction and anomaly detection for new systems. However, current graph transferring methods are either based on a single source, which does not sufficiently consider multiple available sources, or not selectively learns from these sources. In this paper, we propose MSGT-GNN, a graph knowledge transfer model for efficient graph link prediction from multiple source graphs. MSGT-GNN consists of two components: the Intra-Graph Encoder, which embeds latent graph features of system entities into vectors, and the graph transferor, which utilizes graph attention mechanism to learn and optimize the embeddings of corresponding entities from multiple source graphs, in both node level and graph level. Experimental results on multiple real-world datasets from various domains show that MSGT-GNN outperforms other baseline approaches in the link prediction and demonstrate the merit of attentive graph knowledge transfer and the effectiveness of MSGT-GNN.

3D Histogram-Based Anomaly Detection for Categorical Sensor Data in Internet of Things

The applications of Internet-of-things (IoT) deploy a massive number of sensors to monitor the system and environment. Anomaly detection on streaming sensor data is an important task for IoT maintenance and operation. In real IoT applications, many sensors report categorical values rather than numerical readings. Unfortunately, most existing anomaly detection methods are designed only for numerical sensor data. They cannot be used to monitor the categorical sensor data. In this study, we design and develop a 3D Histogram-based Categorical Anomaly Detection (HCAD) solution to monitor categorical sensor data in IoT. HCAD constructs the histogram model by three dimensions: categorical value, event duration, and frequency. The histogram models are used to profile normal working states of IoT devices. HCAD automatically determines the range of normal data and anomaly threshold. It only requires very limited parameter setting and can be applied to a wide variety of different IoT devices. We implement HCAD and integrate it into an online monitoring system. We test the proposed solution on real IoT datasets such as telemetry data from satellite sensors, air quality data from chemical sensors, and transportation data from traffic sensors. The results of extensive experiments show that HCAD achieves higher detecting accuracy and efficiency than state-of-the-art methods.

CAT: Beyond Efficient Transformer for Content-Aware Anomaly Detection in Event Sequences

It is critical and important to detect anomalies in event sequences, which becomes widely available in many application domains. Indeed, various efforts have been made to capture abnormal patterns from event sequences through sequential pattern analysis or event representation learning. However, existing approaches usually ignore the semantic information of event content. To this end, in this paper, we propose a self-attentive encoder-decoder transformer framework, Content-Aware Transformer CAT, for anomaly detection in event sequences. In CAT, the encoder learns preamble event sequence representations with content awareness, and the decoder embeds sequences under detection into a latent space, where anomalies are distinguishable. Specifically, the event content is first fed to a content-awareness layer, generating representations of each event. The encoder accepts preamble event representation sequence, generating feature maps. In the decoder, an additional token is added at the beginning of the sequence under detection, denoting the sequence status. A one-class objective together with sequence reconstruction loss is collectively applied to train our framework under the label efficiency scheme. Furthermore, CAT is optimized under a scalable and efficient setting. Finally, extensive experiments on three real-world datasets demonstrate the superiority of CAT.

Towards Learning Disentangled Representations for Time Series

Promising progress has been made toward learning efficient time series representations in recent years, but the learned representations often lack interpretability and do not encode semantic meanings by the complex interactions of many latent factors. Learning representations that disentangle these latent factors can bring semantic-rich representations of time series and further enhance interpretability. However, directly adopting the sequential models, such as Long Short-Term Memory Variational AutoEncoder (LSTM-VAE), would encounter a Kullback?Leibler (KL) vanishing problem: the LSTM decoder often generates sequential data without efficiently using latent representations, and the latent spaces sometimes could even be independent of the observation space. And traditional disentanglement methods may intensify the trend of KL vanishing along with the disentanglement process, because they tend to penalize the mutual information between the latent space and the observations. In this paper, we propose Disentangle Time-Series, a novel disentanglement enhancement framework for time series data. Our framework achieves multi-level disentanglement by covering both individual latent factors and group semantic segments. We propose augmenting the original VAE objective by decomposing the evidence lower-bound and extracting evidence linking factorial representations to disentanglement. Additionally, we introduce a mutual information maximization term between the observation space to the latent space to alleviate the KL vanishing problem while preserving the disentanglement property. Experimental results on five real-world IoT datasets demonstrate that the representations learned by DTS achieve superior performance in various tasks with better interpretability.

SEED: Sound Event Early Detection via Evidential Uncertainty

Sound Event Early Detection (SEED) is an essential task in recognizing the acoustic environments and soundscapes. However, most of the existing methods focus on the offline sound event detection, which suffers from the over-confidence issue of early-stage event detection and usually yield unreliable results. To solve the problem, we propose a novel Polyphonic Evidential Neural Network (PENet) to model the evidential uncertainty of the class probability with Beta distribution. Specifically, we use a Beta distribution to model the distribution of class probabilities, and the evidential uncertainty enriches uncertainty representation with evidence information, which plays a central role in reliable prediction. To further improve the event detection performance, we design the backtrack inference method that utilizes both the forward and backward audio features of an ongoing event. Experiments on the DESED database show that the proposed method can simultaneously improve 13.0% and 3.8% in time delay and detection F1 score compared to the state-of-the-art methods.