Data Science and System Security

Read publications from our Data Science & System Security researchers, who aim to build novel big-data solutions and service platforms that simplify complex systems management. We develop new information technology that supports innovative applications, from big data analytics to the Internet of Things. Our experimental and theoretical research spans many data science and systems domains, including time series mining, deep learning, NLP and large language models, graph mining, signal processing, and cloud computing.

Posts

Learning Robust Representations with Graph Denoising Policy Network

Existing representation learning methods based on graph neural networks and their variants rely on the aggregation of neighborhood information, which makes them sensitive to noise in the graph, e.g., erroneous links between nodes and incorrect or missing node features. In this paper, we propose the Graph Denoising Policy Network (GDPNet for short) to learn robust representations from noisy graph data through reinforcement learning. GDPNet first selects signal neighborhoods for each node, and then aggregates information from the selected neighborhoods to learn node representations for downstream tasks. Specifically, in the signal neighborhood selection phase, GDPNet optimizes the neighborhood for each target node by formulating the removal of noisy neighborhoods as a Markov decision process and learning a policy with task-specific rewards received from the representation learning phase. In the representation learning phase, GDPNet aggregates features from signal neighbors to generate node representations for downstream tasks, and provides task-specific rewards to the signal neighbor selection phase. These two phases are jointly trained to select optimal sets of neighbors for target nodes with maximum cumulative task-specific rewards, and to learn robust representations for nodes. Experimental results on the node classification task demonstrate the effectiveness of GDPNet, which outperforms state-of-the-art graph representation learning methods on several well-studied datasets.
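To make the two-phase design concrete, here is a minimal sketch, assuming a PyTorch setup and toy shapes: a small policy network scores each neighbor, sampled keep/drop actions define the selected "signal" set, the kept neighbors are mean-aggregated for a downstream classifier, and the negative task loss is fed back as a REINFORCE-style reward. The module names, dimensions, and aggregation rule are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only (not the authors' code): one GDPNet-style training step.
import torch
import torch.nn as nn

class NeighborPolicy(nn.Module):
    """Scores (node, neighbor) feature pairs and emits keep/drop probabilities."""
    def __init__(self, feat_dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, node_feat, neigh_feats):
        pairs = torch.cat([node_feat.expand_as(neigh_feats), neigh_feats], dim=-1)
        return self.net(pairs).squeeze(-1)        # keep-probability per neighbor

feat_dim, n_neighbors, n_classes = 16, 8, 3
policy = NeighborPolicy(feat_dim)
classifier = nn.Linear(feat_dim, n_classes)       # downstream task head
opt = torch.optim.Adam(list(policy.parameters()) + list(classifier.parameters()), lr=1e-3)

node_feat = torch.randn(1, feat_dim)
neigh_feats = torch.randn(n_neighbors, feat_dim)
label = torch.tensor([1])

# Phase 1: sample keep/drop actions for each neighbor from the policy.
probs = policy(node_feat, neigh_feats)
actions = torch.bernoulli(probs).detach()
log_prob = (actions * torch.log(probs + 1e-8)
            + (1 - actions) * torch.log(1 - probs + 1e-8)).sum()

# Phase 2: aggregate only the selected ("signal") neighbors and classify the node.
mask = actions.unsqueeze(-1)
agg = (mask * neigh_feats).sum(0, keepdim=True) / mask.sum().clamp(min=1.0)
loss = nn.functional.cross_entropy(classifier(node_feat + agg), label)

# Joint update: the task loss trains the classifier, and its negative serves as the
# task-specific reward that trains the selection policy (REINFORCE).
reward = -loss.detach()
(loss - reward * log_prob).backward()
opt.step()
```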

Self-Attentive Attributed Network Embedding Through Adversarial Learning

Network embedding aims to learn low-dimensional representations/embeddings of vertices that preserve the structure and inherent properties of the networks. The resultant embeddings are beneficial to downstream tasks such as vertex classification and link prediction. The vast majority of real-world networks are coupled with a rich set of vertex attributes, which could be potentially complementary in learning better embeddings. Existing attributed network embedding models, with shallow or deep architectures, typically seek to match the representations in topology space and attribute space for each individual vertex by assuming that the samples from the two spaces are drawn uniformly. This assumption, however, can hardly be guaranteed in practice. Due to the intrinsic sparsity of sampled vertex sequences and the incompleteness of vertex attributes, the discrepancy between the attribute space and the network topology space inevitably exists. Furthermore, the interactions among vertex attributes, a.k.a. cross features, have been largely ignored by existing approaches. To address the above issues, in this paper we propose Nettention, a self-attentive network embedding approach that can efficiently learn vertex embeddings on attributed networks. Instead of sample-wise optimization, Nettention aggregates the two types of information by minimizing the difference between the representation distributions in the low-dimensional topology and attribute spaces. The joint inference is encapsulated in a generative adversarial training process, yielding better generalization performance and robustness. The learned distributions consider both locality-preserving and global reconstruction constraints, which are inferred through the learning of adversarially regularized autoencoders. Additionally, a multi-head self-attention module is developed to explicitly model the attribute interactions. Extensive experiments on benchmark datasets have verified the effectiveness of the proposed Nettention model on a variety of tasks, including vertex classification and link prediction.
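The distribution-level matching can be pictured with a short sketch, assuming PyTorch and toy dimensions: vertex attributes are split into fields so multi-head self-attention can model cross-feature interactions, and a discriminator learns to tell topology-space embeddings from attribute-space embeddings while the encoders learn to fool it. The reconstruction terms of the adversarially regularized autoencoders are only noted in comments; every name and shape here is an assumption, not the published Nettention implementation.

```python
# Minimal sketch of distribution-level adversarial matching of two embedding spaces.
import torch
import torch.nn as nn

n_vertices, emb_dim = 128, 32
n_fields, field_dim = 8, 16                     # attributes viewed as 8 fields of width 16

topo_encoder = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.ReLU(), nn.Linear(emb_dim, emb_dim))
attr_attention = nn.MultiheadAttention(embed_dim=field_dim, num_heads=2, batch_first=True)
attr_encoder = nn.Sequential(nn.Linear(n_fields * field_dim, emb_dim), nn.ReLU(), nn.Linear(emb_dim, emb_dim))
discriminator = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.ReLU(), nn.Linear(emb_dim, 1))

topo_input = torch.randn(n_vertices, emb_dim)   # e.g., embeddings derived from sampled vertex sequences
attrs = torch.randn(n_vertices, n_fields, field_dim)

# Multi-head self-attention over attribute fields models cross-feature interactions.
attn_out, _ = attr_attention(attrs, attrs, attrs)
z_attr = attr_encoder(attn_out.flatten(1))
z_topo = topo_encoder(topo_input)

bce = nn.BCEWithLogitsLoss()
real, fake = torch.ones(n_vertices, 1), torch.zeros(n_vertices, 1)

# Discriminator step: separate topology-space samples from attribute-space samples.
d_loss = bce(discriminator(z_topo.detach()), real) + bce(discriminator(z_attr.detach()), fake)

# Encoder step: make the two embedding distributions indistinguishable. In the full
# model, locality-preserving and reconstruction losses from the adversarially
# regularized autoencoders would be added to this objective.
g_loss = bce(discriminator(z_attr), real)
```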

A Query System for Efficiently Investigating Complex Attack Behaviors for Enterprise Security

The need to counter Advanced Persistent Threat (APT) attacks has led to solutions that ubiquitously monitor system activities in each enterprise host and perform timely attack investigation over the monitoring data to uncover the attack sequence. However, existing general-purpose query systems lack explicit language constructs for expressing key properties of major attack behaviors, and their semantics-agnostic design often produces inefficient execution plans for queries. To address these limitations, we build AIQL, a query system designed with novel domain-specific optimizations to enable efficient attack investigation. AIQL provides (1) a domain-specific data model and storage for the massive system monitoring data, (2) a domain-specific query language, the Attack Investigation Query Language (AIQL), that integrates critical primitives for expressing major attack behaviors, and (3) an optimized query engine that exploits the characteristics of the data and the semantics of the query to efficiently schedule execution. We have deployed AIQL at NEC Labs America across 150 hosts. In our demo, we show the complete usage scenario of AIQL by (1) performing an APT attack in a controlled environment and (2) using AIQL to investigate the attack by querying the collected system monitoring data that contains the attack traces. The audience will have the option to perform the APT attack themselves under our guidance, and to interact with the system and investigate the attack by issuing queries and checking the query results through our web UI.

Heterogeneous Graph Matching Networks for Unknown Malware Detection

Information systems have widely been the target of malware attacks. Traditional signature-based malicious program detection algorithms can only detect known malware and are prone to evasion techniques such as binary obfuscation, while behavior-based approaches rely heavily on malware training samples and incur prohibitively high training costs. To address the limitations of existing techniques, we propose MatchGNet, a heterogeneous Graph Matching Network model that learns the graph representation and the similarity metric simultaneously, based on invariant graph modeling of a program’s execution behaviors. We conduct a systematic evaluation of our model and show that it is accurate in detecting malicious program behavior and can help detect malware attacks with fewer false positives. MatchGNet outperforms state-of-the-art algorithms in malware detection, generating 50% fewer false positives while keeping zero false negatives.
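As a rough illustration of graph matching in this setting, the sketch below (assuming PyTorch, simple homogeneous toy graphs, and made-up shapes) embeds two behavior graphs with a shared encoder and scores their similarity under a margin loss; the actual MatchGNet operates on heterogeneous invariant graphs and learns the similarity metric jointly with the representation.

```python
# Hedged sketch (not the MatchGNet release): a Siamese graph encoder with shared
# weights compares an unknown program's behavior graph against known-benign graphs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphEncoder(nn.Module):
    """Two rounds of mean-neighbor message passing followed by graph-level pooling."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hid_dim)
        self.lin2 = nn.Linear(hid_dim, hid_dim)

    def forward(self, x, adj):
        # Row-normalized adjacency approximates mean aggregation over neighbors.
        deg = adj.sum(1, keepdim=True).clamp(min=1.0)
        h = F.relu(self.lin1((adj @ x) / deg))
        h = F.relu(self.lin2((adj @ h) / deg))
        return h.mean(0)                       # graph-level embedding

encoder = GraphEncoder(in_dim=16, hid_dim=32)

def similarity(graph_a, graph_b):
    za = encoder(*graph_a)
    zb = encoder(*graph_b)
    return F.cosine_similarity(za, zb, dim=0)

# Toy graphs as (node features, adjacency matrix). The margin loss pulls the unknown
# graph toward a benign graph of the same program and pushes it away from others.
g_unknown = (torch.randn(10, 16), torch.rand(10, 10).round())
g_benign  = (torch.randn(12, 16), torch.rand(12, 12).round())
g_other   = (torch.randn(11, 16), torch.rand(11, 11).round())
loss = F.relu(0.5 - similarity(g_unknown, g_benign) + similarity(g_unknown, g_other))
```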

Spatio-Temporal Attentive RNN for Node Classification in Temporal Attributed Graphs

Node classification in graph-structured data aims to classify nodes when labels are available for only a subset of them. This problem has attracted considerable research effort in recent years. In real-world applications, both graph topology and node attributes evolve over time. Existing techniques, however, mainly focus on static graphs and lack the capability to simultaneously learn both temporal and spatial/structural features. Node classification in temporal attributed graphs is challenging in two major respects. First, effectively modeling the spatio-temporal contextual information is hard. Second, as the temporal and spatial dimensions are entangled, it is both desirable and challenging, when learning the feature representation of a target node, to differentiate the relative importance of different factors, such as different neighbors and time periods. In this paper, we propose STAR, a spatio-temporal attentive recurrent network model, to deal with these challenges. STAR extracts the vector representation of a node’s neighborhood by sampling and aggregating local neighbor nodes. It then feeds both the neighborhood representation and the node attributes into a gated recurrent unit network to jointly learn the spatio-temporal contextual information. On top of that, we take advantage of a dual attention mechanism to perform a thorough analysis of model interpretability. Extensive experiments on real datasets demonstrate the effectiveness of the STAR model.
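A minimal sketch of the pipeline, under assumptions (PyTorch, toy shapes, a single target node): at each snapshot the sampled neighbors are mean-aggregated, the aggregate is concatenated with the node's own attributes, the resulting sequence runs through a GRU, and attention over time steps pools the states before classification. The published STAR model uses a dual spatial and temporal attention mechanism; only a simple temporal attention is shown here.

```python
# Illustrative sketch only (names and shapes assumed), not the authors' STAR code.
import torch
import torch.nn as nn

T, attr_dim, hid_dim, n_classes = 5, 16, 32, 3

gru = nn.GRU(input_size=2 * attr_dim, hidden_size=hid_dim, batch_first=True)
attn = nn.Linear(hid_dim, 1)
head = nn.Linear(hid_dim, n_classes)

# Per-snapshot inputs for one target node: its attributes and its sampled neighbors.
node_attrs = torch.randn(T, attr_dim)
neighbors = [torch.randn(torch.randint(3, 8, (1,)).item(), attr_dim) for _ in range(T)]

# Spatial step: mean-aggregate the sampled neighbors at each snapshot.
neigh_agg = torch.stack([n.mean(0) for n in neighbors])          # [T, attr_dim]
seq = torch.cat([node_attrs, neigh_agg], dim=-1).unsqueeze(0)    # [1, T, 2*attr_dim]

# Temporal step: GRU over snapshots, then attention-weighted pooling over time.
states, _ = gru(seq)                                             # [1, T, hid_dim]
weights = torch.softmax(attn(states), dim=1)                     # importance of each snapshot
context = (weights * states).sum(dim=1)                          # [1, hid_dim]
logits = head(context)                                           # class scores for the node
```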

Clairvoyant Networks

We use the term clairvoyant to refer to networks that provide on-demand visibility for any flow at any time. Traditionally, network visibility is achieved by instrumenting and passively monitoring all flows in a network. SDN networks, endowed by design with full visibility, offer an alternative to network-wide flow monitoring. Both approaches incur significant capital and operational costs to make networks clairvoyant. In this paper, we argue that we can make any existing network clairvoyant by installing one or more SDN-enabled switches and a specialized controller to support on-demand visibility. We analyze the benefits and costs of such clairvoyant networks and provide a basic design that integrates two existing mechanisms for updating paths through legacy switches with SDN, Telekinesis and Magnet MACs. Our evaluation on a lab testbed and through extensive simulations shows that, even with a single SDN-enabled switch, operators can make any flow visible for monitoring within milliseconds, albeit at a 38% average increase in path length. With as few as 2% of legacy switches, strategically chosen, replaced by SDN switches, clairvoyant networks achieve on-demand flow visibility with negligible overhead.

Attentional Heterogeneous Graph Neural Network: Application to Program Reidentification

A program or process is an integral part of almost every IT/OT system. Can we trust the identity/ID (e.g., the executable name) of a program? To avoid detection, malware may disguise itself using the ID of a legitimate program, and a system tool (e.g., PowerShell) used by attackers may carry the fake ID of another, less sensitive, common software package. However, existing intrusion detection techniques often overlook this critical program reidentification problem (i.e., checking a program’s identity). In this paper, we propose an attentional heterogeneous graph neural network model (DeepHGNN) to verify a program’s identity based on its system behaviors. The key idea is to leverage representation learning over the heterogeneous program behavior graph to guide the reidentification process. We formulate program reidentification as a graph classification problem and develop an effective attentional heterogeneous graph embedding algorithm to solve it. Extensive experiments, using real-world enterprise monitoring data and real attacks, demonstrate the effectiveness of DeepHGNN across multiple popular metrics and its robustness to normal dynamic changes such as program version upgrades.
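A minimal sketch under assumptions (not the authors' DeepHGNN code) of how an attentional heterogeneous graph embedding can drive reidentification: nodes of different types (e.g., process, file, socket) are projected with type-specific layers, the per-type summaries are combined with learned attention weights, and the pooled graph embedding is classified against the program ID the process claims to have. Node types, feature sizes, and the omission of typed edges are simplifications.

```python
# Hedged illustration of attentional heterogeneous graph embedding for graph classification.
import torch
import torch.nn as nn

node_types = ["process", "file", "socket"]
in_dims = {"process": 12, "file": 8, "socket": 6}
hid_dim, n_programs = 32, 50

proj = nn.ModuleDict({t: nn.Linear(in_dims[t], hid_dim) for t in node_types})
type_attn = nn.Linear(hid_dim, 1)        # attention over node-type summaries
classifier = nn.Linear(hid_dim, n_programs)

# Toy behavior graph: a bag of typed nodes (edges omitted for brevity; the real model
# also aggregates over typed edges between these entities).
graph = {"process": torch.randn(4, 12), "file": torch.randn(7, 8), "socket": torch.randn(3, 6)}

# Per-type summaries, then attention-weighted combination across types.
summaries = torch.stack([proj[t](graph[t]).relu().mean(0) for t in node_types])  # [3, hid_dim]
alpha = torch.softmax(type_attn(summaries), dim=0)                               # [3, 1]
graph_emb = (alpha * summaries).sum(0)                                           # [hid_dim]

logits = classifier(graph_emb)            # which program does this behavior look like?
predicted_program = logits.argmax().item()
```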

Deep Co-Clustering

Co-clustering partitions instances and features simultaneously by leveraging the duality between them, and it often yields impressive performance improvements over traditional clustering algorithms. Recent developments in learning deep representations have demonstrated their advantage in extracting effective features. However, research on leveraging deep learning frameworks for co-clustering is limited for two reasons: 1) current deep clustering approaches usually decouple feature learning and cluster assignment as two separate steps, which cannot yield task-specific feature representations; 2) existing deep clustering approaches cannot learn representations for instances and features simultaneously. In this paper, we propose a deep learning model for co-clustering called DeepCC. DeepCC utilizes a deep autoencoder for dimension reduction, and employs a variant of the Gaussian Mixture Model (GMM) to infer the cluster assignments. A mutual information loss is proposed to bridge the training of instances and features. DeepCC jointly optimizes the parameters of the deep autoencoder and the mixture model in an end-to-end fashion on both the instance and the feature spaces, which can help the deep autoencoder escape from local optima and the mixture model circumvent the Expectation-Maximization (EM) algorithm. To the best of our knowledge, DeepCC is the first deep learning model for co-clustering. Experimental results on various datasets demonstrate the effectiveness of DeepCC.
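The structure can be sketched as follows, under assumptions (PyTorch, toy dimensions, and a softmax-over-distances stand-in for the GMM): one autoencoder embeds instances (rows of the data matrix), another embeds features (columns), soft cluster assignments come from distances to learnable centroids, and a co-cluster association matrix is formed from the two assignments. The mutual-information bridge and the full training loop are only indicated in comments; this is not the published DeepCC code.

```python
# Hedged sketch of row/column autoencoders with GMM-like soft co-cluster assignments.
import torch
import torch.nn as nn
import torch.nn.functional as F

def autoencoder(in_dim, latent_dim):
    enc = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
    dec = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, in_dim))
    return enc, dec

n_inst, n_feat, latent, k_row, k_col = 100, 40, 10, 4, 3
X = torch.rand(n_inst, n_feat)

inst_enc, inst_dec = autoencoder(n_feat, latent)     # instances are rows of X
feat_enc, feat_dec = autoencoder(n_inst, latent)     # features are columns of X
row_centroids = nn.Parameter(torch.randn(k_row, latent))
col_centroids = nn.Parameter(torch.randn(k_col, latent))

z_rows, z_cols = inst_enc(X), feat_enc(X.t())
recon_loss = F.mse_loss(inst_dec(z_rows), X) + F.mse_loss(feat_dec(z_cols), X.t())

# Soft assignments: the closer an embedding is to a centroid, the higher its
# responsibility (a GMM-like assignment without explicit EM).
row_assign = F.softmax(-torch.cdist(z_rows, row_centroids) ** 2, dim=1)   # [n_inst, k_row]
col_assign = F.softmax(-torch.cdist(z_cols, col_centroids) ** 2, dim=1)   # [n_feat, k_col]

# Co-cluster association: how much mass of X falls into each (row-cluster, col-cluster)
# block; a mutual information loss over a distribution like this one couples the
# instance-side and feature-side training.
joint = row_assign.t() @ X @ col_assign                                    # [k_row, k_col]
joint = joint / joint.sum()
```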

PoLPer: Process-Aware Restriction of Over-Privileged Setuid Calls in Legacy Applications

Setuid system calls enable critical functions such as user authentication and modular privileged components. Such operations must only be executed after careful validation. However, current systems do not perform rigorous checks, allowing the exploitation of privileges through memory corruption vulnerabilities in privileged programs. As a solution, understanding which setuid system calls can be invoked in which process contexts allows precise enforcement of least privilege. We propose a novel, comprehensive method to systematically extract and enforce the least privilege of setuid system calls and prevent their misuse. Our approach learns the required process contexts of setuid system calls along multiple dimensions, namely process hierarchy, call stack, and parameters, in a process-aware way. Every setuid system call is then restricted to its per-process context by our kernel-level context enforcer. Previous approaches without process awareness are too coarse-grained to control setuid system calls, resulting in over-privilege. Our method reduces available privileges even for identical code, depending on whether it is run by a parent or a child process. We present a prototype called PoLPer that systematically discovers only the required setuid system calls and effectively prevents real-world exploits targeting vulnerabilities of the setuid family of system calls in popular desktop and server software, with near-zero overhead.
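The context-based restriction can be illustrated with a small user-space sketch (PoLPer itself enforces contexts inside the kernel): each setuid-family call observed during profiling is recorded together with its process hierarchy, call stack, and arguments, and at enforcement time a call is allowed only if an identical context was learned. The record fields, example executables, and stack symbols below are assumptions for illustration.

```python
# Simplified illustration of per-process context learning and enforcement for setuid calls.
from dataclasses import dataclass

@dataclass(frozen=True)
class SetuidContext:
    syscall: str            # e.g., "setuid", "setresuid"
    process_chain: tuple    # parent->child executable hierarchy
    call_stack: tuple       # symbols/return sites leading to the call
    args: tuple             # requested uid values

# Phase 1: profiling. Contexts observed under benign workloads form the whitelist.
learned = {
    SetuidContext("setuid", ("sshd", "sshd [priv]"),
                  ("do_setusercontext", "permanently_set_uid"), (1000,)),
}

# Phase 2: enforcement. Only previously learned per-process contexts are permitted;
# identical code reached from an unexpected parent or call stack is denied.
def allow(ctx: SetuidContext) -> bool:
    return ctx in learned

attack = SetuidContext("setuid", ("sshd", "sshd [priv]"), ("overflowed_handler",), (0,))
print(allow(attack))   # False: an unexpected call stack requesting uid 0 is blocked
```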

Countering Malicious Processes with Process-DNS Association

Modern malware and cyber attacks depend heavily on DNS services to make their campaigns reliable and difficult to track. Monitoring network DNS activities and blocking suspicious domains have proven to be an effective technique for countering such attacks. However, recent successful campaigns reveal that attackers adapt by using seemingly benign domains and public web storage services to hide malicious activity. Also, the recent support for encrypted DNS queries gives attackers an easier means to hide malicious traffic from network-based DNS monitoring. We propose PDNS, an end-point DNS monitoring system based on DNS sensors deployed at each host in a network, along with a centralized backend analysis server. To detect such attacks, PDNS expands the monitored DNS activity context and examines the process context that triggered the activity. Specifically, each deployed PDNS sensor matches the domain name and IP address related to a DNS query with the process ID, binary signature, loaded DLLs, and code-signing information of the program that initiated it. We evaluate PDNS on a DNS activity dataset collected from 126 enterprise hosts and with data from multiple malware sources. Using machine learning classifiers, including a deep neural network, PDNS outperforms most previous work with high detection accuracy: a true positive rate of 98.55% and a low false positive rate of 0.03%.
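As a rough illustration, the sketch below shows the kind of per-query record a host-level sensor could emit by joining a DNS lookup with the process that issued it, followed by a tiny classifier over a few hand-picked features. The field names, features, and model are assumptions for illustration, not the deployed PDNS system.

```python
# Hedged sketch: join DNS activity with process context on the host, then score it.
import math
import torch
import torch.nn as nn

def sensor_record(domain, resolved_ip, proc):
    """Join a DNS query with the context of the initiating process on the host."""
    return {
        "domain": domain, "ip": resolved_ip,
        "pid": proc["pid"], "binary_sha256": proc["sha256"],
        "loaded_dlls": proc["dlls"], "signed": proc["signed"],
    }

def features(rec):
    domain = rec["domain"]
    entropy = -sum((domain.count(c) / len(domain)) * math.log2(domain.count(c) / len(domain))
                   for c in set(domain))               # high for DGA-like names
    return torch.tensor([
        float(len(domain)),                            # long names are more suspicious
        entropy,
        float(rec["signed"]),                          # unsigned binaries raise risk
        float(len(rec["loaded_dlls"])),
    ])

classifier = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

proc = {"pid": 4242, "sha256": "placeholder-hash", "dlls": ["wininet.dll", "ws2_32.dll"], "signed": False}
rec = sensor_record("x9f3k2q.example-cdn.com", "203.0.113.7", proc)
malicious_score = classifier(features(rec)).item()     # the backend would alert above a threshold
```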