Ding Li works at Peking University.

Posts

Structural Temporal Graph Neural Networks for Anomaly Detection in Dynamic Graphs

Detecting anomalies in dynamic graphs is a vital task, with numerous practical applications in areas such as security, finance, and social media. Existing network embedding based methods have mostly focused on learning good node representations, while largely ignoring the subgraph structural changes related to the target nodes in a given time window. In this paper, we propose StrGNN, an end-to-end structural temporal Graph Neural Network model for detecting anomalous edges in dynamic graphs. In particular, we first extract the h-hop enclosing subgraph centered on the target edge and propose a node labeling function to identify the role of each node in the subgraph. Then, we leverage the graph convolution operation and Sortpooling layer to extract the fixed-size feature from each snapshot/timestamp. Based on the extracted features, we utilize the Gated Recurrent Units to capture the temporal information for anomaly detection. We fully implement StrGNN and deploy it into a real enterprise security system, and it greatly helps detect advanced threats and optimize the incident response. Extensive experiments on six benchmark datasets also demonstrate the effectiveness of StrGNN.
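
The enclosing-subgraph step described above can be illustrated with a minimal sketch. It assumes an undirected graph stored as an adjacency dictionary with comparable node ids, and it uses the pair of hop distances to the two endpoints of the target edge as a stand-in role label. The function names and the labeling scheme are illustrative assumptions, not the labeling function defined in the paper.

```python
from collections import deque


def bfs_distances(adj, source, max_hops):
    """Hop distance from `source` to every node reachable within `max_hops`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nbr in adj.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def enclosing_subgraph(adj, u, v, h):
    """Nodes, edges, and labels of the h-hop subgraph around edge (u, v).

    Assumption: a node's label is (distance to u, distance to v), with None
    when the node is not within h hops of that endpoint.
    """
    du = bfs_distances(adj, u, h)
    dv = bfs_distances(adj, v, h)
    nodes = set(du) | set(dv)
    # Keep each undirected edge once, and only if both ends are in the subgraph.
    edges = {(a, b) for a in nodes for b in adj.get(a, ()) if b in nodes and a < b}
    labels = {n: (du.get(n), dv.get(n)) for n in nodes}
    return nodes, edges, labels
```

In a pipeline like the one the abstract describes, one such labeled subgraph would be extracted per snapshot and passed on to the graph convolution and pooling stages.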

SIGL: Securing Software Installations Through Deep Graph Learning

Many users implicitly assume that software can only be exploited after it is installed. However, recent supply-chain attacks demonstrate that application integrity must be ensured during installation itself. We introduce SIGL, a new tool for detecting malicious behavior during software installation. SIGL collects traces of system call activity, building a data provenance graph that it analyzes using a novel autoencoder architecture with a graph long short-term memory network (graph LSTM) for the encoder and a standard multilayer perceptron for the decoder. SIGL flags suspicious installations as well as the specific installation-time processes that are likely to be malicious. Using a test corpus of 625 malicious installers containing real-world malware, we demonstrate that SIGL has a detection accuracy of 96%, outperforming similar systems from industry and academia by up to 87% in precision and recall and 45% in accuracy. We also demonstrate that SIGL can pinpoint the processes most likely to have triggered malicious behavior, works on different audit platforms and operating systems, and is robust to training data contamination and adversarial attack. It can be used with application-specific models, even in the presence of new software versions, as well as application-agnostic meta-models that encompass a wide range of applications and installers.

This is Why We Can’t Cache Nice Things: Lightning-Fast Threat Hunting using Suspicion-Based Hierarchical Storage

Recent advances in causal analysis can accelerate incident response time, but only after a causal graph of the attack has been constructed. Unfortunately, existing causal graph generation techniques are mainly offline and may take hours or days to respond to investigator queries, creating greater opportunity for attackers to hide their attack footprint, gain persistence, and propagate to other machines. To address that limitation, we present Swift, a threat investigation system that provides high-throughput causality tracking and real-time causal graph generation capabilities. We design an in-memory graph database that enables space-efficient graph storage and online causality tracking with minimal disk operations. We propose a hierarchical storage system that keeps the forensically relevant part of the causal graph in main memory while evicting the rest to disk. To identify the part of the causal graph that is likely to be relevant during the investigation, we design an asynchronous cache eviction policy that calculates the most suspicious part of the causal graph and caches only that part in main memory. We evaluated Swift on a real-world enterprise to demonstrate how our system scales to process typical event loads and how it responds to forensic queries when security alerts occur. Results show that Swift is scalable, modular, and answers forensic queries in real time even when analyzing audit logs containing tens of millions of events.
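
The idea of keeping only the most suspicious part of the graph in memory can be sketched as a two-tier store. The sketch below is a simplified, synchronous illustration: the class name, the numeric suspicion scores, and the dictionary standing in for disk are all assumptions, and it does not reproduce the asynchronous eviction policy of Swift.

```python
class SuspicionCache:
    """Two-tier store that keeps the highest-scoring nodes in memory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = {}  # node_id -> (suspicion score, payload)
        self.disk = {}    # stand-in for slower on-disk storage

    def put(self, node_id, score, payload):
        """Insert or update a node, then evict until within capacity."""
        self.disk.pop(node_id, None)
        self.memory[node_id] = (score, payload)
        while len(self.memory) > self.capacity:
            # Evict the least suspicious in-memory node to the disk tier.
            victim = min(self.memory, key=lambda n: self.memory[n][0])
            self.disk[victim] = self.memory.pop(victim)

    def get(self, node_id):
        """Return (payload, tier) so callers can see where the hit came from."""
        if node_id in self.memory:
            return self.memory[node_id][1], "memory"
        if node_id in self.disk:
            return self.disk[node_id][1], "disk"
        raise KeyError(node_id)
```

A linear scan for the eviction victim keeps the sketch short; a heap or ordered index would be the natural choice at realistic sizes.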

APTrace: A Responsive System for Agile Enterprise Level Causality Analysis

While backtracking analysis has been successful in assisting the investigation of complex security attacks, it faces a critical dependency explosion problem. To address this problem, security analysts currently need to tune backtracking analysis manually with different case-specific heuristics. However, existing systems fail to fulfill two important system requirements to achieve effective backtracking analysis. First, flexible abstractions are needed to express various types of heuristics. Second, the system needs to be responsive in providing updates so that the progress of backtracking analysis can be frequently inspected, which typically involves multiple rounds of manual tuning. In this paper, we propose a novel system, APTrace, to meet both of the above requirements. As we demonstrate in the evaluation, security analysts can effectively express heuristics to reduce more than 99.5% of irrelevant events in the backtracking analysis of real-world attack cases. To improve the responsiveness of backtracking analysis, we present a novel execution-window partitioning algorithm that significantly reduces the waiting time between two consecutive updates (in particular, a 57-fold reduction in the top 1% waiting time).
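
For readers unfamiliar with backtracking analysis, the following is a minimal sketch of the general technique with a pluggable heuristic. It assumes events are `(time, source, destination)` tuples and that a heuristic is a plain predicate. The function name and event format are illustrative assumptions, and the sketch does not model APTrace's abstractions or its execution-window partitioning.

```python
def backtrack(events, start, start_time, keep=lambda event: True):
    """Collect events that could have causally influenced `start`.

    An event (t, src, dst) is relevant to an entity if it flows into that
    entity no later than the entity's current time bound. `keep` is a
    user-supplied heuristic that can prune events from the traversal.
    """
    latest = {start: start_time}  # entity -> latest time still of interest
    stack = [start]
    found = set()
    while stack:
        entity = stack.pop()
        bound = latest[entity]
        for event in events:
            t, src, dst = event
            if dst != entity or t > bound or not keep(event):
                continue
            found.add(event)
            # Revisit the source only if this event extends its time bound.
            if latest.get(src, float("-inf")) < t:
                latest[src] = t
                stack.append(src)
    return sorted(found)
```

Passing a predicate such as `lambda e: not e[1].endswith(".so")` shows how a case-specific heuristic prunes shared-library reads, one common source of dependency explosion.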

You Are What You Do: Hunting Stealthy Malware via Data Provenance Analysis

To subvert recent advances in perimeter and host security, the attacker community has developed and employed various attack vectors to make malware much more stealthy than before to penetrate the target system and prolong its presence. The advanced malware, or stealthy malware, impersonates or abuses benign applications and legitimate system tools to minimize its footprints in the target system. One example of such stealthy malware is fileless malware, which keeps its malicious logic mostly in the memory of well-trusted processes. It is difficult for traditional detection tools, such as malware scanners, to detect it, as the malware normally does not expose its malicious payload in a file and hides its malicious behaviors among the benign behaviors of the processes. In this paper, we present PROVDETECTOR, a provenance-based approach for detecting stealthy malware. The intuition behind PROVDETECTOR is that although stealthy malware may impersonate or abuse a benign process, it still exposes its malicious behaviors in the OS (operating system) level provenance. Based on this intuition, PROVDETECTOR first employs a novel selection algorithm to identify possibly malicious parts in the OS level provenance data of a process. Then, it applies a neural embedding and machine learning pipeline to automatically detect any behavior that deviates significantly from normal behaviors. We evaluate our approach on a large provenance dataset from an enterprise network and demonstrate that it achieves very high detection performance (an average F1 score of 0.974) of stealthy malware. Further, we conduct thorough interpretability studies to understand the internals of the learned machine learning models.

Temporal Context-aware Representation Learning for Question Routing

Question routing (QR) aims at recommending newly posted questions to the potential answerers who are most likely to answer the questions. The existing approaches that learn users’ expertise from their past question-answering activities usually suffer from challenges in two aspects: 1) multi-faceted expertise and 2) temporal dynamics in the answering behavior. This paper proposes a novel temporal context-aware model in multiple granularities of temporal dynamics that concurrently addresses the above challenges. Specifically, the temporal context-aware attention characterizes the answerer’s multi-faceted expertise in terms of the questions’ semantic and temporal information simultaneously. Moreover, the design of the multi-shift and multi-resolution module enables our model to handle temporal impact on different time granularities. Extensive experiments on six datasets from different domains demonstrate that the proposed model significantly outperforms competitive baseline models.

Progressive Processing of System-Behavioral Query

System monitoring has recently emerged as an effective way to analyze and counter advanced cyber attacks. The monitoring data records a series of system events and provides a global view of system behaviors in an organization. Querying such data to identify potential system risks and malicious behaviors helps security analysts detect and analyze abnormal system behaviors caused by attacks. However, since the data volume is huge, queries could easily run for a long time, making it difficult for system experts to obtain prompt and continuous feedback. To support interactive querying over system monitoring data, we propose ProbeQ, a system that progressively processes system-behavioral queries. It allows users to concisely compose queries that describe system behaviors and specify an update frequency to obtain partial results progressively. The query engine of ProbeQ is built based on a framework that partitions ProbeQ queries into sub-queries for parallel execution and retrieves partial results periodically based on the specified update frequency. We concretize the framework with three partition strategies that predict the workloads for sub-queries, where the adaptive workload partition strategy (AdWd) dynamically adjusts the predicted workloads for subsequent sub-queries based on the latest execution information. We evaluate the prototype system of ProbeQ on commonly used queries for suspicious behaviors over real-world system monitoring data, and the results show that the ProbeQ system can provide partial updates progressively (on average 9.1% deviation from the update frequencies) with only 1.2% execution overhead compared to the execution without progressive processing.
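
The notion of progressive processing can be illustrated with a small generator that splits a query's time range into sub-ranges and reports cumulative partial results after each one. This is a sketch under simplifying assumptions: events are dictionaries with a `time` field, partitions have a fixed width, and sub-queries run sequentially. It does not implement the workload prediction or the adaptive AdWd strategy described above, and the function name is hypothetical.

```python
def progressive_query(events, predicate, start, end, step):
    """Yield (upper time bound processed so far, matches so far).

    The time range [start, end) is split into fixed-width sub-queries of
    width `step`; each sub-query scans only its own half-open window.
    """
    matches = []
    lo = start
    while lo < end:
        hi = min(lo + step, end)
        matches.extend(e for e in events if lo <= e["time"] < hi and predicate(e))
        yield hi, list(matches)  # copy so earlier updates stay unchanged
        lo = hi
```

A caller can iterate over the generator and render each update as it arrives instead of waiting for the full scan to finish.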

Heterogeneous Graph Matching Networks for Unknown Malware Detection

Information systems have widely been the target of malware attacks. Traditional signature-based malicious program detection algorithms can only detect known malware and are prone to evasion techniques such as binary obfuscation, while behavior-based approaches rely heavily on the malware training samples and incur prohibitively high training cost. To address the limitations of existing techniques, we propose MatchGNet, a heterogeneous Graph Matching Network model to learn the graph representation and similarity metric simultaneously based on the invariant graph modeling of the program’s execution behaviors. We conduct a systematic evaluation of our model and show that it is accurate in detecting malicious program behavior and can help detect malware attacks with fewer false positives. MatchGNet outperforms the state-of-the-art algorithms in malware detection by generating 50% fewer false positives while keeping zero false negatives.

Attentional Heterogeneous Graph Neural Network: Application to Program Reidentification

A program or process is an integral part of almost every IT/OT system. Can we trust the identity/ID (e.g., executable name) of the program? To avoid detection, malware may disguise itself using the ID of a legitimate program, and a system tool (e.g., PowerShell) used by the attackers may have the fake ID of another common program, which is less sensitive. However, existing intrusion detection techniques often overlook this critical program reidentification problem (i.e., checking the program’s identity). In this paper, we propose an attentional heterogeneous graph neural network model (DeepHGNN) to verify the program’s identity based on its system behaviors. The key idea is to leverage the representation learning of the heterogeneous program behavior graph to guide the reidentification process. We formulate the program reidentification as a graph classification problem and develop an effective attentional heterogeneous graph embedding algorithm to solve it. Extensive experiments, using real-world enterprise monitoring data and real attacks, demonstrate the effectiveness of DeepHGNN across multiple popular metrics and its robustness to normal dynamic changes such as program version upgrades.

Countering Malicious Processes with Process-DNS Association

Modern malware and cyber attacks depend heavily on DNS services to make their campaigns reliable and difficult to track. Monitoring network DNS activities and blocking suspicious domains have been proven an effective technique in countering such attacks. However, recent successful campaigns reveal that attackers adapt by using seemingly benign domains and public web storage services to hide malicious activity. Also, the recent support for encrypted DNS queries gives attackers an easier means to hide malicious traffic from network-based DNS monitoring. We propose PDNS, an end-point DNS monitoring system based on DNS sensors deployed at each host in a network, along with a centralized backend analysis server. To detect such attacks, PDNS expands the monitored DNS activity context and examines the process context that triggered that activity. Specifically, each deployed PDNS sensor matches the domain name and the IP address related to the DNS query with the process ID, binary signature, loaded DLLs, and code signing information of the program that initiated it. We evaluate PDNS on a DNS activity dataset collected from 126 enterprise hosts and with data from multiple malware sources. Using ML classifiers, including a DNN, our results outperform most previous works with high detection accuracy: a true positive rate of 98.55% and a low false positive rate of 0.03%.
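
The process-DNS association described above amounts to joining each DNS query with the context of the process that issued it. The sketch below shows one possible shape for such a record. All class and field names are hypothetical, and joining on the process ID alone is a simplifying assumption rather than the PDNS sensor's actual matching logic.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DnsEvent:
    pid: int            # process that issued the query
    domain: str
    resolved_ip: str


@dataclass(frozen=True)
class ProcessContext:
    pid: int
    binary_hash: str                 # stand-in for the binary signature
    loaded_dlls: Tuple[str, ...]
    signer: Optional[str]            # None when the binary is unsigned


@dataclass(frozen=True)
class ProcessDnsRecord:
    dns: DnsEvent
    process: Optional[ProcessContext]  # None when no context is known


def associate(dns_events, processes):
    """Attach process context to each DNS event by matching on pid."""
    by_pid = {p.pid: p for p in processes}
    return [ProcessDnsRecord(e, by_pid.get(e.pid)) for e in dns_events]
```

Records of this kind could then be turned into feature vectors for a classifier; that step is omitted here because the abstract does not specify the features.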