Xiao Yu is a former researcher in the Data Science & System Security department of NEC Laboratories America, Inc.

Posts

SIGL: Securing Software Installations Through Deep Graph Learning

Many users implicitly assume that software can only be exploited after it is installed. However, recent supply-chain attacks demonstrate that application integrity must be ensured during installation itself. We introduce SIGL, a new tool for detecting malicious behavior during software installation. SIGL collects traces of system call activity, building a data provenance graph that it analyzes using a novel autoencoder architecture with a graph long short-term memory network (graph LSTM) for the encoder and a standard multilayer perceptron for the decoder. SIGL flags suspicious installations as well as the specific installation-time processes that are likely to be malicious. Using a test corpus of 625 malicious installers containing real-world malware, we demonstrate that SIGL has a detection accuracy of 96%, outperforming similar systems from industry and academia by up to 87% in precision and recall and 45% in accuracy. We also demonstrate that SIGL can pinpoint the processes most likely to have triggered malicious behavior, works on different audit platforms and operating systems, and is robust to training data contamination and adversarial attack. It can be used with application-specific models, even in the presence of new software versions, as well as application-agnostic meta-models that encompass a wide range of applications and installers.

This is Why We Can’t Cache Nice Things: Lightning-Fast Threat Hunting using Suspicion-Based Hierarchical Storage

Recent advances in causal analysis can accelerate incident response time, but only after a causal graph of the attack has been constructed. Unfortunately, existing causal graph generation techniques are mainly offline and may take hours or days to respond to investigator queries, creating greater opportunity for attackers to hide their attack footprint, gain persistency, and propagate to other machines. To address that limitation, we present Swift, a threat investigation system that provides high-throughput causality tracking and real-time causal graph generation capabilities. We design an in-memory graph database that enables space-efficient graph storage and online causality tracking with minimal disk operations. We propose a hierarchical storage system that keeps forensically-relevant part of the causal graph in main memory while evicting rest to disk. To identify the causal graph that is likely to be relevant during the investigation, we design an asynchronous cache eviction policy that calculates the most suspicious part of the causal graph and caches only that part in the main memory. We evaluated Swift on a real-world enterprise to demonstrate how our system scales to process typical event loads and how it responds to forensic queries when security alerts occur. Results show that Swift is scalable, modular, and answers forensic queries in real-time even when analyzing audit logs containing tens of millions of events.

Anomaly Detection on Web-User Behaviors through Deep Learning

The modern Internet has witnessed the proliferation of web applications that play a crucial role in the branding process among enterprises. Web applications provide a communication channel between potential customers and business products. However, web applications are also targeted by attackers due to sensitive information stored in these applications. Among web-related attacks, there exists a rising but more stealthy attack where attackers first access a web application on behalf of normal users based on stolen credentials. Then attackers follow a sequence of sophisticated steps to achieve the malicious purpose. Traditional security solutions fail to detect relevant abnormal behaviors once attackers login to the web application. To address this problem, we propose WebLearner, a novel system to detect abnormal web-user behaviors. As we demonstrate in the evaluation, WebLearner has an outstanding performance. In particular, it can effectively detect abnormal user behaviors with over 96% for both precision and recall rates using a reasonably small amount of normal training data.

VESSELS: Efficient and Scalable Deep Learning Prediction on Trusted Processors

Deep learning systems on the cloud are increasingly targeted by attacks that attempt to steal sensitive data. Intel SGX has been proven effective to protect the confidentiality and integrity of such data during computation. However, state-of-the-art SGX systems still suffer from substantial performance overhead induced by the limited physical memory of SGX. This limitation significantly undermines the usability of deep learning systems due to their memory-intensive characteristics.In this paper, we provide a systematic study on the inefficiency of the existing SGX systems for deep learning prediction with a focus on their memory usage. Our study has revealed two causes of the inefficiency in the current memory usage paradigm: large memory allocation and low memory reusability. Based on this insight, we present Vessels, a new system that addresses the inefficiency and overcomes the limitation on SGX memory through memory usage optimization techniques. Vessels identifies the memory allocation and usage patterns of a deep learning program through model analysis and creates a trusted execution environment with an optimized memory pool, which minimizes the memory footprint with high memory reusability. Our experiments demonstrate that, by significantly reducing the memory footprint and carefully scheduling the workloads, Vessels can achieve highly efficient and scalable deep learning prediction while providing strong data confidentiality and integrity with SGX.

You Are What You Do: Hunting Stealthy Malware via Data Provenance Analysis

To subvert recent advances in perimeter and host security, the attacker community has developed and employed various attack vectors to make malware much more stealthy than before to penetrate the target system and prolong its presence. The advanced malware, or stealthy malware, impersonates or abuses benign applications and legitimate system tools to minimize its footprints in the target system. One example of such stealthy malware is fileless malware, which resides its malicious logic mostly in the memory of well-trusted processes. It is difficult for traditional detection tools, such as malware scanners, to detect it, as the malware normally does not expose its malicious payload in a file and hides its malicious behaviors among the benign behaviors of the processes.In this paper, we present PROVDETECTOR, a provenance-based approach for detecting stealthy malware. The intuition behind PROVDETECTOR is that although a stealthy malware may impersonate or abuse a benign process, it still exposes its malicious behaviors in the OS (operating system) level provenance. Based on this intuition, PROVDETECTOR first employs a novel selection algorithm to identify possibly malicious parts in the OS level provenance data of a process. Then, it applies a neural embedding and machine learning pipeline to automatically detect any behavior that deviates significantly from normal behaviors. We evaluate our approach on a large provenance dataset from an enterprise network and demonstrate that it achieves very high detection performance (an average F1 score of 0.974) of stealthy malware. Further, we conduct thorough interpretability studies to understand the internals of the learned machine learning models.

Heterogeneous Graph Matching Networks for Unknown Malware Detection

Information systems have widely been the target of malware attacks. Traditional signature-based malicious program detection algorithms can only detect known malware and are prone to evasion techniques such as binary obfuscation, while behavior-based approaches highly rely on the malware training samples and incur prohibitively high training cost. To address the limitations of existing techniques, we propose MatchGNet, a heterogeneous Graph Matching Network model to learn the graph representation and similarity metric simultaneously based on the invariant graph modeling of the program’s execution behaviors. We conduct a systematic evaluation of our model and show that it is accurate in detecting malicious program behavior and can help detect malware attacks with less false positives. MatchGNet outperforms the state-of-the-art algorithms in malware detection by generating 50% less false positives while keeping zero false negatives.