Zhengzhang Chen NEC Labs America

Zhengzhang Chen

Senior Researcher

Data Science and System Security

Posts

Asymmetrically Hierarchical Networks with Attentive Interactions for Interpretable Review-based Recommendation

Recently, recommender systems have been able to emit substantially improved recommendations by leveraging user-provided reviews. Existing methods typically merge all reviews of a given user (item) into a long document, and then process user and item documents in the same manner. In practice, however, these two sets of reviews are notably different: users’ reviews reflect a variety of items that they have bought and are hence very heterogeneous in their topics, while an item’s reviews pertain only to that single item and are thus topically homogeneous. In this work, we develop a novel neural network model that properly accounts for this important difference by means of asymmetric attentive modules. The user module learns to attend to only those signals that are relevant with respect to the target item, whereas the item module learns to extract the most salient contents with regard to properties of the item. Our multi-hierarchical paradigm accounts for the fact that neither are all reviews equally useful, nor are all sentences within each review equally pertinent. Extensive experimental results on a variety of real datasets demonstrate the effectiveness of our method.

Heterogeneous Graph Matching Networks for Unknown Malware Detection

Information systems have widely been the target of malware attacks. Traditional signature-based malicious program detection algorithms can only detect known malware and are prone to evasion techniques such as binary obfuscation, while behavior-based approaches highly rely on the malware training samples and incur prohibitively high training cost. To address the limitations of existing techniques, we propose MatchGNet, a heterogeneous Graph Matching Network model to learn the graph representation and similarity metric simultaneously based on the invariant graph modeling of the program’s execution behaviors. We conduct a systematic evaluation of our model and show that it is accurate in detecting malicious program behavior and can help detect malware attacks with less false positives. MatchGNet outperforms the state-of-the-art algorithms in malware detection by generating 50% less false positives while keeping zero false negatives.

Attentional Heterogeneous Graph Neural Network: Application to Program Reidentification

Program or process is an integral part of almost every IT/OT system. Can we trust the identity/ID (e.g., executable name) of the program? To avoid detection, malware may disguise itself using the ID of a legitimate program, and a system tool (e.g., PowerShell) used by the attackers may have the fake ID of another common software, which is less sensitive. However, existing intrusion detection techniques often overlook this critical program reidentification problem (i.e., checking the program’s identity). In this paper, we propose an attentional heterogeneous graph neural network model (DeepHGNN) to verify the program’s identity based on its system behaviors. The key idea is to leverage the representation learning of the heterogeneous program behavior graph to guide the reidentification process. We formulate the program reidentification as a graph classification problem and develop an effective attentional heterogeneous graph embedding algorithm to solve it. Extensive experiments — using real-world enterprise monitoring data and real attacks — demonstrate the effectiveness of DeepHGNN across multiple popular metrics and the robustness to the normal dynamic changes like program version upgrades.

NODOZE: Combatting Threat Alert Fatigue with Automated Provenance Triage

Large enterprises are increasingly relying on threat detection softwares (e.g., Intrusion Detection Systems) to allow them to spot suspicious activities. These softwares generate alerts which must be investigated by cyber analysts to figure out if they are true attacks. Unfortunately, in practice, there are more alerts than cyber analysts can properly investigate. This leads to a “threat alert fatigue” or information overload problem where cyber analysts miss true attack alerts in the noise of false alarms.In this paper, we present NoDoze to combat this challenge using contextual and historical information of generated threat alert in an enterprise. NoDoze first generates a causal dependency graph of an alert event. Then, it assigns an anomaly score to each event in the dependency graph based on the frequency with which related events have happened before in the enterprise. NoDoze then propagates those scores along the edges of the graph using a novel network diffusion algorithm and generates a subgraph with an aggregate anomaly score which is used to triage alerts. Evaluation on our dataset of 364 threat alerts shows that NoDoze decreases the volume of false alarms by 86%, saving more than 90 hours of analysts’ time, which was required to investigate those false alarms. Furthermore, NoDoze generated dependency graphs of true alerts are 2 orders of magnitude smaller than those generated by traditional tools without sacrificing the vital information needed for the investigation. Our system has a low average runtime overhead and can be deployed with any threat detection software.

Collaborative Alert Ranking for Anomaly Detection

Given a large number of low-quality heterogeneous categorical alerts collected from an anomaly detection system, how to characterize the complex relationships between different alerts and deliver trustworthy rankings to end users? While existing techniques focus on either mining alert patterns or filtering out false positive alerts, it can be more advantageous to consider the two perspectives simultaneously in order to improve detection accuracy and better understand abnormal system behaviors. In this paper, we propose CAR, a collaborative alert ranking framework that exploits both temporal and content correlations from heterogeneous categorical alerts. CAR first builds a hierarchical Bayesian model to capture both short-term and long-term dependencies in each alert sequence. Then, an entity embedding-based model is proposed to learn the content correlations between alerts via their heterogeneous categorical attributes. Finally, by incorporating both temporal and content dependencies into a unified optimization framework, CAR ranks both alerts and their corresponding alert patterns. Our experiments-using both synthetic and real-world enterprise security alert data-show that CAR can accurately identify true positive alerts and successfully reconstruct the attack scenarios at the same time.

Behavior-based Community Detection: Application to Host Assessment in Enterprise Information Networks

Behavior-based Community Detection: Application to Host Assessment in Enterprise Information Networks Community detection in complex networks is a fundamental problem that attracts much attention across various disciplines. Previous studies have been mostly focusing on external connections between nodes (i.e., topology structure) in the network whereas largely ignoring internal intricacies (i.e., local behavior) of each node. A pair of nodes without any interaction can still share similar internal behaviors. For example, in an enterprise information network, compromised computers controlled by the same intruder often demonstrate similar abnormal behaviors even if they do not connect with each other. In this paper, we study the problem of community detection in enterprise information networks, where large-scale internal events and external events coexist on each host. The discovered host communities, capturing behavioral affinity, can benefit many comparative analysis tasks such as host anomaly assessment. In particular, we propose a novel community detection framework to identify behavior-based host communities in enterprise information networks, purely based on large-scale heterogeneous event data. We continue proposing an efficient method for assessing host’s anomaly level by leveraging the detected host communities. Experimental results on enterprise networks demonstrate the effectiveness of our model.

TINET: Transferring Knowledge between Invariant Networks

The latent behavior of an information system that can exhibit extreme events, such as system faults or cyber-attacks, is complex. Recently, the invariant network has shown to be a powerful way of characterizing complex system behaviors. Structures and evolutions of the invariance network, in particular, the vanishing correlations, can shed light on identifying causal anomalies and performing system diagnosis. However, due to the dynamic and complex nature of real-world information systems, learning a reliable invariant network in a new environment often requires continuous collecting and analyzing the system surveillance data for several weeks or even months. Although the invariant networks learned from old environments have some common entities and entity relationships, these networks cannot be directly borrowed for the new environment due to the domain variety problem. To avoid the prohibitive time and resource consuming network building process, we propose TINET, a knowledge transfer based model for accelerating invariant network construction. In particular, we first propose an entity estimation model to estimate the probability of each source domain entity that can be included in the final invariant network of the target domain. Then, we propose a dependency construction model for constructing the unbiased dependency relationships by solving a two-constraint optimization problem. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness and efficiency of TINET. We also apply TINET to a real enterprise security system for intrusion detection. TINET achieves superior detection performance at least 20 days lead-lag time in advance with more than 75% accuracy.