The University of Utah, established in 1850, is a public research university in Salt Lake City. It fosters student success by preparing individuals from diverse backgrounds for lives of impact as leaders and citizens, generating and sharing new knowledge, discoveries, and innovations, and engaging local and global communities to promote education, health, and quality of life. NEC Labs America collaborates with the University of Utah on federated reinforcement learning and secure AI for industrial optimization. Please read about our latest news and collaborative publications with the University of Utah.

Posts

Interdependent Causal Networks for Root Cause Localization

The goal of root cause analysis is to identify the underlying causes of system problems by discovering and analyzing the causal structure from system monitoring data. It is indispensable for maintaining the stability and robustness of large-scale complex systems. Existing methods mainly focus on the construction of a single effective isolated causal network, whereas many real-world systems are complex and exhibit interdependent structures (i.e., multiple networks of a system are interconnected by cross-network links). In interdependent networks, the malfunctioning effects of problematic system entities can propagate to other networks or different levels of system entities. Consequently, ignoring the interdependency results in suboptimal root cause analysis outcomes.In this paper, we propose REASON, a novel framework that enables the automatic discovery of both intra-level (i.e., within-network) and inter-level (i.e., across-network) causal relationships for root cause localization. REASON consists of Topological Causal Discovery (TCD) and Individual Causal Discovery (ICD). The TCD component aims to model the fault propagation in order to trace back to the root causes. To achieve this, we propose novel hierarchical graph neural networks to construct interdependent causal networks by modeling both intra-level and inter-level non-linear causal relations. Based on the learned interdependent causal networks, we then leverage random walk with restarts to model the network propagation of a system fault. The ICD component focuses on capturing abrupt change patterns of a single system entity. This component examines the temporal patterns of each entity’s metric data (i.e., time series), and estimates its likelihood of being a root cause based on the Extreme Value theory. Combining the topological and individual causal scores, the top K system entities are identified as root causes. Extensive experiments on three real-world datasets validate the effectiveness of the proposed framework.

APTrace: A Responsive System for Agile Enterprise Level Causality Analysis

While backtracking analysis has been successful in assisting the investigation of complex security attacks, it faces a critical dependency explosion problem. To address this problem, security analysts currently need to tune backtracking analysis manually with different case-specific heuristics. However, existing systems fail to fulfill two important system requirements to achieve effective backtracking analysis. First, there need flexible abstractions to express various types of heuristics. Second, the system needs to be responsive in providing updates so that the progress of backtracking analysis can be frequently inspected, which typically involves multiple rounds of manual tuning. In this paper, we propose a novel system, APTrace, to meet both of the above requirements. As we demonstrate in the evaluation, security analysts can effectively express heuristics to reduce more than 99.5% of irrelevant events in the backtracking analysis of real-world attack cases. To improve the responsiveness of backtracking analysis, we present a novel execution-window partitioning algorithm that significantly reduces the waiting time between two consecutive updates (especially, 57 times reduction for the top 1% waiting time).