Interdependent Causal Networks for Root Cause Localization

Publication Date: 8/10/2023

Event: 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Reference: pp. 5051-5060, 2023

Authors: Dongjie Wang, NEC Laboratories America, Inc., University of Central Florida; Zhengzhang Chen, NEC Laboratories America, Inc.; Jingchao Ni, Amazon Web Services; Tong Liang, NEC Laboratories America, Inc.; Zheng Wang, University of Utah; Yanjie Fu, University of Central Florida; Haifeng Chen, NEC Laboratories America, Inc.

Abstract: The goal of root cause analysis is to identify the underlying causes of system problems by discovering and analyzing the causal structure from system monitoring data. It is indispensable for maintaining the stability and robustness of large-scale complex systems. Existing methods mainly focus on the construction of a single effective isolated causal network, whereas many real-world systems are complex and exhibit interdependent structures (i.e., multiple networks of a system are interconnected by cross-network links). In interdependent networks, the malfunctioning effects of problematic system entities can propagate to other networks or different levels of system entities. Consequently, ignoring the interdependency results in suboptimal root cause analysis outcomes. In this paper, we propose REASON, a novel framework that enables the automatic discovery of both intra-level (i.e., within-network) and inter-level (i.e., across-network) causal relationships for root cause localization. REASON consists of Topological Causal Discovery (TCD) and Individual Causal Discovery (ICD). The TCD component aims to model the fault propagation in order to trace back to the root causes. To achieve this, we propose novel hierarchical graph neural networks to construct interdependent causal networks by modeling both intra-level and inter-level non-linear causal relations. Based on the learned interdependent causal networks, we then leverage random walk with restarts to model the network propagation of a system fault. The ICD component focuses on capturing abrupt change patterns of a single system entity. This component examines the temporal patterns of each entity’s metric data (i.e., time series), and estimates its likelihood of being a root cause based on extreme value theory. Combining the topological and individual causal scores, the top K system entities are identified as root causes. Extensive experiments on three real-world datasets validate the effectiveness of the proposed framework.
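
Illustrative sketch (not the authors' implementation): the abstract describes scoring entities by a random walk with restarts over a learned causal network and then combining that topological score with a per-entity individual score to pick the top K. The Python below is a minimal toy version of that idea under several assumptions that do not come from the paper: the graph is a small hand-written adjacency matrix whose edges already point from effect to cause, the individual scores are supplied as plain numbers instead of being derived from extreme value theory, and the names random_walk_with_restart, combine_scores, top_k, restart, and gamma are all hypothetical.

```python
def random_walk_with_restart(adj, start, restart=0.3, iters=200):
    """Visit probabilities of a walker that restarts at `start`.

    adj[i][j] > 0 means the walker can step from node i to node j.
    Assumption: edges already point from effect to cause, so the walk
    drifts from the observed symptom toward candidate root causes.
    """
    n = len(adj)
    # Row-normalize weights into transition probabilities; a node with
    # no outgoing weight keeps the walker in place.
    trans = []
    for i, row in enumerate(adj):
        total = sum(row)
        if total == 0:
            trans.append([1.0 if j == i else 0.0 for j in range(n)])
        else:
            trans.append([w / total for w in row])

    p = [1.0 if i == start else 0.0 for i in range(n)]
    for _ in range(iters):
        nxt = [0.0] * n
        for i in range(n):
            for j in range(n):
                nxt[j] += (1.0 - restart) * p[i] * trans[i][j]
        nxt[start] += restart  # restart mass returns to the start node
        p = nxt
    return p


def combine_scores(topological, individual, gamma=0.5):
    """Weighted sum of max-normalized scores (gamma is a hypothetical weight)."""
    def normalize(xs):
        hi = max(xs)
        return [x / hi if hi > 0 else 0.0 for x in xs]

    t, s = normalize(topological), normalize(individual)
    return [gamma * a + (1.0 - gamma) * b for a, b in zip(t, s)]


def top_k(scores, names, k):
    """Names of the k highest-scoring entities, best first."""
    ranked = sorted(zip(names, scores), key=lambda pair: -pair[1])
    return [name for name, _ in ranked[:k]]


# Toy system: a KPI symptom, two services, and a shared database.
names = ["kpi", "service_a", "service_b", "database"]
adj = [
    [0, 1, 1, 0],  # kpi is affected by both services
    [0, 0, 0, 1],  # service_a is affected by the database
    [0, 0, 0, 1],  # service_b is affected by the database
    [0, 0, 0, 0],  # database has no upstream cause in this toy graph
]
scores_topo = random_walk_with_restart(adj, start=0)
scores_indiv = [0.0, 0.1, 0.2, 0.9]  # made-up stand-ins for ICD scores
ranking = top_k(combine_scores(scores_topo, scores_indiv), names, k=1)
```

In this toy setup the walk starts at the symptom node, so nodes that many effect-to-cause paths lead into accumulate the most visit probability. A real pipeline would learn the graph from monitoring data and would likely exclude the symptom node itself from the final ranking.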

Publication Link: https://dl.acm.org/doi/10.1145/3580305.3599849