MULAN: Multi-modal Causal Structure Learning and Root Cause Analysis for Microservice Systems

Publication Date: 5/17/2024

Event: The Web Conference 2024 (WWW 2024)

Reference: pp. 4107–4116, 2024

Authors: Lecheng Zheng (University of Illinois Urbana-Champaign; NEC Laboratories America, Inc.), Zhengzhang Chen (NEC Laboratories America, Inc.), Jingrui He (University of Illinois Urbana-Champaign), Haifeng Chen (NEC Laboratories America, Inc.)

Abstract: Effective root cause analysis (RCA) is vital for swiftly restoring services, minimizing losses, and ensuring the smooth operation and management of complex systems. Previous data-driven RCA methods, particularly those employing causal discovery techniques, have primarily focused on constructing dependency or causal graphs for backtracking the root causes. However, these methods often fall short as they rely solely on data from a single modality, thereby resulting in suboptimal solutions. In this work, we propose MULAN, a unified multi-modal causal structure learning method designed to identify root causes in microservice systems. We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data. To explore intricate relationships across different modalities, we propose a contrastive learning-based approach to extract modality-invariant and modality-specific representations within a shared latent space. Additionally, we introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph. Finally, we employ random walk with restart to simulate system fault propagation and identify potential root causes. Extensive experiments on three real-world datasets validate the effectiveness of our proposed method.
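
Illustration: The final step named in the abstract, random walk with restart over a causal graph, can be sketched generically as below. This is not the authors' implementation. The four-node graph, its edge weights, the restart probability of 0.3, and the function name random_walk_with_restart are all assumptions made for illustration; in the paper the graph is learned from the multi-modal data rather than written by hand.

```python
def random_walk_with_restart(adj, start, restart_prob=0.3, tol=1e-10, max_iter=1000):
    """Return the stationary visiting probabilities of a walk that begins at
    `start` and jumps back to it with probability `restart_prob` at each step.

    adj[i][j] is the (non-negative) weight of stepping from node i to node j.
    """
    n = len(adj)
    # Row-normalise the weights into transition probabilities.
    # A node with no outgoing edges keeps the walker in place (self-loop).
    trans = []
    for i, row in enumerate(adj):
        total = sum(row)
        if total == 0:
            trans.append([1.0 if j == i else 0.0 for j in range(n)])
        else:
            trans.append([w / total for w in row])

    restart = [1.0 if i == start else 0.0 for i in range(n)]
    p = restart[:]
    for _ in range(max_iter):
        nxt = [
            (1 - restart_prob) * sum(p[i] * trans[i][j] for i in range(n))
            + restart_prob * restart[j]
            for j in range(n)
        ]
        converged = sum(abs(a - b) for a, b in zip(nxt, p)) < tol
        p = nxt
        if converged:
            break
    return p


# Hypothetical toy graph: node 0 is the KPI where the fault is observed,
# nodes 1-3 are services. Edges point from an effect to a candidate cause,
# so the walker moves "upstream" from the symptom.
adj = [
    [0.0, 0.8, 0.2, 0.0],  # KPI -> service 1, service 2
    [0.0, 0.0, 0.0, 1.0],  # service 1 -> service 3
    [0.0, 0.0, 0.0, 1.0],  # service 2 -> service 3
    [0.0, 0.0, 0.0, 0.0],  # service 3 has no upstream cause
]
scores = random_walk_with_restart(adj, start=0)

# Rank the services (excluding the KPI node) by visiting probability.
ranking = sorted(range(1, len(adj)), key=lambda i: scores[i], reverse=True)
print(ranking, [round(scores[i], 3) for i in ranking])
```

In this sketch the nodes visited most often by the restarting walker are treated as the most likely root causes; how the transition weights are derived from the learned causal graph is specific to the paper and is not reproduced here.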

Publication Link: https://dl.acm.org/doi/10.1145/3589334.3645442