InfuserKI: Enhancing Large Language Models with Knowledge Graphs via Infuser-Guided Knowledge Integration (VLDB 2024)

Though Large Language Models (LLMs) have shown remarkable open-generation capabilities across diverse domains, they struggle with knowledge-intensive tasks. To alleviate this issue, knowledge integration methods have been proposed to enhance LLMs with domain-specific knowledge graphs using external modules. However, they suffer from data inefficiency as they require both known and unknown knowledge for fine-tuning. Thus, we study a novel problem of integrating unknown knowledge into LLMs efficiently without unnecessary overlap of known knowledge. Injecting new knowledge poses the risk of forgetting previously acquired knowledge. To tackle this, we propose a novel Infuser-Guided Knowledge Integration (InfuserKI) framework that utilizes transformer internal states to determine whether to enhance the original LLM output with additional information, thereby effectively mitigating knowledge forgetting. Evaluations on the UMLS-2.5k and MetaQA domain knowledge graphs demonstrate that InfuserKI can effectively acquire new knowledge and outperform state-of-the-art baselines by 9% and 6%, respectively, in reducing knowledge forgetting.
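The gating idea described above, where internal states decide whether the original output is enhanced with additional information, can be illustrated with a minimal sketch. This is not the InfuserKI implementation: the linear gate, the names `infuser_gate` and `gated_output`, and the list-based vectors are all assumptions made for illustration.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def infuser_gate(hidden, gate_weights, gate_bias=0.0):
    """Hypothetical linear gate: maps a hidden state to a score in (0, 1)."""
    score = sum(h * w for h, w in zip(hidden, gate_weights)) + gate_bias
    return sigmoid(score)

def gated_output(hidden, adapter_delta, gate_weights, gate_bias=0.0):
    """Add the adapter's contribution only as strongly as the gate allows.

    A gate near 0 leaves the original hidden state untouched (knowledge the
    model is assumed to already have); a gate near 1 adds the full delta.
    """
    g = infuser_gate(hidden, gate_weights, gate_bias)
    return [h + g * d for h, d in zip(hidden, adapter_delta)]

if __name__ == "__main__":
    hidden = [1.0, 2.0]
    delta = [0.5, -0.5]
    # With zero weights and zero bias the gate is exactly 0.5.
    print(gated_output(hidden, delta, [0.0, 0.0]))
```

When the gate saturates near zero, the output equals the original state, which is one simple way to picture how forgetting of existing knowledge could be limited.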

POND: Multi-Source Time Series Domain Adaptation with Information-Aware Prompt Tuning

Time series domain adaptation stands as a pivotal and intricate challenge with diverse applications, including but not limited to human activity recognition, sleep stage classification, and machine fault diagnosis. Despite the numerous domain adaptation techniques proposed to tackle this complex problem, they primarily focus on domain adaptation from a single source domain. Yet, it is more crucial to investigate domain adaptation from multiple domains due to the potential for greater improvements. To address this, three important challenges need to be overcome: (1) the lack of exploration into utilizing domain-specific information for domain adaptation, (2) the difficulty of learning domain-specific information that changes over time, and (3) the difficulty of evaluating learned domain-specific information. To tackle these challenges simultaneously, we introduce PrOmpt-based domaiN Discrimination (POND), the first framework to utilize prompts for time series domain adaptation. Specifically, to address Challenge 1, we extend the idea of prompt tuning to time series analysis and learn prompts to capture common and domain-specific information from all source domains. To handle Challenge 2, we introduce a conditional module for each source domain to generate prompts from time series input data. For Challenge 3, we propose two criteria to select good prompts, which are used to choose the most suitable source domain for domain adaptation. The efficacy and robustness of our proposed POND model are extensively validated through experiments across 50 scenarios encompassing four datasets. Experimental results demonstrate that our proposed POND model outperforms all state-of-the-art comparison methods by up to 66% on the F1-score.

Semi-Automatic Line-System Provisioning with Integrated Physical-Parameter-Aware Methodology: Field Verification and Operational Feasibility

We propose methods and an architecture to conduct measurements and optimize newly installed optical fiber line systems semi-automatically using integrated physics-aware technologies in a data center interconnection (DCI) transmission scenario. We demonstrate, for the first time to our knowledge, digital longitudinal monitoring (DLM) and optical line system (OLS) physical parameter calibration working together in real time to extract physical link parameters for fast provisioning of optical fiber line systems. Our methodology has the following advantages over traditional design: a minimized footprint at user sites, accurate estimation of the necessary optical network characteristics via complementary telemetry technologies, and the capability to conduct all operation work remotely. The last feature is crucial, as it enables remote operation to implement network design settings for immediate response to quality of transmission (QoT) degradation and reversion in the case of unforeseen problems. We successfully performed semi-automatic line system provisioning over field fiber network facilities at Duke University, Durham, North Carolina. The tasks of parameter retrieval, equipment setting optimization, and system setup/provisioning were completed within 1 h. The field operation was supervised by on-duty personnel who could access the system remotely from different time zones. By comparing Q-factor estimates calculated from the extracted link parameters with measured results from 400G transceivers, we confirmed that our methodology reduces the QoT prediction errors (0.3 dB) compared with existing designs (0.6 dB).

Introducing the Trustworthy Generative AI Project: Pioneering the Future of Compositional Generation and Reasoning

We are thrilled to announce the launch of our latest research initiative, the Trustworthy Generative AI Project. This ambitious project is set to revolutionize how we interact with multimodal content by developing cutting-edge generative models capable of compositional generation and reasoning across text, images, reports, and even 3D videos.

Distantly-Supervised Joint Extraction with Noise-Robust Learning

Joint entity and relation extraction is a process that identifies entity pairs and their relations using a single model. We focus on the problem of joint extraction in distantly-labeled data, whose labels are generated by aligning entity mentions with the corresponding entity and relation tags using a knowledge base (KB). One key challenge is the presence of noisy labels arising from both incorrect entity and relation annotations, which significantly impairs the quality of supervised learning. Existing approaches, which either consider only one source of noise or make decisions using external knowledge, cannot make good use of significant information in the training data. We propose DENRL, a generalizable framework that 1) incorporates a lightweight transformer backbone into a sequence labeling scheme for joint tagging, and 2) employs a noise-robust framework that regularizes the tagging model with significant relation patterns and entity-relation dependencies, then iteratively self-adapts to instances with less noise from both sources. Surprisingly, experiments on two benchmark datasets show that DENRL, using merely its own parametric distribution and simple data-driven heuristics, outperforms large language model-based baselines by a large margin with better interpretability.
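To make the source of label noise concrete, here is a toy sketch of distant labeling. It is not DENRL's pipeline: the one-entry knowledge base, the substring matching, and the function name `distant_label` are assumptions chosen only to show how aligning mentions against a KB can attach a relation to a sentence that does not express it.

```python
# Hypothetical toy knowledge base: (head entity, tail entity) -> relation.
KB = {("Barack Obama", "Hawaii"): "born_in"}

def distant_label(sentence, kb):
    """Label a sentence with every KB relation whose two entities both appear.

    Matching is plain substring search (an illustrative simplification),
    so the label says nothing about what the sentence actually states.
    """
    labels = []
    for (head, tail), relation in kb.items():
        if head in sentence and tail in sentence:
            labels.append((head, relation, tail))
    return labels

if __name__ == "__main__":
    # Correct label: the sentence does express the relation.
    print(distant_label("Barack Obama was born in Hawaii.", KB))
    # Noisy label: same entity pair, but the sentence is about a visit.
    print(distant_label("Barack Obama visited Hawaii last week.", KB))
```

Both sentences receive the same `born_in` label, and only the first one deserves it. Noise of this kind is what a noise-robust learner has to cope with.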

Spatially Informed Gene Signatures for Response to Immunotherapy in Melanoma

We aim to improve the prediction of response or resistance to immunotherapies in patients with melanoma. This goal is based on the hypothesis that current gene signatures predicting immunotherapy outcomes show only modest accuracy due to the lack of spatial information about cellular functions and molecular processes within tumors and their microenvironment.

Introducing Our New Project: Time Series Language Model for Explainable AI

Our new project, Time Series Language Model for Explainable AI, represents a significant leap forward in the field of forecasting and explainable AI. By combining advanced forecasting techniques with explainable AI, we are paving the way for a future where data-driven insights are not only accurate but also comprehensible and actionable.

Towards Counterfactual Fairness-aware Domain Generalization in Changing Environments

Domain generalization is a commonplace challenge in machine learning, and in practical scenarios the data distribution might progressively evolve across a continuum of sequential domains. While current methodologies primarily concentrate on bolstering model effectiveness within these new domains, they tend to neglect issues of fairness throughout the learning process. In response, we propose an innovative framework known as Disentanglement for Counterfactual Fairness-aware Domain Generalization (DCFDG). This approach removes domain-specific information and sensitive information from the embedded representation of classification features. To scrutinize the intricate interplay between semantic information, domain-specific information, and sensitive attributes, we systematically partition the exogenous factors into four latent variables. By incorporating fairness regularization, we utilize semantic information exclusively for classification purposes. Empirical validation on synthetic and real-world datasets substantiates the efficacy of our approach, demonstrating higher accuracy while preserving fairness as the continuous domains evolve.

Agentic LLMs for AI Orchestration Project: Revolutionizing Complex Workflows

The development of Agentic LLMs for AI Orchestration represents a significant advancement in artificial intelligence. By seamlessly integrating computer vision, logic, and compute modules, our LLM is poised to revolutionize the way complex workflows are managed and executed. Supported by robust research and driven by innovative training methodologies, our agentic LLM sets a new standard in AI orchestration, offering unparalleled performance and adaptability.

First Field Trial of Hybrid Fiber Sensing with Data Transmission Resulting in Enhanced Sensing Sensitivity and Spatial Resolution

Optical fiber cables, initially designed for telecommunications, are increasingly repurposed for environmental monitoring using distributed fiber sensing technologies [1,2]. Distributed acoustic sensing (DAS) based on phase optical time domain reflectometry (φ-OTDR) of Rayleigh backscatter enables various applications, including traffic monitoring [3], railway [4] and perimeter intrusion detection [5], and cable damage detection [6]. The sensing range of DAS is typically limited to several tens of kilometers due to the low optical signal-to-noise ratio (OSNR) of the received backscatter. Additionally, compatibility of DAS with existing fiber infrastructure is hindered by the unidirectional operation of inline amplifiers with isolators. An alternative approach based on forward transmission was recently proposed [7, 8], which involves probing an optical fiber with a continuous wave (CW) signal and measuring changes in either the received phase or the state of polarization (SOP) to detect cumulative vibration-induced strain. Unlike backscatter measurement, forward transmission methods have a longer sensing range due to higher OSNR, and are compatible with existing telecom infrastructure. However, potential challenges include limited localization accuracy and the low number of simultaneous events that can be discriminated and localized [7]. In this paper, we propose a new concept of “hybrid fiber sensing” for long-haul DWDM networks where the repeater node architecture combines DAS with forward-phase sensing (FPS), enhancing sensitivity by 32%. This approach achieves a multi-span, fine-resolution fiber sensing system. The FPS method detects vibration anomalies and coarsely localizes their position to within a fiber span. A segmented DAS then refines the position estimate and provides a precise waveform measurement. Consequently, the spatial resolution improves from one fiber span of 80 km to 4 m.
Our scheme is validated on a test bed comprising lab spools and field fibers, demonstrating the capability to detect and monitor field construction while simultaneously supporting full C-band 400-Gb/s real-time (RT) data transmission.
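The coarse-to-fine localization described above can be put in numbers with a short sketch. It only illustrates the arithmetic, under the assumptions that all spans are exactly 80 km long and that the DAS reports a resolution-cell index within the span; `locate_event` and its arguments are hypothetical names, not part of the reported system.

```python
SPAN_LENGTH_M = 80_000   # one fiber span (80 km), the coarse FPS resolution
DAS_RESOLUTION_M = 4     # fine DAS resolution reported in the abstract

def bins_per_span():
    """Number of DAS resolution cells inside one span."""
    return SPAN_LENGTH_M // DAS_RESOLUTION_M

def locate_event(span_index, das_bin_index):
    """Combine a coarse span index (from FPS) with a fine bin index (from DAS).

    Returns the (start, end) distance in meters from the start of the link,
    assuming equal-length spans (an illustrative simplification).
    """
    if span_index < 0:
        raise ValueError("span_index must be non-negative")
    if not 0 <= das_bin_index < bins_per_span():
        raise ValueError("das_bin_index outside the span")
    start = span_index * SPAN_LENGTH_M + das_bin_index * DAS_RESOLUTION_M
    return start, start + DAS_RESOLUTION_M

if __name__ == "__main__":
    print(bins_per_span())          # resolution cells per 80 km span
    print(locate_event(2, 5_000))   # event in the third span
```

Under these assumptions, refining from a whole span to a single DAS cell narrows the position estimate by a factor of 80,000 / 4 = 20,000.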