Mix-CLAP: Adaptive Fusion of Knowledge-Distilled Audio Embeddings for Noise-Aware Audio-Language Models

Publication Date: 5/4/2026

Event: 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)

Reference: pp. 20502-20505, 2026

Authors: Wataru Kohno, NEC Laboratories America, Inc.; Shaobo Han, NEC Laboratories America, Inc.; Noriyuki Tonami, NEC Corporation; Tingfeng Li, NEC Laboratories America, Inc.; Jingchen Sun, The State University of New York at Buffalo; Ting Wang, NEC Laboratories America, Inc.

Abstract: Real-world deployment requires sound event and acoustic scene classification systems to remain reliable in noisy, diverse environments on resource-constrained devices. Although contrastive language-audio pretraining (CLAP) models with Transformer-based audio encoders achieve strong zero-shot performance, their computational cost hinders deployment. In this paper, we propose Mix-CLAP, a computationally efficient, noise-aware CLAP model with knowledge-distilled audio encoders. Our method comprises: (1) two-stage knowledge distillation from teacher embeddings into two lightweight student encoders, one trained on clean audio and the other on noisy audio, and (2) adaptive inference that fuses the two embeddings via a fusion parameter chosen by minimizing the parameterized entropy at test time. Experiments show that Mix-CLAP with MobileNetV3-based audio encoders greatly improves computational efficiency while achieving an average accuracy of 52.58%, comparable to the 52.83% of the Transformer-based CLAP model, on ESC-50 recordings captured under diverse conditions with different devices, including microphones and fiber-optic distributed acoustic sensors, making it suitable for real-world, resource-constrained applications.
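
The adaptive-inference step described in the abstract lends itself to a short illustration. Below is a minimal sketch in PyTorch, not the authors' implementation: it assumes audio embeddings of shape (batch, dim) from the clean and noisy student encoders and text-prompt embeddings of shape (num_classes, dim), and the function and hyperparameter names (`adaptive_fusion_inference`, `steps`, `lr`, `temperature`) are hypothetical. It fuses the two embeddings with a scalar weight alpha in (0, 1) and updates alpha at test time by minimizing the entropy of the zero-shot class distribution, the general idea the abstract outlines.

```python
import torch
import torch.nn.functional as F


def adaptive_fusion_inference(clean_emb, noisy_emb, text_embs,
                              steps=10, lr=0.1, temperature=0.07):
    """Hypothetical sketch of entropy-minimizing embedding fusion.

    clean_emb:  (batch, dim) embeddings from the clean-audio student
    noisy_emb:  (batch, dim) embeddings from the noisy-audio student
    text_embs:  (num_classes, dim) CLAP text-prompt embeddings
    """
    # Unconstrained parameter mapped to alpha in (0, 1) via sigmoid.
    theta = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([theta], lr=lr)

    clean_emb = F.normalize(clean_emb, dim=-1)
    noisy_emb = F.normalize(noisy_emb, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)

    for _ in range(steps):
        alpha = torch.sigmoid(theta)
        # Convex combination of the two student embeddings.
        fused = F.normalize(alpha * clean_emb + (1 - alpha) * noisy_emb, dim=-1)
        # Cosine similarity to the class prompts gives zero-shot logits.
        logits = fused @ text_embs.T / temperature
        probs = logits.softmax(dim=-1)
        # Mean Shannon entropy of the predicted class distribution.
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
        opt.zero_grad()
        entropy.backward()
        opt.step()

    with torch.no_grad():
        alpha = torch.sigmoid(theta)
        fused = F.normalize(alpha * clean_emb + (1 - alpha) * noisy_emb, dim=-1)
        probs = (fused @ text_embs.T / temperature).softmax(dim=-1)
    return probs, alpha.item()
```

Because only the scalar fusion weight is optimized, this kind of test-time adaptation costs a handful of lightweight forward passes and touches no encoder weights, which fits the resource-constrained setting the abstract targets.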

Publication Link: https://ieeexplore.ieee.org/document/11462496