Mix-CLAP: Adaptive Fusion of Knowledge-Distilled Audio Embeddings for Noise-Aware Audio-Language Models
Real-world deployment requires sound event and acoustic scene classification systems to remain reliable in noisy, diverse environments on resource-constrained devices. Although contrastive language-audio pretraining (CLAP) models with Transformer-based audio encoders achieve strong zero-shot performance,


