Contrastive Language-Audio Pretraining (CLAP) aligns audio data with language representations to improve multimodal understanding. At NEC Labs America, this approach supports speech recognition, environmental sound classification, and cross-modal retrieval. By training AI systems to link sound with descriptive text, researchers build more generalizable models that can interpret diverse signals and enhance multimodal applications across sensing and communication domains.
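The training objective behind this alignment is contrastive: matched audio-caption pairs are pulled together in a shared embedding space while mismatched pairs are pushed apart. Below is a minimal sketch of a CLIP-style symmetric contrastive loss in PyTorch; the function name, tensor shapes, and temperature value are illustrative assumptions, not NEC's implementation.

```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (CLIP-style) loss over a batch of paired
    audio/text embeddings, where row i of each tensor corresponds to
    the same audio-caption pair."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Pull matched pairs together and push mismatched pairs apart,
    # in both the audio-to-text and text-to-audio directions.
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2t + loss_t2a)

# Toy usage: a batch of 8 paired 512-dimensional embeddings.
audio = torch.randn(8, 512)
text = torch.randn(8, 512)
print(clap_contrastive_loss(audio, text).item())
```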

Posts

Mix-CLAP: Adaptive Fusion of Knowledge-Distilled Audio Embeddings for Noise-Aware Audio-Language Models

Real-world deployment requires sound event and acoustic scene classification systems to remain reliable in noisy, diverse environments on resource-constrained devices. Although contrastive language-audio pretraining (CLAP) models with Transformer-based audio encoders achieve strong zero-shot performance, their computational cost hinders deployment. In this paper, we propose Mix-CLAP, a computationally efficient, noise-aware CLAP model with knowledge-distilled audio encoders. Our method includes: (1) two-stage knowledge distillation from teacher embeddings to two lightweight student encoders, one trained on clean audio and the other on noisy audio, and (2) adaptive inference that combines their embeddings with a fusion parameter and minimizes the parameterized entropy at test time. Experiments show that Mix-CLAP with MobileNetV3-based audio encoders greatly improves computational efficiency while achieving an average accuracy of 52.58%, comparable to the Transformer-based CLAP model's 52.83%, on ESC-50 datasets recorded with different devices, including microphones and fiber-optic distributed acoustic sensors, under diverse conditions, making it suitable for real-world, resource-constrained applications.
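The adaptive-inference step can be sketched as follows: fuse the two student embeddings with a weight alpha, and pick, at test time, the alpha whose zero-shot class distribution has the lowest entropy. This minimal PyTorch sketch uses a grid search over alpha and a fixed softmax temperature; the paper's exact parameterization and optimization of the entropy objective may differ, and all names and values here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def fuse_by_entropy(e_clean, e_noisy, text_emb, alphas=torch.linspace(0, 1, 21)):
    """Choose the fusion weight alpha whose fused audio embedding yields the
    lowest-entropy zero-shot class distribution against the text embeddings.
    e_clean, e_noisy: (D,) student embeddings; text_emb: (C, D) class prompts."""
    text_emb = F.normalize(text_emb, dim=-1)
    best_alpha, best_entropy, best_probs = None, float("inf"), None
    for alpha in alphas:
        # Convex combination of the clean- and noisy-audio student embeddings.
        fused = F.normalize(alpha * e_clean + (1 - alpha) * e_noisy, dim=-1)
        probs = F.softmax(fused @ text_emb.t() / 0.07, dim=-1)  # zero-shot scores
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
        if entropy < best_entropy:
            best_alpha, best_entropy, best_probs = float(alpha), entropy, probs
    return best_alpha, best_probs

# Toy usage: 512-d embeddings against 50 ESC-50-style class prompts.
e_c, e_n = torch.randn(512), torch.randn(512)
classes = torch.randn(50, 512)
alpha, probs = fuse_by_entropy(e_c, e_n, classes)
print(alpha, probs.argmax().item())
```

A low-entropy (confident) prediction serves as an unsupervised proxy for having chosen the better mix of the clean- and noisy-audio encoders, which is why it can be optimized without labels at test time.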

Text-guided Device-realistic Sound Generation for Fiber-based Sound Event Classification

Recent advancements in unique acoustic sensing devices and large-scale audio recognition models have unlocked new possibilities for environmental sound monitoring and detection. However, applying pretrained models to non-conventional acoustic sensors results in performance degradation due to domain shifts, caused by differences in frequency response and noise characteristics from the original training data. In this study, we introduce a text-guided framework for generating new datasets to efficiently retrain models for these non-conventional sensors. Our approach integrates text-conditional audio generative models with two additional steps: (1) selecting audio samples based on text input to match the desired sounds, and (2) applying domain transfer techniques using recorded impulse responses and background noise to simulate the characteristics of the sensors. We demonstrate this process by generating emulated signals for fiber-optic Distributed Acoustic Sensors (DAS), creating datasets similar to the recorded ESC-50 dataset. The generated signals are then used to train a classifier, which outperforms few-shot learning approaches in environmental sound classification.
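The domain-transfer step amounts to imprinting the target sensor's characteristics onto generated audio: convolve each clip with a recorded impulse response, then mix in recorded background noise. Here is a minimal NumPy/SciPy sketch under those assumptions; the function name, SNR-based mixing, and toy signals are illustrative, not the paper's exact procedure.

```python
import numpy as np
from scipy.signal import fftconvolve

def emulate_sensor(clean, impulse_response, noise, snr_db=10.0):
    """Emulate a non-conventional sensor: convolve a generated clip with a
    recorded impulse response, then add recorded background noise at a
    target SNR. All arrays are 1-D float waveforms at the same sample rate."""
    # Apply the sensor's frequency response via convolution with its IR.
    x = fftconvolve(clean, impulse_response, mode="full")[: len(clean)]
    # Tile or trim the noise to match, then scale it to the requested SNR.
    noise = np.resize(noise, len(x))
    p_sig = np.mean(x ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10.0)))
    return x + gain * noise

# Toy usage with synthetic 1-second signals at 16 kHz.
rng = np.random.default_rng(0)
clip = rng.standard_normal(16000)          # stand-in for a generated clip
ir = np.exp(-np.linspace(0, 8, 512)) * rng.standard_normal(512)
bg = rng.standard_normal(16000)            # stand-in for recorded DAS noise
emulated = emulate_sensor(clip, ir, bg, snr_db=5.0)
```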