Cross-architecture knowledge distillation for speech enhancement: From CMGAN to U-Net

Abstract

Speech enhancement (SE) plays a crucial role in improving speech clarity and intelligibility in noisy environments, benefiting applications such as telecommunications, assistive hearing, voice-activated systems, and automatic speech recognition. Although transformer- and conformer-based models achieve state-of-the-art performance, their complexity makes real-time deployment on resource-constrained devices impractical. In contrast, convolutional neural networks offer efficient on-device deployment due to their lower computational demands and better hardware compatibility. To bridge the gap between these architectures, knowledge distillation (KD) has emerged as a promising model compression technique, enabling knowledge transfer from a complex teacher model to a lightweight student model. However, conventional KD methods typically assume architectural similarity between teacher and student, limiting their effectiveness in cross-architecture settings. To mitigate the challenges posed by heterogeneous architectures, we introduce an auxiliary teacher model as an intermediary between the primary teacher and the student, facilitating smoother knowledge transfer by aligning intermediate representations. Our experiments on the VoiceBank+DEMAND and LibriMix datasets demonstrate that this intermediary-based KD approach significantly improves the student model's performance, outperforming direct knowledge transfer. To the best of our knowledge, this is the first work to explore KD across heterogeneous architectures for SE, paving the way for efficient yet high-performing speech enhancement models.
Khanh Nguyen, Huy Tien Nguyen
https://doi.org/10.1016/j.neucom.2025.130798
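The abstract outlines the intermediary scheme but not the training objective. The sketch below shows, in PyTorch, one common way such a distillation loss could be composed: a task loss against the clean target, an output-level KD term from the primary teacher, and a feature-alignment term against the auxiliary teacher's intermediate representation, with a 1x1 convolution reconciling the channel widths of the two architectures. The function name intermediary_kd_loss, the L1/MSE choices, the projection layer, and the weights alpha and beta are illustrative assumptions, not the authors' published implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def intermediary_kd_loss(student_out, teacher_out, clean_target,
                             student_feat, aux_feat, proj,
                             alpha=0.5, beta=0.5):
        # Task loss: student output vs. clean speech target (assumed L1).
        task = F.l1_loss(student_out, clean_target)
        # Output-level distillation from the primary (CMGAN-like) teacher.
        out_kd = F.mse_loss(student_out, teacher_out)
        # Feature-level alignment with the auxiliary teacher's intermediate
        # representation; proj (a 1x1 conv) maps the student's channel
        # dimension onto the auxiliary teacher's.
        feat_kd = F.mse_loss(proj(student_feat), aux_feat)
        return task + alpha * out_kd + beta * feat_kd

    # Dummy shapes for a spectrogram-domain setup (hypothetical sizes):
    B, Fr, T = 2, 257, 100
    student_out = torch.randn(B, 1, Fr, T)
    teacher_out = torch.randn(B, 1, Fr, T)
    clean = torch.randn(B, 1, Fr, T)
    student_feat = torch.randn(B, 64, Fr // 4, T // 4)   # student bottleneck
    aux_feat = torch.randn(B, 128, Fr // 4, T // 4)      # auxiliary teacher
    proj = nn.Conv2d(64, 128, kernel_size=1)

    loss = intermediary_kd_loss(student_out, teacher_out, clean,
                                student_feat, aux_feat, proj)
    print(loss.item())

In practice the teacher and auxiliary-teacher outputs would be computed under torch.no_grad(), and the projection layer trained jointly with the student; the weights alpha and beta would be tuned on a validation set.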