CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion
0. Contents
- Abstract
- Comparison -- Converted samples for the proposed CoDiff-VC and comparison systems on the zero-shot VC task.
- Ablation -- Converted samples for ablation systems on the zero-shot VC task.
1. Abstract
Zero-shot voice conversion (VC) aims to convert the source speaker's timbre to that of an arbitrary target speaker while keeping the linguistic content unchanged. Current mainstream zero-shot VC approaches rely on pre-trained recognition models to disentangle linguistic content from speaker representation, which leaves residual timbre in the disentangled linguistic content and limits speaker representation modeling. In this study, we propose CoDiff-VC, an end-to-end zero-shot voice conversion framework that combines a speech codec with a diffusion model. Specifically, we use the speech codec to extract speech tokens from the source waveform and a diffusion model to reconstruct high-quality speech from those tokens. In addition, we introduce Mix-Style layer normalization (MSLN) to disentangle speaker timbre from the speech tokens, and we incorporate multi-scale timbre modeling into the diffusion model to improve speaker similarity. Finally, we propose dual classifier-free guidance to further improve both speech quality and speaker similarity. Objective and subjective experiments demonstrate that CoDiff-VC significantly improves speaker similarity and generates more natural, higher-quality speech.
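To make the MSLN idea concrete, below is a minimal PyTorch sketch of one plausible Mix-Style layer normalization block, not the paper's exact implementation. It assumes the common MixStyle recipe: layer-normalize the content features, then re-style them with scale/shift vectors predicted from a speaker embedding that is randomly mixed across the batch during training, so the content branch cannot carry a consistent timbre cue. The module name, the `spk` conditioning input, and the Beta-mixing coefficient are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MixStyleLayerNorm(nn.Module):
    """Hedged sketch of Mix-Style layer normalization (MSLN).

    Assumption: timbre is removed from content features by normalizing
    them and re-styling with statistics from a *mixed* speaker embedding.
    """

    def __init__(self, dim: int, spk_dim: int, alpha: float = 0.2):
        super().__init__()
        # Normalize without learned affine; style supplies scale/shift.
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale = nn.Linear(spk_dim, dim)
        self.to_shift = nn.Linear(spk_dim, dim)
        # Beta distribution controls how strongly styles are mixed.
        self.beta = torch.distributions.Beta(alpha, alpha)

    def forward(self, x: torch.Tensor, spk: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim) content features; spk: (B, spk_dim) speaker embedding.
        if self.training:
            # Mix each speaker embedding with a randomly permuted one
            # from the same batch (MixStyle-like perturbation).
            lam = self.beta.sample((x.size(0), 1)).to(x.device)
            perm = torch.randperm(x.size(0), device=x.device)
            spk = lam * spk + (1.0 - lam) * spk[perm]
        h = self.norm(x)
        scale = (1.0 + self.to_scale(spk)).unsqueeze(1)  # (B, 1, dim)
        shift = self.to_shift(spk).unsqueeze(1)          # (B, 1, dim)
        return h * scale + shift
```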

Fig. 1. Overall architecture of the proposed CoDiff-VC.
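The dual classifier-free guidance mentioned in the abstract can also be sketched. The snippet below shows one standard way to compose guidance over two conditions (content tokens and speaker timbre) with separate weights; the exact formulation in CoDiff-VC may differ. `model(x_t, t, content, spk)` is a hypothetical denoiser that accepts `None` for a condition dropped during CFG training, and `w_c`, `w_s` are assumed guidance weights.

```python
import torch

@torch.no_grad()
def dual_cfg_denoise(model, x_t, t, content, spk,
                     w_c: float = 1.5, w_s: float = 2.0) -> torch.Tensor:
    """Hedged sketch of dual classifier-free guidance at one sampling step.

    Composes an unconditional prediction with two guidance terms:
    one for the content condition, one for the speaker condition.
    """
    eps_full = model(x_t, t, content, spk)     # both conditions present
    eps_no_spk = model(x_t, t, content, None)  # speaker condition dropped
    eps_none = model(x_t, t, None, None)       # fully unconditional
    # Unconditional base + content guidance + speaker guidance.
    return (eps_none
            + w_c * (eps_no_spk - eps_none)
            + w_s * (eps_full - eps_no_spk))
```

With this factorization, `w_c` and `w_s` can be tuned independently, trading off intelligibility (content adherence) against speaker similarity.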
2. Comparison -- Converted samples for the proposed CoDiff-VC and comparison systems on the zero-shot VC task.
target speaker | source speech | YourTTS [1] | SEF-VC [2] | LM-VC [3] | Diff-VC [4] | CoDiff-VC |
---|---|---|---|---|---|---|
3. Ablation -- Converted samples for ablation systems on the zero-shot VC task.
target speaker | source speech | CoDiff-VC | -MSLN | -fine-grained | -dual CFG |
---|---|---|---|---|---|
4. References
[1] E. Casanova, J. Weber, C. D. Shulby, A. C. Junior, E. Gölge, and M. A. Ponti, "YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone," in Proc. ICML, 2022.
[2] J. Li, Y. Guo, X. Chen, and K. Yu, "SEF-VC: Speaker embedding free zero-shot voice conversion with cross attention," arXiv preprint arXiv:2312.08676, 2023.
[3] Z. Wang, Y.-J. Chen, L. Xie, Q. Tian, and Y. Wang, "LM-VC: Zero-shot voice conversion via speech generation based on language models," IEEE Signal Processing Letters, pp. 1157–1161, 2023.
[4] V. Popov, I. Vovk, V. Gogoryan, T. Sadekova, M. Kudinov, and J. Wei, "Diffusion-based voice conversion with fast maximum likelihood sampling scheme," in Proc. ICLR, 2022.