CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion

0. Contents

  1. Abstract
  2. Comparison -- Converted samples for the proposed CoDiff-VC and comparison systems on the zero-shot VC task.
  3. Ablation -- Converted samples for ablation systems on the zero-shot VC task.


1. Abstract

Zero-shot voice conversion (VC) aims to convert the source speaker's timbre to that of any target speaker while keeping the linguistic content unchanged. Current mainstream zero-shot VC approaches depend on pre-trained recognition models to disentangle linguistic content from speaker representation, which leaves residual timbre in the decoupled linguistic content and limits speaker representation modeling. In this study, we propose CoDiff-VC, an end-to-end zero-shot voice conversion framework that combines a speech codec with a diffusion model. Specifically, we use the speech codec to extract speech tokens from the source speech and a diffusion model to reconstruct high-quality speech from those tokens. In addition, we introduce Mix-Style layer normalization (MSLN) to disentangle speaker timbre from the speech tokens and incorporate multi-scale timbre modeling into the diffusion model to improve speaker similarity. Finally, we propose dual classifier-free guidance to further improve both speech quality and speaker similarity. Objective and subjective experiments demonstrate that CoDiff-VC significantly improves speaker similarity and generates more natural, higher-quality speech.
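The abstract only names Mix-Style layer normalization; as a rough illustration, below is a minimal PyTorch sketch of one plausible MSLN variant, assuming a conditional LayerNorm whose speaker-dependent scale and shift are mixed across the batch during training so the content branch cannot rely on any single speaker's statistics. The class name, dimensions, and Beta-mixing rule are illustrative assumptions, not CoDiff-VC's actual implementation.

    import torch
    import torch.nn as nn

    class MixStyleLayerNorm(nn.Module):
        """Hypothetical sketch of Mix-Style layer normalization (MSLN).

        Normalizes content features, then applies a scale/shift predicted
        from a speaker embedding. During training, the scale/shift are mixed
        with those of a shuffled batch (mixing weight drawn from a Beta
        distribution), perturbing speaker statistics in the content path.
        """

        def __init__(self, hidden_dim: int, spk_dim: int, alpha: float = 0.1):
            super().__init__()
            self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
            self.to_scale = nn.Linear(spk_dim, hidden_dim)
            self.to_shift = nn.Linear(spk_dim, hidden_dim)
            self.beta = torch.distributions.Beta(alpha, alpha)

        def forward(self, x: torch.Tensor, spk: torch.Tensor) -> torch.Tensor:
            # x: (batch, time, hidden_dim); spk: (batch, spk_dim)
            gamma = self.to_scale(spk).unsqueeze(1)  # (batch, 1, hidden_dim)
            beta = self.to_shift(spk).unsqueeze(1)
            if self.training:
                # Mix speaker-dependent statistics across the batch.
                lam = self.beta.sample((x.size(0), 1, 1)).to(x.device)
                perm = torch.randperm(x.size(0), device=x.device)
                gamma = lam * gamma + (1 - lam) * gamma[perm]
                beta = lam * beta + (1 - lam) * beta[perm]
            return gamma * self.norm(x) + beta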


Fig. 1. Overall architecture of the proposed CoDiff-VC.
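For readers unfamiliar with dual classifier-free guidance, the sketch below shows one common way to steer a diffusion model with two separate guidance scales, one for content and one for timbre, assuming the model was trained with both conditions randomly dropped. The function signature, the dropout convention (None = unconditional), and the compositional guidance rule are assumptions; CoDiff-VC's exact formulation may differ.

    import torch

    @torch.no_grad()
    def dual_cfg_eps(model, x_t, t, content, speaker,
                     w_content: float = 1.0, w_speaker: float = 2.0):
        """Hypothetical dual classifier-free guidance step.

        Assumes `model(x_t, t, content, speaker)` predicts diffusion noise
        and was trained with each condition independently dropped.
        """
        eps_uncond = model(x_t, t, None, None)        # no conditions
        eps_content = model(x_t, t, content, None)    # content only
        eps_full = model(x_t, t, content, speaker)    # content + timbre
        return (eps_uncond
                + w_content * (eps_content - eps_uncond)  # toward content
                + w_speaker * (eps_full - eps_content))   # toward timbre

Intuitively, raising w_speaker pushes the sample toward the target speaker's timbre while w_content preserves the linguistic content, mirroring the quality/similarity trade-off the abstract motivates.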



2. Comparison -- Converted samples for the proposed CoDiff-VC and comparison systems on the zero-shot VC task.

Target speaker | Source speech | YourTTS [1] | SEF-VC [2] | LM-VC [3] | Diff-VC [4] | CoDiff-VC


3. Ablation -- Converted samples for ablation systems on the zero-shot VC task.

Target speaker | Source speech | CoDiff-VC | -MSLN | -fine-grained | -dual CFG



[1] E. Casanova, J. Weber, C. D. Shulby, A. C. Junior, E. Gölge, and M. A. Ponti, “YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone,” in Proc. ICML, 2022.
[2] J. Li, Y. Guo, X. Chen, and K. Yu, “SEF-VC: Speaker embedding free zero-shot voice conversion with cross attention,” arXiv preprint arXiv:2312.08676, 2023.
[3] Z. Wang, Y.-J. Chen, L. Xie, Q. Tian, and Y. Wang, “LM-VC: Zero-shot voice conversion via speech generation based on language models,” IEEE Signal Processing Letters, pp. 1157–1161, 2023.
[4] V. Popov, I. Vovk, V. Gogoryan, T. Sadekova, M. Kudinov, and J. Wei, “Diffusion-based voice conversion with fast maximum likelihood sampling scheme,” in Proc. ICLR, 2022.