Non-parallel and many-to-many voice conversion using variational autoencoders integrating speech recognition and speaker verification

Citations: 2
Mendeley readers: 8

Abstract

We propose non-parallel and many-to-many voice conversion (VC) using variational autoencoders (VAEs) that constructs VC models for converting arbitrary speakers' characteristics into those of other arbitrary speakers without parallel speech corpora for training the models. Although VAEs conditioned by one-hot coded speaker codes can achieve non-parallel VC, the phonetic contents of the converted speech tend to vanish, resulting in degraded speech quality. Another issue is that they cannot deal with unseen speakers not included in training corpora. To overcome these issues, we incorporate deep-neural-network-based automatic speech recognition (ASR) and automatic speaker verification (ASV) into the VAE-based VC. Since phonetic contents are given as phonetic posteriorgrams predicted from the ASR models, the proposed VC can overcome the quality degradation. Our VC utilizes d-vectors extracted from the ASV models as continuous speaker representations that can deal with unseen speakers. Experimental results demonstrate that our VC outperforms the conventional VAE-based VC in terms of mel-cepstral distortion and converted speech quality. We also investigate the effects of hyperparameters in our VC and reveal that 1) a large d-vector dimensionality that gives better ASV performance does not necessarily improve converted speech quality, and 2) a large number of pre-stored speakers improves the quality.
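The conditioning scheme the abstract describes can be sketched in a few lines: the VAE encoder maps source mel-cepstra to a latent code, and the decoder is conditioned on frame-level phonetic posteriorgrams (content, from ASR) and an utterance-level d-vector (target speaker, from ASV). The sketch below is a minimal numpy forward pass with illustrative, assumed dimensionalities and untrained random weights; it shows only the data flow, not the paper's actual architecture or training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensionalities (assumptions, not taken from the paper):
D_MCEP = 40   # mel-cepstral coefficients per frame
D_PPG = 144   # phonetic posteriorgram size (ASR phone/senone classes)
D_DVEC = 64   # d-vector size from the ASV model
D_Z = 16      # VAE latent size

def linear(x, w, b):
    return x @ w + b

# Random weights stand in for trained parameters.
W_enc = rng.standard_normal((D_MCEP, 2 * D_Z)) * 0.01
b_enc = np.zeros(2 * D_Z)
W_dec = rng.standard_normal((D_Z + D_PPG + D_DVEC, D_MCEP)) * 0.01
b_dec = np.zeros(D_MCEP)

def encode(mcep):
    """Encoder maps source mel-cepstra to latent mean and log-variance."""
    h = linear(mcep, W_enc, b_enc)
    return h[:, :D_Z], h[:, D_Z:]

def reparameterize(mu, logvar):
    """Standard VAE reparameterization trick: z = mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z, ppg, dvec):
    """Decoder conditioned on PPG (content) and d-vector (target speaker)."""
    dvec_tiled = np.broadcast_to(dvec, (z.shape[0], D_DVEC))  # tile per frame
    cond = np.concatenate([z, ppg, dvec_tiled], axis=1)
    return linear(cond, W_dec, b_dec)

# One utterance: 100 frames of source mel-cepstra with their PPGs,
# converted toward a target speaker's d-vector.
T = 100
src_mcep = rng.standard_normal((T, D_MCEP))
src_ppg = rng.random((T, D_PPG))
src_ppg /= src_ppg.sum(axis=1, keepdims=True)  # rows sum to 1 (posteriors)
tgt_dvec = rng.standard_normal(D_DVEC)

mu, logvar = encode(src_mcep)
z = reparameterize(mu, logvar)
converted = decode(z, src_ppg, tgt_dvec)
print(converted.shape)  # (100, 40): converted mel-cepstra, one row per frame
```

Because the d-vector is a continuous embedding rather than a one-hot speaker code, any speaker with an enrollable d-vector can serve as the target, which is what enables conversion to speakers unseen during VAE training.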

Citation (APA)
Saito, Y., Nakamura, T., Ijima, Y., Nishida, K., & Takamichi, S. (2021). Non-parallel and many-to-many voice conversion using variational autoencoders integrating speech recognition and speaker verification. Acoustical Science and Technology, 42(1), 1–11. https://doi.org/10.1250/AST.42.1

Readers' Seniority

PhD / Post grad / Masters / Doc: 2 (40%)
Researcher: 2 (40%)
Professor / Associate Prof.: 1 (20%)

Readers' Discipline

Computer Science: 2 (50%)
Agricultural and Biological Sciences: 1 (25%)
Social Sciences: 1 (25%)
