Non-parallel and many-to-many voice conversion using variational autoencoders integrating speech recognition and speaker verification

Citations: 2
Mendeley readers: 8

Abstract

We propose non-parallel and many-to-many voice conversion (VC) using variational autoencoders (VAEs) that constructs VC models for converting arbitrary speakers' characteristics into those of other arbitrary speakers without parallel speech corpora for training the models. Although VAEs conditioned by one-hot coded speaker codes can achieve non-parallel VC, the phonetic contents of the converted speech tend to vanish, resulting in degraded speech quality. Another issue is that they cannot deal with unseen speakers not included in training corpora. To overcome these issues, we incorporate deep-neural-network-based automatic speech recognition (ASR) and automatic speaker verification (ASV) into the VAE-based VC. Since phonetic contents are given as phonetic posteriorgrams predicted from the ASR models, the proposed VC can overcome the quality degradation. Our VC utilizes d-vectors extracted from the ASV models as continuous speaker representations that can deal with unseen speakers. Experimental results demonstrate that our VC outperforms the conventional VAE-based VC in terms of mel-cepstral distortion and converted speech quality. We also investigate the effects of hyperparameters in our VC and reveal that 1) a large d-vector dimensionality that gives better ASV performance does not necessarily improve converted speech quality, and 2) a large number of pre-stored speakers improves the quality.
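The conditioning scheme the abstract describes can be sketched in a few lines: the VAE encoder maps source mel-cepstra to a latent code, and the decoder is conditioned on frame-level phonetic posteriorgrams (content, from ASR) and an utterance-level d-vector (target speaker, from ASV). The sketch below is a minimal numpy forward pass with illustrative, assumed dimensionalities and untrained random weights; it shows only the data flow, not the paper's actual architecture or training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensionalities (assumptions, not taken from the paper):
D_MCEP = 40   # mel-cepstral coefficients per frame
D_PPG = 144   # phonetic posteriorgram size (ASR phone/senone classes)
D_DVEC = 64   # d-vector size from the ASV model
D_Z = 16      # VAE latent size

def linear(x, w, b):
    return x @ w + b

# Random weights stand in for trained parameters.
W_enc = rng.standard_normal((D_MCEP, 2 * D_Z)) * 0.01
b_enc = np.zeros(2 * D_Z)
W_dec = rng.standard_normal((D_Z + D_PPG + D_DVEC, D_MCEP)) * 0.01
b_dec = np.zeros(D_MCEP)

def encode(mcep):
    """Encoder maps source mel-cepstra to latent mean and log-variance."""
    h = linear(mcep, W_enc, b_enc)
    return h[:, :D_Z], h[:, D_Z:]

def reparameterize(mu, logvar):
    """Standard VAE reparameterization trick: z = mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z, ppg, dvec):
    """Decoder conditioned on PPG (content) and d-vector (target speaker)."""
    dvec_tiled = np.broadcast_to(dvec, (z.shape[0], D_DVEC))  # tile per frame
    cond = np.concatenate([z, ppg, dvec_tiled], axis=1)
    return linear(cond, W_dec, b_dec)

# One utterance: 100 frames of source mel-cepstra with their PPGs,
# converted toward a target speaker's d-vector.
T = 100
src_mcep = rng.standard_normal((T, D_MCEP))
src_ppg = rng.random((T, D_PPG))
src_ppg /= src_ppg.sum(axis=1, keepdims=True)  # rows sum to 1 (posteriors)
tgt_dvec = rng.standard_normal(D_DVEC)

mu, logvar = encode(src_mcep)
z = reparameterize(mu, logvar)
converted = decode(z, src_ppg, tgt_dvec)
print(converted.shape)  # (100, 40): converted mel-cepstra, one row per frame
```

Because the d-vector is a continuous embedding rather than a one-hot speaker code, any speaker with an enrollable d-vector can serve as the target, which is what enables conversion to speakers unseen during VAE training.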

Citation (APA)
Saito, Y., Nakamura, T., Ijima, Y., Nishida, K., & Takamichi, S. (2021). Non-parallel and many-to-many voice conversion using variational autoencoders integrating speech recognition and speaker verification. Acoustical Science and Technology, 42(1), 1–11. https://doi.org/10.1250/AST.42.1

Readers' Seniority

PhD / Post grad / Masters / Doc: 2 (40%)
Researcher: 2 (40%)
Professor / Associate Prof.: 1 (20%)

Readers' Discipline

Computer Science: 2 (50%)
Agricultural and Biological Sciences: 1 (25%)
Social Sciences: 1 (25%)
