Streaming End-to-End Target-Speaker Automatic Speech Recognition and Activity Detection

Takafumi Moriya; Hiroshi Sato; Tsubasa Ochiai; Marc Delcroix; Takahiro Shinozaki

Journal Article, Open Access

IEEE Access (2023) 11 13906-13917

DOI: 10.1109/ACCESS.2023.3243690

Abstract

Automatic speech recognition of a target speaker in the presence of interfering speakers remains challenging. One approach to this problem is target-speaker speech recognition, which conditions the recognition process on an embedding that characterizes the target speaker's voice. This makes it possible to recognize only the target speaker's speech while ignoring interfering speech. In this work, we propose an end-to-end target-speaker speech recognition system based on a neural transducer architecture to allow streaming and on-device recognition. Moreover, a target-speaker speech recognition system should detect when the target speaker is inactive and output nothing in that case. We introduce training and decoding schemes that enable target-speaker activity detection within the proposed recognition system. We confirm experimentally that the proposed end-to-end system performs competitively with conventional cascades of a target speech extraction module and a recognition module, while reducing computation cost and allowing streaming decoding.
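
The two ideas in the abstract, conditioning recognition on a speaker embedding and outputting nothing when the target speaker is inactive, can be illustrated with a toy sketch. The abstract does not say how the paper implements either step, so everything below is an assumption made for illustration: the element-wise scaling used for conditioning, the mean-probability threshold used as an activity gate, and the hypothetical names `condition_on_speaker` and `gate_hypothesis`.

```python
from typing import List

Vector = List[float]


def condition_on_speaker(frames: List[Vector], speaker_embedding: Vector) -> List[Vector]:
    """Scale each acoustic frame element-wise by the speaker embedding.

    Assumption: multiplicative conditioning is used here only as a simple
    stand-in; the abstract does not describe the paper's actual mechanism.
    """
    dim = len(speaker_embedding)
    for frame in frames:
        if len(frame) != dim:
            raise ValueError("frame and embedding dimensions must match")
    return [[x * w for x, w in zip(frame, speaker_embedding)] for frame in frames]


def gate_hypothesis(tokens: List[str], activity_probs: List[float],
                    threshold: float = 0.5) -> List[str]:
    """Return the recognized tokens only if the target speaker looks active.

    Assumption: activity is judged by the mean of per-frame activity
    probabilities against a fixed threshold (a hypothetical decision rule).
    """
    if not activity_probs:
        return []
    mean_activity = sum(activity_probs) / len(activity_probs)
    return tokens if mean_activity >= threshold else []


if __name__ == "__main__":
    conditioned = condition_on_speaker([[1.0, 2.0], [3.0, 4.0]], [0.5, 2.0])
    print(conditioned)
    print(gate_hypothesis(["hello"], [0.9, 0.8]))  # target speaker active
    print(gate_hypothesis(["hello"], [0.1, 0.2]))  # target speaker inactive
```

In a real streaming system both steps would operate frame by frame inside the recognizer rather than on a finished utterance; the sketch only shows the data flow.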

Cite (APA)

Moriya, T., Sato, H., Ochiai, T., Delcroix, M., & Shinozaki, T. (2023). Streaming End-to-End Target-Speaker Automatic Speech Recognition and Activity Detection. IEEE Access, 11, 13906–13917. https://doi.org/10.1109/ACCESS.2023.3243690
