Streaming End-to-End Target-Speaker Automatic Speech Recognition and Activity Detection

Abstract

Automatic speech recognition of a target speaker in the presence of interfering speakers remains a challenging problem. One approach is target-speaker speech recognition, which conditions the recognition process on an embedding that characterizes the voice of the target speaker. This enables recognizing only the speech of the target speaker while ignoring interfering speech. In this work, we propose an end-to-end target-speaker speech recognition system based on a neural transducer architecture to allow streaming and on-device recognition. Moreover, a target-speaker speech recognition system should be able to detect when the target speaker is inactive and output nothing in such a case. We introduce training and decoding schemes to allow target-speaker activity detection within our proposed recognition system. We confirm experimentally that our proposed end-to-end system performs competitively with conventional cascade approaches that combine a target speech extraction module with a recognition module, while reducing computation costs and allowing streaming decoding.
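
The abstract describes two ideas that a toy example can make concrete: conditioning recognition on a target-speaker embedding, and outputting nothing when the target speaker is inactive. The sketch below is only an illustration under stated assumptions, not the authors' system. The fusion rule (element-wise multiplication), the blank token id, the `step` callback that stands in for a trained transducer, and all function names are hypothetical; the abstract does not specify any of them.

```python
from typing import Callable, List, Sequence

BLANK = 0  # hypothetical id of the transducer blank symbol


def condition_on_speaker(frame: Sequence[float], spk_emb: Sequence[float]) -> List[float]:
    """Scale each feature dimension by the speaker embedding.

    Element-wise multiplication is an assumed fusion rule, used here only
    to show what "conditioning on an embedding" can look like.
    """
    if len(frame) != len(spk_emb):
        raise ValueError("frame and embedding must have the same dimension")
    return [x * e for x, e in zip(frame, spk_emb)]


def greedy_stream_decode(
    frames: Sequence[Sequence[float]],
    spk_emb: Sequence[float],
    step: Callable[[List[float], List[int]], int],
    max_symbols_per_frame: int = 3,
) -> List[int]:
    """Frame-by-frame greedy decoding in the style of a neural transducer.

    `step` is a stand-in for the trained model: given the conditioned frame
    and the tokens emitted so far, it returns the next token id or BLANK.
    Frames are consumed one at a time, so decoding can run in streaming mode.
    """
    tokens: List[int] = []
    for frame in frames:
        cond = condition_on_speaker(frame, spk_emb)
        for _ in range(max_symbols_per_frame):
            tok = step(cond, tokens)
            if tok == BLANK:
                break  # move on to the next frame
            tokens.append(tok)
    return tokens


def target_is_active(tokens: Sequence[int]) -> bool:
    """Treat an empty hypothesis as 'target speaker inactive' (an assumption)."""
    return len(tokens) > 0


def toy_step(cond: List[float], prev: List[int]) -> int:
    """Toy model: emit token 1 whenever the conditioned frame has enough energy."""
    return 1 if sum(cond) > 1.0 else BLANK


frames = [[1.0, 1.0], [0.2, 0.1], [2.0, 0.5]]
target_emb = [1.0, 1.0]  # embedding that "matches" the audio in this toy setup
other_emb = [0.0, 0.0]   # embedding that suppresses every frame

hyp_target = greedy_stream_decode(frames, target_emb, toy_step, max_symbols_per_frame=1)
hyp_other = greedy_stream_decode(frames, other_emb, toy_step, max_symbols_per_frame=1)
```

With the matching embedding the toy decoder emits tokens for the two high-energy frames, and with the non-matching embedding it emits only blanks, so the hypothesis is empty and `target_is_active` returns `False`. A real system would learn both the conditioning and the inactive-speaker behavior during training.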

Cite

Moriya, T., Sato, H., Ochiai, T., Delcroix, M., & Shinozaki, T. (2023). Streaming End-to-End Target-Speaker Automatic Speech Recognition and Activity Detection. IEEE Access, 11, 13906–13917. https://doi.org/10.1109/ACCESS.2023.3243690
