The main objective of an audio deepfake detection system is to identify artifacts in the input speech introduced by the speech synthesis or voice conversion process. A recent trend in deepfake detection is to employ deep learning architectures in an end-to-end fashion to discriminate between bona fide and spoofed speech signals. In deep learning, activation functions play an important role in deciding whether a neuron's input is relevant to the prediction or classification. In this work, we propose to employ a Multiple Parametric Exponential Linear Unit (MPELU) activation function with the Residual Network (ResNet) architecture. The aim of the MPELU activation function is to generalize and unify the rectified and exponential linear units. Furthermore, we adopt an Attention Rectified Linear Unit (AReLU), which adds an element-wise, sign-based attention mechanism to a ReLU module and thereby enhances positive elements and suppresses negative ones in a data-adaptive manner. The proposed frameworks were evaluated on the logical access (LA) task of the ASVspoof 2019 dataset and outperformed systems using standard non-learnable and learnable activation functions.
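To make the two activation functions concrete, the following is a minimal scalar sketch in plain Python. The abstract does not state the formulas, so the definitions here are assumptions based on the commonly cited formulations of MPELU and AReLU. The parameter names `alpha` and `beta` and their default values are illustrative; in a trained network they would be learnable (typically per channel or per layer) rather than fixed floats.

```python
import math


def mpelu(x, alpha=1.0, beta=1.0):
    """MPELU on a scalar (assumed form).

    Positive inputs pass through unchanged; non-positive inputs are
    mapped to alpha * (exp(beta * x) - 1). In a real model alpha and
    beta would be learned, here they are plain floats.
    """
    if x > 0:
        return x
    return alpha * (math.exp(beta * x) - 1.0)


def arelu(x, alpha=0.9, beta=2.0):
    """AReLU on a scalar (assumed form).

    Negative inputs are scaled by alpha clamped to [0.01, 0.99]
    (suppression); non-negative inputs are scaled by 1 + sigmoid(beta)
    (enhancement).
    """
    clamped_alpha = min(max(alpha, 0.01), 0.99)
    positive_gain = 1.0 + 1.0 / (1.0 + math.exp(-beta))
    if x >= 0:
        return positive_gain * x
    return clamped_alpha * x


if __name__ == "__main__":
    for value in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(value, mpelu(value), arelu(value))
```

Under these assumed definitions, setting `alpha=0.0` in `mpelu` reduces it to ReLU, and `alpha=1.0, beta=1.0` reduces it to ELU, which is the sense in which MPELU unifies the rectified and exponential linear units.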
CITATION STYLE
Alam, M. S., Fathan, A., & Alam, J. (2023). Audio DeepFake Detection Employing Multiple Parametric Exponential Linear Units. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 14339 LNAI, pp. 307–321). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-48312-7_25