Voice and face recognition systems are becoming omnipresent, and the need for secure biometric technologies grows as deepfake techniques make fabricated content increasingly hard to spot. To improve current audio spoofing detection, we propose a curated selection of wavelet-transform-based models in which, instead of the widely employed acoustic features, Mel-spectrogram image features are decomposed through multiresolution analysis to better capture spectral information. To this end, we adopt median-filtering harmonic percussive source separation (HPSS) and conduct a large-scale study applying several recent state-of-the-art computer vision models to audio anti-spoofing. These wavelet transforms prove experimentally useful, yielding a notable performance of 4.8% EER on the ASVspoof2019 challenge logical access (LA) evaluation set. Finally, a more adversarially robust WaveletCNN-based model is proposed.
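A minimal sketch of the two front-end ideas the abstract names: median-filtering HPSS applied to a (Mel-)spectrogram, followed by one level of 2D wavelet multiresolution decomposition. This is an illustration only, not the paper's configuration: the median-filter kernel size, the soft-mask formulation, and the Haar wavelet are assumed choices, and the input is taken to be an already-computed magnitude spectrogram.

```python
import numpy as np
from scipy.signal import medfilt2d


def hpss_masks(S, kernel=17):
    """Median-filtering HPSS on a magnitude spectrogram S (freq x time).

    Smoothing along time enhances harmonic content; smoothing along
    frequency enhances percussive content. Soft masks split S into the
    two components. Kernel size 17 is an illustrative choice.
    """
    H = medfilt2d(S, kernel_size=(1, kernel))   # time-axis median filter
    P = medfilt2d(S, kernel_size=(kernel, 1))   # frequency-axis median filter
    mask_h = H / (H + P + 1e-10)                # soft harmonic mask
    return S * mask_h, S * (1.0 - mask_h)       # harmonic, percussive parts


def haar_dwt2(X):
    """One level of 2D Haar wavelet decomposition into LL/LH/HL/HH subbands,
    the kind of multiresolution split fed to a downstream classifier."""
    X = X[: X.shape[0] // 2 * 2, : X.shape[1] // 2 * 2]  # trim to even size
    a = (X[0::2] + X[1::2]) / 2.0   # row-wise low-pass
    d = (X[0::2] - X[1::2]) / 2.0   # row-wise high-pass
    LL = (a[:, 0::2] + a[:, 1::2]) / 2.0  # approximation subband
    LH = (a[:, 0::2] - a[:, 1::2]) / 2.0  # horizontal detail
    HL = (d[:, 0::2] + d[:, 1::2]) / 2.0  # vertical detail
    HH = (d[:, 0::2] - d[:, 1::2]) / 2.0  # diagonal detail
    return LL, LH, HL, HH


# Usage on a dummy 64-band Mel-spectrogram with 100 frames.
S = np.abs(np.random.RandomState(0).randn(64, 100))
S_harm, S_perc = hpss_masks(S)
LL, LH, HL, HH = haar_dwt2(S_harm)  # each subband is 32 x 50
```

In practice a library such as PyWavelets (`pywt.wavedec2`) would provide multilevel decompositions with other wavelet families; the explicit Haar arithmetic above is kept only to show what one decomposition level computes.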
Fathan, A., Alam, J., & Kang, W. (2022). Multiresolution Decomposition Analysis via Wavelet Transforms for Audio Deepfake Detection. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13721 LNAI, pp. 188–200). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-20980-2_17