
Deep learning and voice comparison: phonetically-motivated vs. automatically-learned features

Abstract: Broadband spectrograms of the French vowels /Ã/, /a/, /E/, /e/, /i/, /@/, and /O/, extracted from radio broadcast corpora, were used to recognize 45 speakers with a deep convolutional neural network (CNN). The same network was also trained on 62 phonetic parameters in order to (i) determine whether the resulting confusions matched those made by the CNN trained on spectrograms, and (ii) identify which acoustic parameters the network relied on. The two networks produced identical discrimination results 68% of the time. In 22% of the data, the network trained on spectrograms discriminated successfully while the network trained on phonetic parameters failed; the reverse held in 10% of the data. We present the relevant phonetic parameters, both as raw values and as values relative to each speaker's mean, and highlight cases that led to poor discrimination. When the network trained on spectrograms failed to discriminate between some tokens, parameters related to f0 proved significant.
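To make the setup concrete, here is a minimal sketch of a spectrogram-to-speaker CNN classifier in PyTorch. The architecture (layer counts, kernel sizes, input dimensions) is an assumption for illustration, not the authors' actual network; only the task shape — spectrogram patches in, 45 speaker classes out — comes from the abstract.

```python
# Hypothetical sketch: a small CNN mapping vowel spectrogram patches to
# one of 45 speaker classes. Architecture details are assumptions, not
# the network described in the paper.
import torch
import torch.nn as nn

class SpeakerCNN(nn.Module):
    def __init__(self, n_speakers: int = 45):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),  # fixed-size output regardless of input size
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_speakers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, freq_bins, time_frames) spectrogram patches
        h = self.features(x)
        return self.classifier(h.flatten(1))

model = SpeakerCNN()
# A batch of 8 random "spectrograms" stands in for real vowel tokens.
logits = model(torch.randn(8, 1, 128, 64))
print(logits.shape)  # one score per speaker: torch.Size([8, 45])
```

A parallel network for the 62 phonetic parameters would replace the convolutional front end with a fully connected one over a 62-dimensional input vector.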
Document type: Conference papers
Contributor: Emmanuel Ferragne
Submitted on: Monday, December 16, 2019 - 10:35:11 AM
Last modification on: Tuesday, October 19, 2021 - 2:23:30 PM
Long-term archiving on: Tuesday, March 17, 2020 - 3:36:55 PM




HAL Id: halshs-02412947, version 1


Cédric Gendrot, Emmanuel Ferragne, Thomas Pellegrini. Deep learning and voice comparison: phonetically-motivated vs. automatically-learned features. ICPhS, Aug 2019, Melbourne, Australia. ⟨halshs-02412947⟩


