AlloVera: a multilingual allophone database

We introduce a new resource, AlloVera, which provides mappings from 218 allophones to phonemes for 14 languages. Phonemes are contrastive phonological units, and allophones are their various concrete realizations, which are predictable from phonological context. While phonemic representations are language specific, phonetic representations (stated in terms of (allo)phones) are much closer to a universal (language-independent) transcription. AlloVera allows the training of speech recognition models that output phonetic transcriptions in the International Phonetic Alphabet (IPA), regardless of the input language. We show that a "universal" allophone model, Allosaurus, built with AlloVera, outperforms "universal" phonemic models and language-specific models on a speech-transcription task. We explore the implications of this technology (and related technologies) for the documentation of endangered and minority languages. We further explore other applications for which AlloVera will be suitable as it grows, including phonological typology.
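The allophone-to-phoneme mapping described above can be pictured as a small lookup table per language. The following is a minimal sketch only: the table layout, the function names, and the sample entries (textbook English and Spanish allophony) are assumptions made for illustration and are not taken from the AlloVera files or the Allosaurus code.

```python
from collections import defaultdict

# Illustrative entries only (hypothetical layout, NOT copied from AlloVera):
# language code -> phoneme -> list of allophones that realize it.
ALLOPHONES = {
    "eng": {"p": ["p", "pʰ"], "t": ["t", "tʰ", "ɾ"]},
    "spa": {"d": ["d", "ð"], "t": ["t"]},
}


def phone_to_phonemes(language):
    """Invert one language's phoneme -> allophones table into phone -> phonemes."""
    inverted = defaultdict(set)
    for phoneme, phones in ALLOPHONES[language].items():
        for phone in phones:
            inverted[phone].add(phoneme)
    return dict(inverted)


def universal_phone_inventory():
    """Union of the phones of all languages: a language-independent label set."""
    return sorted(
        {
            phone
            for table in ALLOPHONES.values()
            for phones in table.values()
            for phone in phones
        }
    )


def to_phonemes(language, phones):
    """Map a phonetic transcription to the candidate phonemes of one language."""
    table = phone_to_phonemes(language)
    return [sorted(table[phone]) for phone in phones]
```

For example, `to_phonemes("eng", ["pʰ", "ɾ"])` looks up each phone in the English table and returns the phonemes it can realize. The same phone inventory is shared across languages, while the phone-to-phoneme step stays language-specific.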

Keywords

Allophones, Phoneme

Domains

Linguistics; Signal and image processing [eess.SP]

Full metadata list

Deposit format	File
Deposit type	Conference paper
Title (en)	AlloVera: a multilingual allophone database
Author(s)	David R Mortensen ¹, Xinjian Li ¹, Patrick Littell ², Alexis Michaud ³, Shruti Rijhwani ¹, Antonios Anastasopoulos ¹, Alan Black ¹, Florian Metze ¹, Graham Neubig ¹
¹ CMU - Carnegie Mellon University [Pittsburgh] (67135) - 5000 Forbes Ave, Pittsburgh, PA 15213 - United States
² NRC - National Research Council of Canada (303485) - 1200 Montreal Road, Building M-58, Ottawa, Ontario K1A 0R6 - Canada
³ LACITO - Langues et civilisations à tradition orale (406905) - 7, rue Guy Môquet, 94800, VILLEJUIF - France; Université Sorbonne Nouvelle - Paris 3 UMR7107 (52995); Institut National des Langues et Civilisations Orientales UMR7107 (300064); Centre National de la Recherche Scientifique UMR7107 (441569)
Document language	English
License	Attribution - NonCommercial - ShareAlike
Popular science	No
Peer-reviewed	Yes
Invited	No
Audience	International
Proceedings	Yes
Publication date	2020
Collection title	Proceedings of LREC 2020: 12th Language Resources and Evaluation Conference
Conference title	LREC 2020: 12th Language Resources and Evaluation Conference
Conference start date	2020-05-11
Conference end date	2020-05-15
City	Marseille
Country	France
Conference or publisher URL	https://lrec2020.lrec-conf.org/
Domain(s)	Humanities and Social Sciences/Linguistics; Engineering Sciences [physics]/Signal and image processing [eess.SP]
Conference organizer	European Language Resources Association
ANR project(s)	Empirical Foundations of Linguistics : data, methods, models - EFL - ANR-10-LABX-0083 - LABX - 2010; La documentation computationnelle des langues à l'horizon 2025 - CLD2025 - ANR-19-CE38-0015 - AAPG2019 - 2019
Keywords (en)	Allophones, Phoneme

Main file

MultilingualAllophoneResourceforNearUniversalASR.pdf (266.83 KB)

allovera.zip (30.55 KB)

Origin: Files produced by the author(s)

Contributor: Alexis Michaud

https://shs.hal.science/halshs-02527046

Submitted on: Tuesday, 31 March 2020, 22:12:16

Last modified on: Tuesday, 2 April 2024, 15:48:04

Dates and versions

halshs-02527046, version 1 (31-03-2020)

License

Attribution - NonCommercial - ShareAlike - CC BY 4.0

Identifiers

HAL Id: halshs-02527046, version 1

Cite

David R Mortensen, Xinjian Li, Patrick Littell, Alexis Michaud, Shruti Rijhwani, et al. AlloVera: a multilingual allophone database. LREC 2020: 12th Language Resources and Evaluation Conference, European Language Resources Association, May 2020, Marseille, France. ⟨halshs-02527046⟩

Collections

CNRS UNIV-PARIS3 INALCO LACITO CAMPUS-AAR AAI USPC ASIES_ET_PACIFIQUE ANR

277 views

358 downloads

Last updated on 7 April 2024