Identification of Parallel Sentences in Comparable Monolingual Corpora from Different Registers

Abstract: Parallel aligned sentences provide useful information for different NLP applications. Yet this kind of data is seldom available, especially for languages other than English. We propose to exploit comparable corpora in French that are distinguished by their registers (specialized and simplified versions) to detect and align parallel sentences. These corpora are related to the biomedical area. Our purpose is to decide whether a given pair of specialized and simplified sentences is to be aligned or not. Manually created reference data show 0.76 inter-annotator agreement. We exploit a set of features and several automatic classifiers. The automatic alignment reaches up to 0.93 precision, recall and F-measure. In order to better evaluate the method, it is applied to data in English from the SemEval STS competitions. The same features and models are applied in monolingual and cross-lingual contexts, in which they show up to 0.90 and 0.73 F-measure, respectively.
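
The abstract reports precision, recall and F-measure for a binary decision (align a sentence pair or not). As a minimal sketch of how such scores are commonly computed, assuming the standard definitions for the positive "aligned" class, the following may help; the function name `precision_recall_f1` and the toy labels are illustrative assumptions, and the record does not state which averaging scheme the authors used.

```python
def precision_recall_f1(gold, predicted):
    """Score binary alignment decisions.

    gold, predicted: equal-length sequences of booleans,
    where True means "this sentence pair is aligned".
    Hypothetical helper, not taken from the paper.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    # Count outcomes for the positive ("aligned") class.
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)

    # Guard against empty denominators by returning 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy data (made up for illustration): five candidate sentence pairs.
gold = [True, True, False, False, True]
predicted = [True, False, False, True, True]
p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} F-measure={f:.2f}")
```

With these toy labels there are two true positives, one false positive and one false negative, so all three scores coincide.
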
Document type:
Conference paper
LOUHI 2018: The Ninth International Workshop on Health Text Mining and Information Analysis, Oct 2018, Bruxelles, Belgium
https://halshs.archives-ouvertes.fr/halshs-01968351
Contributor: Natalia Grabar
Submitted on: Wednesday, January 2, 2019 - 15:40:09
Last modified on: Thursday, February 7, 2019 - 15:36:47

File

cardon-LOUHI2018.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : halshs-01968351, version 1

Citation

Rémi Cardon, Natalia Grabar. Identification of Parallel Sentences in Comparable Monolingual Corpora from Different Registers. LOUHI 2018: The Ninth International Workshop on Health Text Mining and Information Analysis, Oct 2018, Bruxelles, Belgium. 〈halshs-01968351〉

Metrics

Record views

13

File downloads

24