Identification of Parallel Sentences in Comparable Monolingual Corpora from Different Registers

Rémi Cardon; Natalia Grabar

Conference paper, Year: 2018

Rémi Cardon

Role: Author
PersonId: 184596
IdHAL: remi-cardon

Natalia Grabar

Role: Author
PersonId: 6735
IdHAL: natalia-grabar
ORCID: 0000-0002-0237-4554
IdRef: 089015460

Savoirs, Textes, Langage (STL) - UMR 8163

Abstract

Parallel aligned sentences provide useful information for different NLP applications. Yet, this kind of data is seldom available, especially for languages other than English. We propose to exploit comparable corpora in French which are distinguished by their registers (specialized and simplified versions) to detect and align parallel sentences. These corpora are related to the biomedical area. Our purpose is to state whether a given pair of specialized and simplified sentences is to be aligned or not. Manually created reference data show 0.76 inter-annotator agreement. We exploit a set of features and several automatic classifiers. The automatic alignment reaches up to 0.93 Precision, Recall and F-measure. In order to better evaluate the method, it is applied to data in English from the SemEval STS competitions. The same features and models are applied in monolingual and cross-lingual contexts, in which they show up to 0.90 and 0.73 F-measure, respectively.
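
As an illustration of the pair-classification task described in the abstract, the following minimal Python sketch scores a specialized/simplified sentence pair with two simple lexical features and evaluates binary alignment decisions with precision, recall and F-measure. The feature set (Jaccard token overlap and length ratio), the threshold rule standing in for a classifier, the function names and the toy sentences are all assumptions made for illustration; the abstract does not specify which features or classifiers the paper uses.

```python
import re


def tokens(sentence):
    """Lowercase word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"\w+", sentence.lower())


def pair_features(specialized, simplified):
    """Two illustrative features for a sentence pair (hypothetical feature set)."""
    tok_a, tok_b = tokens(specialized), tokens(simplified)
    set_a, set_b = set(tok_a), set(tok_b)
    union = set_a | set_b
    # Share of distinct tokens that the two sentences have in common.
    jaccard = len(set_a & set_b) / len(union) if union else 0.0
    longest = max(len(tok_a), len(tok_b))
    # Ratio of the shorter sentence length to the longer one, in tokens.
    length_ratio = min(len(tok_a), len(tok_b)) / longest if longest else 0.0
    return {"jaccard": jaccard, "length_ratio": length_ratio}


def predict_aligned(features, threshold=0.3):
    """Stand-in decision rule (assumed threshold); a trained classifier would replace it."""
    return features["jaccard"] >= threshold


def precision_recall_f1(gold, predicted):
    """Precision, recall and F-measure for the positive ('aligned') class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy pairs invented for this sketch: (specialized, simplified, gold label).
    pairs = [
        ("Hypertension increases cardiovascular risk.",
         "High blood pressure increases the risk of heart disease.", True),
        ("The drug is administered orally.",
         "Children should sleep at least nine hours.", False),
    ]
    gold = [label for _, _, label in pairs]
    predicted = [predict_aligned(pair_features(a, b)) for a, b, _ in pairs]
    print(precision_recall_f1(gold, predicted))
```

In a realistic setting the threshold rule would be replaced by a classifier trained on manually aligned pairs, and the scores would be computed on held-out data.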

Domains

Information and communication sciences; Humanities and Social Sciences

Full metadata list

Deposit format	File
Deposit type	Conference paper
Author(s)	Rémi Cardon, Natalia Grabar (1). (1) STL - Savoirs, Textes, Langage (STL) - UMR 8163 (11909) - Domaine Universitaire du Pont de Bois - Batiment B4 rue du Barreau - BP 60149 - 59653 VILLENEUVE D'ASCQ CEDEX - France; Université de Lille (374570); Centre National de la Recherche Scientifique UMR8163 (441569)
Document language	English
Popular science	No
Peer reviewed	Yes
Invited	No
Audience	International
Proceedings	No
Conference title	LOUHI 2018: The Ninth International Workshop on Health Text Mining and Information Analysis
Conference start date	2018-10-31
City	Bruxelles
Country	Belgium
ANR project(s)	Communication, Literacy, Education, Accessibility, Readability - CLEAR - ANR-17-CE19-0016 AAPG2017 - 2017
Domain(s)	Humanities and Social Sciences/Information and communication sciences; Humanities and Social Sciences

Main file

cardon-LOUHI2018.pdf (154.84 KB)

Origin: Files produced by the author(s)

Contributor: Natalia Grabar

https://shs.hal.science/halshs-01968351

Submitted on: Wednesday, 2 January 2019 at 15:40:09

Last modified on: Wednesday, 24 January 2024 at 09:54:19

Long-term archiving on: Wednesday, 3 April 2019 at 15:49:00

Dates and versions

halshs-01968351, version 1 (02-01-2019)

Identifiers

HAL Id: halshs-01968351, version 1

Cite

Rémi Cardon, Natalia Grabar. Identification of Parallel Sentences in Comparable Monolingual Corpora from Different Registers. LOUHI 2018: The Ninth International Workshop on Health Text Mining and Information Analysis, Oct 2018, Bruxelles, Belgium. ⟨halshs-01968351⟩

Collections

CNRS STL CAMPUS-AAR AAI UNIV-LILLE ANR

99 views

80 downloads

Last updated on 13 April 2024
