Étiquetage d'un corpus hétérogène de français médiéval : enjeux et modalités

Serge Heiden; Sophie Prévost

Book chapter. Romance Corpus Linguistics - Corpora and Spoken Language, Tübingen, Gunter Narr Verlag Tübingen. Year: 2002


Serge Heiden

Role: Author
PersonId: 7692
IdHAL: serge-heiden
ORCID: 0000-0003-4682-7647
IdRef: 111293383

Interactions, Corpus, Apprentissages, Représentations

Sophie Prévost

Role: Author
PersonId: 11364
IdHAL: sprevost
ORCID: 0000-0003-3623-3482
IdRef: 059781904

Langues, textes, traitement informatique, cognition

Abstract

We have undertaken the morpho-syntactic tagging of the 2.5 million words in our corpora of medieval texts. The external and internal heterogeneity of the texts makes this a difficult task, and we therefore resorted to a two-step strategy. Since no existing tool is suited to our corpora, we first relied on a programmable tagger to categorize a first text. Second, building on the results obtained with that text, we produced a tagger based on contextual rule learning. Using this second tool, we then tagged a second text that is quite "similar" to the first in terms of external criteria. The same two-step process was then applied again to tag additional texts.
The next phase will be to evaluate the heterogeneity of the texts according to internal criteria. Correlating internal and external heterogeneity will enable us to elaborate a "fine-grained" typology of texts.

Keywords

tagging, heterogeneous corpus, medieval French

Domains

Linguistics

Full metadata list

Deposit format	File
Deposit type	Book chapter
Title	fr Étiquetage d'un corpus hétérogène de français médiéval : enjeux et modalités
Author(s)	Serge Heiden ¹ , Sophie Prévost ² 1 ICAR - Interactions, Corpus, Apprentissages, Représentations ( 51028 ) - 5, av Pierre Mendès-France 69676 BRON CEDEX - France École normale supérieure de Lyon ( 6818 ) ; Université Lumière - Lyon 2 ( 33804 ) ; INRP ( 300042 ) ; Ecole Normale Supérieure Lettres et Sciences Humaines ( 303652 ) ; Centre National de la Recherche Scientifique UMR5191 ( 441569 ) 2 LaTTice - Langues, textes, traitement informatique, cognition ( 1242 ) - 1 rue Maurice Arnoux 92120 Montrouge - France École normale supérieure - Paris ( 59704 ) ; Université Paris Sciences et Lettres ( 564132 ) ; Université Paris Diderot - Paris 7 ( 300301 ) ; Centre National de la Recherche Scientifique UMR8094 ( 441569 )
Popular science	No
Document language	French
Journal name	Romance Corpus Linguistics - Corpora and Spoken Language, Tübingen, Gunter Narr Verlag Tübingen
Book title	Romance Corpus Linguistics - Corpora and Spoken Language, Tübingen, Gunter Narr Verlag Tübingen
Audience	Not specified
Publication date	2002
Page/Identifier	p. 127-136
Conference title	Romance Corpus Linguistics - Corpora and Spoken Language, Tübingen, Gunter Narr Verlag Tübingen
Domain(s)	Humanities and Social Sciences/Linguistics
Scientific editor	C.D. Pusch and W. Raible
Keywords	fr étiquetage, corpus hétérogène, français médiéval

Main file

prevost-biblio11.pdf (144.02 KB)


https://shs.hal.science/halshs-00087995

Submitted on: Thursday, 27 July 2006 at 19:31:35

Last modified on: Friday, 19 April 2024 at 16:18:55

Long-term archiving on: Monday, 5 April 2010 at 22:28:23

Dates and versions

halshs-00087995, version 1 (27-07-2006)

Identifiers

HAL Id: halshs-00087995, version 1

Cite

Serge Heiden, Sophie Prévost. Étiquetage d'un corpus hétérogène de français médiéval : enjeux et modalités. In C.D. Pusch and W. Raible (eds.), Romance Corpus Linguistics - Corpora and Spoken Language, Tübingen, Gunter Narr Verlag Tübingen, p. 127-136, 2002. ⟨halshs-00087995⟩


Collections

ENS-LYON UNIV-PARIS7 ENS-PARIS CNRS UNIV-LYON2 ICAR CAMPUS-AAR AAI PSL UDL

179 views

638 downloads

Last updated on 20/04/2024
