Corpus and Models for Lemmatisation and POS-tagging of Classical French Theatre

This paper describes the process of building an annotated corpus and training models for classical French literature, with a focus on theatre, and particularly comedies in verse. It was originally developed as a preliminary step to the stylometric analyses presented in Cafiero and Camps [2019]. The use of a recent lemmatiser based on neural networks and a CRF tagger makes it possible to achieve accuracies beyond the current state of the art on the in-domain test, and the resulting models prove robust in out-of-domain tests, i.e. on texts up to 20th-century novels.

Keywords

17th century French, Classical theatre, Lemmatisation, POS tagging

Fields

Linguistics; Literature; Document and Text Processing; Computation and Language [cs.CL]

Full metadata list

Deposit format	File
Deposit type	Journal article
Title	en Corpus and Models for Lemmatisation and POS-tagging of Classical French Theatre
Author(s)	Jean-Baptiste Camps ¹, Simon Gabay ², Paul Fièvre ³, Thibault Clérice ¹, Florian Cafiero ⁴
	1 CJM - Centre Jean Mabillon (241921) - École nationale des chartes, 65 rue de Richelieu, 75002 Paris - France; École nationale des chartes EA 3624 (307249); Université Paris Sciences et Lettres (564132)
	2 UNINE - Université de Neuchâtel = University of Neuchatel (300663) - Av. du 1er-Mars 26, 2000 Neuchâtel - Switzerland
	3 BnF - Bibliothèque nationale de France (201530) - Quai François Mauriac, 75706 Paris Cedex 13 - France
	4 LIED (UMR_8236) - Laboratoire Interdisciplinaire des Energies de Demain (224413) - Université Paris Diderot, 11 rue Alice Domon et Léonie Duquet, Bât. Condorcet, case postale 7040, 75205 Paris cedex 13 - France; Université Paris Diderot - Paris 7 UMR_8236 (300301); Centre National de la Recherche Scientifique UMR_8236 (441569)
Document language	English
Production/writing date	2020
Licence	Attribution - ShareAlike
Journal name	JDMDH - Journal of Data Mining and Digital Humanities (electronic ISSN: 2416-5999), published by INRIA, http://jdmdh.episciences.org/
Popular science	No
Peer-reviewed	Yes
Audience	International
Publication date	2021
Field(s)	Humanities and Social Sciences/Linguistics; Humanities and Social Sciences/Literature; Computer Science [cs]/Document and Text Processing; Computer Science [cs]/Computation and Language [cs.CL]
ACM 2012 classification	acm2012.ACM2012/Computing methodologies/Artificial intelligence/Natural language processing
See also	https://advances.sciencemag.org/content/5/11/eaax5489/
Funding	DIM Science du texte et connaissances nouvelles
Keywords	en 17th century French, Classical theatre, Lemmatisation, POS tagging
arXiv Id	submit/3179193
DOI	10.46298/jdmdh.6485

Main file

Corpus_models_Classical_French_v2.pdf (687.39 KB)

Origin: Files produced by the author(s)

Contributor: Jean-Baptiste Camps

https://shs.hal.science/halshs-02591388

Submitted on: Friday, 5 February 2021 at 16:26:53

Last modified on: Friday, 19 April 2024 at 16:18:58

Dates and versions

halshs-02591388, version 1 (15-05-2020)

halshs-02591388, version 2 (05-02-2021)

Licence

Attribution - ShareAlike - CC BY 4.0

Identifiers

HAL Id: halshs-02591388, version 2
arXiv: submit/3179193
DOI: 10.46298/jdmdh.6485

Cite

Jean-Baptiste Camps, Simon Gabay, Paul Fièvre, Thibault Clérice, Florian Cafiero. Corpus and Models for Lemmatisation and POS-tagging of Classical French Theatre. Journal of Data Mining and Digital Humanities, 2021, ⟨10.46298/jdmdh.6485⟩. ⟨halshs-02591388v2⟩

Collections

BNF CNRS PSL ENC CAMPUS-CONDORCET CJM UP-SCIENCES LIED

443 views

915 downloads

Last updated on 21/04/2024