The CoMeRe corpus for French: structuring and annotating heterogeneous CMC genres

The CoMeRe project aims to build a kernel corpus of different Computer-Mediated Communication (CMC) genres in which French is the main language of interaction. It assembles interactions stemming from networks such as the Internet or telecommunication networks, covering monomodal and multimodal as well as synchronous and asynchronous communication. Corpora are assembled in a standard format, that of the TEI (Text Encoding Initiative). This implies extending, through a European endeavor, the TEI model of text so that it encompasses the richest and most complex CMC genres. This paper presents the Interaction Space model. We explain how this model has been encoded within the TEI corpus header and body. The model is then instantiated through the first four corpora we have processed: three corpora where interactions occurred in single-modality environments (text chat or SMS systems) and a fourth corpus where text chat, email and forum modalities were used simultaneously. The CoMeRe project has two main research perspectives: Discourse Analysis, only alluded to in this paper, and the linguistic study of idiolects occurring in different CMC genres. As NLP algorithms are an indispensable prerequisite for such research, we present our motivations for applying an automatic annotation process to the CoMeRe corpora. Our wish to guarantee generic annotations meant that we did not consider any processing beyond morphosyntactic labelling, but prioritized the automatic annotation of any freely variant elements within the corpora. We then turn to the decisions made concerning which annotations to apply to which units, and describe the processing pipeline for adding them. All CoMeRe corpora are verified through a staged quality control process, designed to allow corpora to move from one project phase to the next.
Public release of the CoMeRe corpora is a short-term goal: corpora will be integrated into the forthcoming French National Reference Corpus and disseminated through the national linguistic infrastructure ORTOLANG. We therefore highlight issues and decisions made concerning the OpenData perspective.

Keywords

Computer Mediated Communication, corpus, CoMeRe, CMC

Domains

Linguistics

Full metadata list

Deposit format	File
Deposit type	Journal article
Title	en The CoMeRe corpus for French: structuring and annotating heterogeneous CMC genres
Author(s)	Thierry Chanier ¹, Céline Poudat ², Benoît Sagot ³, Georges Antoniadis ⁴, Ciara R. Wigham ⁵, Linda Hriba ², Julien Longhi ⁶, Djamé Seddah ³, ⁷
1 LRL - Laboratoire de Recherche sur le Langage ( 229 ) - Maison des Sciences de l'Homme, 4 rue Ledru, 63057 Clermont-Ferrand Cedex 1 - France. Université Blaise Pascal - Clermont-Ferrand 2 EA999 ( 205618 )
2 LDI - Lexiques, Dictionnaires, Informatique ( 24509 ) - UFR Lettres, Sciences de l'Homme et des Sociétés, Université Paris 13, 99 avenue Jean-Baptiste Clément, F-93430, Villetaneuse - France. Université Paris 13 ( 15786 ) ; Université de Cergy Pontoise ( 300305 ) ; Université Paris-Seine ( 531928 ) ; Université Sorbonne Paris Cité ( 303171 ) ; Centre National de la Recherche Scientifique UMR7187 ( 441569 )
3 ALPAGE - Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing ( 54505 ) - Université Paris Diderot, Bât. Olympe de Gouges, case postale 7003, 75205 Paris cedex 13 - INRIA Rocquencourt - France. Inria Paris-Rocquencourt ( 86790 ) ; Institut National de Recherche en Informatique et en Automatique ( 300009 ) ; Université Paris Diderot - Paris 7 ( 300301 )
4 LIDILEM - LInguistique et DIdactique des Langues Étrangères et Maternelles ( 26828 ) - Bâtiment Stendhal - CS40700 - 38058 Grenoble cedex 9 - France. Université Stendhal - Grenoble 3 EA609 ( 5485 )
5 ICAR - Interactions, Corpus, Apprentissages, Représentations ( 51028 ) - 5, av Pierre Mendès-France 69676 BRON CEDEX - France. École normale supérieure de Lyon ( 6818 ) ; Université Lumière - Lyon 2 ( 33804 ) ; INRP ( 300042 ) ; Ecole Normale Supérieure Lettres et Sciences Humaines ( 303652 ) ; Centre National de la Recherche Scientifique UMR5191 ( 441569 )
6 CRTF - Centre de recherche textes et francophonies ( 156391 ) - Université de Cergy-Pontoise - 33, boulevard du Port - 95011 Cergy-Pontoise cedex - France. Université de Cergy Pontoise EA1392 ( 300305 ) ; Université Paris-Seine ( 531928 )
7 ISHA - Institut des Sciences Humaines Appliquées ( 179257 ) - Maison de la Recherche 28 rue Serpente 75006 Paris - France. Université Paris-Sorbonne ( 123821 )
Comment	Final version for the Special Issue of JLCL (Journal for Language Technology and Computational Linguistics, http://jlcl.org/): Building and Annotating Corpora of Computer-Mediated Discourse: Issues and Challenges at the Interface of Corpus and Computational Linguistics (ed. by Michael Beißwenger, Nelleke Oostdijk, Angelika Storrer & Henk van den Heuvel)
Publisher URL	http://www.jlcl.org/2014_Heft2/Heft2-2014.pdf
Page/Identifier	1-30
Issue	2
Volume	29
Production/writing date	2014-09-12
Journal name	JLCL - Journal for language technology and computational linguistics (ISSN: 0175-1336, electronic ISSN: 2190-6858). Published by GSCL (Gesellschaft für Sprachtechnologie und Computerlinguistik). https://www.uni-due.de/ub/ghbsys/jop?genre=journal&sid=bib:ughe&pid=bibid%3DUGHE&issn=2190-6858
Popular science	No
Peer-reviewed	Yes
Audience	International
Publication date	2014
Document language	English
Domain(s)	Humanities and Social Sciences/Linguistics
Collaboration/Project	The authors thank the LABEX ASLAN (ANR-10-LABX-0081) of the Université de Lyon for its financial support within the program "Investissements d'Avenir" (ANR-11-IDEX-0007) of the French government, operated by the Agence Nationale de la Recherche (ANR).
Keywords	en Computer Mediated Communication, corpus, CoMeRe, CMC

Main file

cmr-article-jlcl-v140912-hal.pdf (499.34 KB)

Origin: Files produced by the author(s)

https://shs.hal.science/halshs-00953507

Submitted on: Friday, 12 September 2014 at 16:17:18

Last modified on: Thursday, 4 April 2024 at 21:19:07

Long-term archiving on: Saturday, 13 December 2014 at 10:36:15

Dates and versions

halshs-00953507, version 1 (28-02-2014)

halshs-00953507, version 2 (12-09-2014)

Identifiers

HAL Id: halshs-00953507, version 2

Cite

Thierry Chanier, Céline Poudat, Benoît Sagot, Georges Antoniadis, Ciara R. Wigham, et al. The CoMeRe corpus for French: structuring and annotating heterogeneous CMC genres. Journal for language technology and computational linguistics, 2014, 29 (2), pp.1-30. ⟨halshs-00953507v2⟩

Collections

ENS-LYON UNIV-PARIS7 UNIV-PARIS13 UNIV-RENNES1 UGA PRES_CLERMONT CNRS INRIA UNIV-LYON2 UNIV-CERGY IRISA ICAR LRL LDI LIDILEM INRIA2 GENCI CAMPUS-AAR AAI UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES SORBONNE-UNIVERSITE SU-LETTRES CAMPUS-CONDORCET UDL SORBONNE-PARIS-NORD UR1-MATH-NUM ACT-R
