The CoMeRe corpus for French: structuring and annotating heterogeneous CMC genres

Abstract: The CoMeRe project aims to build a kernel corpus of different Computer-Mediated Communication (CMC) genres with interactions in French as the main language. It assembles interactions stemming from networks such as the Internet or telecommunication, covering monomodal and multimodal, synchronous and asynchronous communications. Corpora are assembled using a standard format, that of the TEI (Text Encoding Initiative). This implies extending, through a European endeavor, the TEI model of text in order to encompass the richest and most complex CMC genres. This paper presents the Interaction Space model and explains how it has been encoded within the TEI corpus header and body. The model is then instantiated through the first four corpora we have processed: three corpora where interactions occurred in single-modality environments (text chat or SMS systems) and a fourth corpus where text chat, email and forum modalities were used simultaneously. The CoMeRe project has two main research perspectives: Discourse Analysis, only alluded to in this paper, and the linguistic study of idiolects occurring in different CMC genres. As NLP algorithms are an indispensable prerequisite for such research, we present our motivations for applying an automatic annotation process to the CoMeRe corpora. Because we wished to guarantee generic annotations, we did not consider any processing beyond morphosyntactic labelling, but prioritized the automatic annotation of any freely variant elements within the corpora. We then turn to the decisions made concerning which annotations to make for which units, and describe the processing pipeline for adding them. All CoMeRe corpora are verified through a staged quality control process designed to allow corpora to move from one project phase to the next.
Public release of the CoMeRe corpora is a short-term goal: corpora will be integrated into the forthcoming French National Reference Corpus and disseminated through the national linguistic infrastructure ORTOLANG. We therefore highlight the issues raised and decisions made concerning the OpenData perspective.
Document type:
Journal article
JLCL - Journal for Language Technology and Computational Linguistics, 2014, 29 (2), pp.1-30. <http://www.jlcl.org/2014_Heft2/Heft2-2014.pdf>

https://halshs.archives-ouvertes.fr/halshs-00953507
Contributor: Thierry Chanier
Submitted on: Friday, 12 September 2014 - 16:17:18
Last modified on: Tuesday, 11 October 2016 - 15:08:12
Document(s) archived on: Saturday, 13 December 2014 - 10:36:15

File

cmr-article-jlcl-v140912-hal.p...
Files produced by the author(s)

Identifiers

  • HAL Id: halshs-00953507, version 2

Citation

Thierry Chanier, Céline Poudat, Benoit Sagot, Georges Antoniadis, Ciara R. Wigham, et al. The CoMeRe corpus for French: structuring and annotating heterogeneous CMC genres. JLCL - Journal for Language Technology and Computational Linguistics, 2014, 29 (2), pp.1-30. <http://www.jlcl.org/2014_Heft2/Heft2-2014.pdf>. <halshs-00953507v2>
