Structuring a CMC corpus of political tweets in TEI: corpus features, ethics and workflow

Abstract: The CoMeRe project (CoMeRe, 2014) aims to build a kernel corpus of computer-mediated communication (CMC) genres with interactions in the French language. Three key words characterize the project: variety, standards and openness. The project gathered mono- and multimodal, synchronous and asynchronous communication data from both Internet and telecommunication networks (text chat, tweets, SMS messages, forums, blogs). A variety of interactions was sought: public or private interactions as well as interactions from informal, learning and professional situations. Whereas some CMC data types were collected within the CoMeRe project, others had previously been collected and structured within different project partners’ local research teams. This meant that the project had to overcome disparities in corpus compilation choices. For this reason, the CoMeRe project structured the corpora in a uniform way using the Text Encoding Initiative format (TEI, Burnard & Bauman, 2013) and decided to describe each corpus using Dublin Core and OLAC standards for metadata (DCMI, 2014; OLAC, 2008). The TEI model was extended in order to encompass the Interaction Space (IS) of CMC multimodal discourse (Chanier et al., 2014). The term ‘openness’ also characterizes the project: the corpora have been released as open data on the French national platform of linguistic resources (ORTOLANG, 2013) in order to pave the way for scientific examination by partners not involved in the project as well as for replicative and cumulative research. This poster presentation aims to give an overview of the corpus building process using, as a case study, a corpus of political tweets, cmr-polititweets (Longhi et al., 2014). The corpus stemmed from a local research project on lexicon (Digital Humanities and datajournalism, supported by the Fondation of Cergy-Pontoise University). It was built starting from seven French politicians from six different political parties.
To collect political tweets, a set of lists citing these politicians was generated (7087 lists), and lists that had tweeted at least six times and whose description contained the word ‘politics’ were selected (120 lists in total). Finally, 2934 tweets were recovered. To be sure that we selected politicians’ tweets (and not, for example, those of journalists), only the accounts cited in more than 12 lists were considered; this yielded 205 politicians’ accounts. We took the last 200 tweets of each of the 205 accounts on 27 March 2014 (34,273 tweets). This allowed us to recover data that focused on the period between the two rounds of the 2014 municipal elections in France. The poster will focus, firstly, on how features specific to Twitter were included and structured in the interaction space TEI model. We will exemplify how three such features were integrated into the model: hashtags, which label tweets so that other users can see tweets on the same topic; at signs (@), which allow a user to mention or reply to other users; and retweets, which allow a user to repost a message from another Twitter user and share it with their own followers. Secondly, the poster will outline some of the ethical and rights issues that had to be considered before publishing a corpus of tweets. Finally, the workflow and multi-stage quality control process adopted during the building of the corpus will be illustrated. This was an essential aspect considering that the corpus underwent format conversions: the local research team had initially structured the corpus in XML, whilst the CoMeRe project applied the IS TEI model to the corpus. The political tweets corpus is now structured and available online. Analyses have started to be carried out: some initial ideas were put forward in Djemili et al. (2014), but further analyses must adhere rigorously to methodologies stemming from the natural language processing (NLP) field.
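The two selection steps described in the abstract (filtering lists, then keeping accounts cited in more than 12 lists) can be sketched as follows. This is only an illustrative Python sketch, not the code used by the project: the `TwitterList` record, its field names and both function names are assumptions introduced here, and the thresholds are taken from the figures quoted above.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TwitterList:
    """Hypothetical record for one Twitter list (field names are assumptions)."""
    description: str
    tweet_count: int
    members: list = field(default_factory=list)  # account names cited by the list


def select_lists(lists, keyword="politics", min_tweets=6):
    """Keep lists with at least `min_tweets` tweets whose description contains `keyword`."""
    return [
        lst for lst in lists
        if lst.tweet_count >= min_tweets and keyword in lst.description.lower()
    ]


def select_accounts(selected_lists, min_citations=12):
    """Keep accounts cited in strictly more than `min_citations` of the selected lists."""
    # set() ensures an account is counted at most once per list.
    citations = Counter(
        account for lst in selected_lists for account in set(lst.members)
    )
    return sorted(acc for acc, n in citations.items() if n > min_citations)
```

Applied to the figures in the abstract, the first function would correspond to the reduction from 7087 lists to 120, and the second to the identification of the 205 accounts; how list activity and citations were actually measured is not specified in the abstract.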
Document type:
Poster
Corpus Linguistics 2015, Jul 2015, Lancaster, United Kingdom. 2015, 〈http://ucrel.lancs.ac.uk/cl2015/〉

https://halshs.archives-ouvertes.fr/halshs-01176061
Contributor: Ciara R. Wigham
Submitted on: Tuesday, 14 July 2015 - 12:00:15
Last modified on: Tuesday, 21 June 2016 - 09:35:05
Document(s) archived on: Thursday, 15 October 2015 - 10:13:37

Files

LonghiWigham_CL2015_poster.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : halshs-01176061, version 1

Citation

Julien Longhi, Ciara R. Wigham. Structuring a CMC corpus of political tweets in TEI: corpus features, ethics and workflow. Corpus Linguistics 2015, Jul 2015, Lancaster, United Kingdom. 2015, 〈http://ucrel.lancs.ac.uk/cl2015/〉. 〈halshs-01176061〉
