Analyzing TEI encoded texts with the TXM platform

Alexei Lavrentiev; Serge Heiden; Matthieu Decorde

Conference paper, Year: 2013

Alexei Lavrentiev

Role: Author
PersonId : 2718
IdHAL : alavrent
ORCID : 0000-0001-8306-3653
IdRef : 117944688

Interactions, Corpus, Apprentissages, Représentations

Serge Heiden

Role: Author
PersonId : 7692
IdHAL : serge-heiden
ORCID : 0000-0003-4682-7647
IdRef : 111293383

Interactions, Corpus, Apprentissages, Représentations

Matthieu Decorde

Role: Author
PersonId : 734637
IdHAL : matthieu-decorde
ORCID : 0000-0002-1004-6104

Interactions, Corpus, Apprentissages, Représentations

Abstract

TXM (http://sf.net/projects/txm) is an open-source software platform providing tools for qualitative and quantitative content analysis of text corpora. It implements the textometric (formerly lexicometric) methods developed in France since the 1980s, as well as commonly used tools for corpus search and statistical text analysis (Heiden 2010).

TXM uses a TEI extension called “XML-TXM” as its native format for storing corpus source texts that have been tokenized and annotated with NLP tools (http://sourceforge.net/apps/mediawiki/txm/index.php?title=XML-TXM). The capacity to import and correctly analyze TEI encoded texts was one of the features requested in the original design of the platform.

However, the flexibility of the TEI framework (which is its strength) and the variety of encoding practices make it virtually impossible to work out a universal strategy for building a properly structured corpus (i.e. one compatible with the data model of the search and analysis engines) out of an arbitrary TEI encoded text or group of texts. It should nevertheless be possible to define a subset of TEI elements that would be correctly interpreted during the various stages of the corpus import process (for example, the TEI-lite tag set), to specify the minimum requirements for the document structure, and to suggest a mechanism for customization. This work is being progressively carried out by the TXM development team, but it can hardly succeed without input from the TEI community.

The goal of this paper is to present the way TXM currently deals with importing TEI encoded corpora and to discuss ways to improve this process by interpreting TEI elements in terms of the TXM data model.
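As an illustration of what interpreting a subset of TEI elements in terms of a corpus data model could look like, here is a minimal Python sketch. It is not TXM's import code. The `STRUCTURAL` and `SKIPPED` tables, the `import_tei` function, the sample document, and the naive whitespace tokenizer are all assumptions made for this example; only the standard library is used. Elements listed as structural become part of each token's context path, skipped elements contribute no tokens, and any other element is treated as transparent.

```python
import xml.etree.ElementTree as ET

# Hypothetical interpretation table (an assumption, not TXM's actual rules).
STRUCTURAL = {"div", "p", "head", "lg", "l"}  # become structural units
SKIPPED = {"teiHeader", "note"}               # their content yields no tokens


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def import_tei(xml_text):
    """Return a flat list of tokens, each with the path of enclosing structures."""
    root = ET.fromstring(xml_text)
    tokens = []

    def emit(text, path):
        # Naive whitespace tokenizer; a real importer would use an NLP tool.
        for word in (text or "").split():
            tokens.append({"word": word, "path": "/".join(path)})

    def walk(elem, path):
        name = local_name(elem.tag)
        if name in SKIPPED:
            return
        if name in STRUCTURAL:
            path = path + [name]
        emit(elem.text, path)
        for child in elem:
            walk(child, path)
            # Text following a child belongs to the parent's context,
            # even when the child itself was skipped.
            emit(child.tail, path)

    walk(root, [])
    return tokens


SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Demo</title></titleStmt></fileDesc></teiHeader>
  <text><body><div><head>Chapter one</head>
    <p>It was a <hi rend="italic">dark</hi> night<note>editorial remark</note> indeed.</p>
  </div></body></text>
</TEI>"""

if __name__ == "__main__":
    for token in import_tei(SAMPLE):
        print(token["path"], token["word"])
```

In this sketch, customization amounts to editing the two tables: moving `note` out of `SKIPPED`, for instance, would bring editorial notes into the token stream. Elements absent from both tables, such as `hi`, neither create a structural unit nor hide their text.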

Keywords

Textometry, TEI, digital philology

Fields

Linguistics

Full metadata list

Deposit format	File
Deposit type	Conference paper
Author(s)	Alexei Lavrentiev ¹, Serge Heiden ¹, Matthieu Decorde ¹; ¹ ICAR - Interactions, Corpus, Apprentissages, Représentations (51028), 5, av Pierre Mendès-France, 69676 BRON CEDEX, France; École normale supérieure de Lyon (6818); Université Lumière - Lyon 2 (33804); INRP (300042); Ecole Normale Supérieure Lettres et Sciences Humaines (303652); Centre National de la Recherche Scientifique UMR5191 (441569)
Conference or publisher URL	http://digilab2.let.uniroma1.it/teiconf2013/program/papers/abstracts-paper#C139
Country	Italy
City	Rome
Conference end date	2013-10-05
Conference start date	2013-10-02
Peer-reviewed	Yes
Invited	No
Audience	International
Proceedings	No
Publication date	2013-10-02
Conference title	The Linked TEI: Text Encoding in the Web. TEI Conference and Members Meeting 2013
Popular science	No
Document language	English
Collaboration/Project	The authors thank the LABEX ASLAN (ANR-10-LABX-0081) of the Université de Lyon for its financial support within the "Investissements d'Avenir" program (ANR-11-IDEX-0007) of the French State, managed by the Agence Nationale de la Recherche (ANR).
Field(s)	Humanities and Social Sciences/Linguistics
Keywords	Textometry, TEI, digital philology

Main file

Lavrentev-etal_teimm2013-hal.pdf (66.81 KB)

Origin: Files produced by the author(s)

Contributor: Alexey Lavrentev

https://shs.hal.science/halshs-01118120

Submitted on: Wednesday, 18 February 2015, 15:04:15

Last modified on: Friday, 12 May 2023, 04:05:25

Long-term archiving on: Tuesday, 19 May 2015, 10:30:46

Dates and versions

halshs-01118120, version 1 (18-02-2015)

Identifiers

HAL Id: halshs-01118120, version 1

Cite

Alexei Lavrentiev, Serge Heiden, Matthieu Decorde. Analyzing TEI encoded texts with the TXM platform. The Linked TEI: Text Encoding in the Web. TEI Conference and Members Meeting 2013, Oct 2013, Rome, Italy. ⟨halshs-01118120⟩

Collections

ENS-LYON CNRS UNIV-LYON2 ICAR UDL

330 views

414 downloads

Last updated on 20 April 2024
