Conference papers

Analyzing TEI encoded texts with the TXM platform

Abstract : TXM is an open-source software platform providing tools for qualitative and quantitative content analysis of text corpora. It implements the textometric (formerly lexicometric) methods developed in France since the 1980s, as well as commonly used tools for corpus search and statistical text analysis (Heiden 2010).

TXM uses a TEI extension called “XML-TXM” as its native format for storing corpus source texts that have been tokenized and annotated with NLP tools (mediawiki/txm/index.php?title=XML-TXM). The capacity to import and correctly analyze TEI encoded texts was one of the features requested in the original design of the platform.

However, the flexibility of the TEI framework (which is its strength) and the variety of encoding practices make it virtually impossible to work out a universal strategy for building a properly structured corpus (i.e. one compatible with the data model of the search and analysis engines) out of an arbitrary TEI encoded text or group of texts. It should nevertheless be possible to define a subset of TEI elements that would be correctly interpreted during the various stages of the corpus import process (for example, the TEI Lite tag set), to specify the minimum requirements for the document structure, and to suggest a mechanism for customization. This work is being carried out progressively by the TXM development team, but it can hardly succeed without input from the TEI community.

The goal of this paper is to present the way TXM currently deals with importing TEI encoded corpora and to discuss ways to improve this process by interpreting TEI elements in terms of the TXM data model.
Document type :
Conference papers

Cited literature [3 references]
Contributor : Alexei Lavrentiev
Submitted on : Wednesday, February 18, 2015 - 3:04:15 PM
Last modification on : Thursday, June 4, 2020 - 5:02:01 PM
Long-term archiving on : Tuesday, May 19, 2015 - 10:30:46 AM


Files produced by the author(s)


  • HAL Id : halshs-01118120, version 1



Alexei Lavrentiev, Serge Heiden, Matthieu Decorde. Analyzing TEI encoded texts with the TXM platform. The Linked TEI: Text Encoding in the Web. TEI Conference and Members Meeting 2013, Oct 2013, Rome, Italy. ⟨halshs-01118120⟩


