Conference papers

Métopes + TXM: Integrating Text Publishing and Text Analysis Tools Based on TEI Encoding

Abstract: This paper reports on our experience of building workflows for text publishing and text corpus analysis projects that integrate, by means of TEI encoding, two sets of tools created for different purposes. The first set of tools is Métopes (Méthodes et outils pour l’édition structurée, or Methods and tools for structured publishing). It was developed by the Pôle document numérique of the Research centre for the Humanities (MRSH) in Caen (France) and consists of a full single-source publishing toolchain. After primary editing in Microsoft Word, documents are converted to TEI using special macros. TEI is then the core format for further editing and for all publication forms, including PDF for printing (finalized with InDesign), EPUB files, and online editions produced dynamically from the TEI sources by the MaX tool (based on BaseX). Métopes has been adopted by a number of French academic publishers. TXM, on the other hand, is a free and open-source (GPL v3 licence) Java- and C-based platform for text corpus building, annotation and analysis. It includes NLP tools, search engines and visualization tools, with convenient hyperlinks from distant, synthetic quantitative analyses to close-reading views. TXM uses TEI (with a couple of extension elements) as its internal format for encoding the text structure and all kinds of annotations. Both Métopes and TXM thus rely on TEI markup. However, Métopes focuses on general text structure and on presentational aspects (e.g. it is very sensitive to white space), while TXM needs to perform precise linguistic analysis (e.g. tokenisation, language identification in multi-language documents). Thanks to funding from the CAHIER consortium, an intern from Caen worked with the TXM team for three months in 2017. He created a set of XSLT and CSS stylesheets that make it possible to correctly parse and analyse a Métopes-produced text file or corpus with TXM, and to generate high-quality publications from texts prepared for TXM analysis.
In many cases both tools use the same TEI tags, which makes integration quite straightforward. In other cases, more work is necessary to ensure full compatibility (e.g. generating a table of contents in TXM or supporting word-level annotation in the Métopes tools). A further integration step could be to create a single editorial and analytical toolchain for text scholars. A simplified workflow chart of this chain is presented in Figure 1. The work done so far is documented on a wiki page, and the scripts (Groovy and XSLT) are available for download under an LGPL licence. All documentation is currently available only in French, but we welcome collaboration on translating it into English and other languages.
