HAL : halshs-00549764, version 1

24th Pacific Asia Conference on Language, Information and Computation, Sendai, Japan (2010)
The TXM Platform: Building Open-Source Textual Analysis Software Compatible with the TEI Encoding Scheme
Serge Heiden 1
(2010-11-04)

This paper describes the rationale and design of TXM, a text-mining analysis platform compatible with corpora encoded in XML-TEI. The platform's design synthesizes the best available algorithms from existing textometry software. It also relies on identifying the most relevant open-source technologies for processing textual resources encoded in XML and Unicode, for efficient full-text search on annotated corpora, and for statistical data analysis. The architecture is based on a Java toolbox that couples a full-text search engine with a statistical computing environment and with an original import environment able to process a wide variety of data sources, including XML-TEI, and to apply embedded NLP tools to them. The platform is distributed as an open-source Eclipse project for developers and as two demonstrator applications for end users: a standard application installed on a workstation and an online web application framework.
1 :  Interactions, Corpus, Apprentissages, Représentations (ICAR)
CNRS : UMR5191 – Université Lumière - Lyon II – Ecole Normale Supérieure Lettres et Sciences Humaines – INRP – École Normale Supérieure - Lyon
ICAR3
Humanities and Social Sciences/Methods and statistics

Computer Science/Document and Text Processing

Statistics/Applications

Computer Science/Computation and Language

Computer Science/Digital Libraries

Humanities and Social Sciences/Linguistics
xml-tei corpora – search engine – statistical analysis – textometry – open-source
Files attached to this document:
PDF
paclic24_sheiden.pdf (1 MB)