The TXM Platform: Building Open-Source Textual Analysis Software Compatible with the TEI Encoding Scheme

Abstract: This paper describes the rationale and design of TXM, a text-mining analysis platform compatible with corpora encoded in XML-TEI. The design synthesizes the best available algorithms from existing textometry software. It also relies on identifying the most relevant open-source technologies for three tasks: processing textual resources encoded in XML and Unicode, efficient full-text search on annotated corpora, and statistical data analysis. The architecture is based on a Java toolbox that couples a full-text search engine component with a statistical computing environment and with an original import environment, which can process a wide variety of data sources, including XML-TEI, and apply embedded NLP tools to them. The platform is distributed as an open-source Eclipse project for developers and as two demonstrator applications for end users: a standard application to install on a workstation and an online web application framework.
Document type:
Conference paper
Ryo Otoguro, Kiyoshi Ishikawa, Hiroshi Umemoto, Kei Yoshimoto and Yasunari Harada. 24th Pacific Asia Conference on Language, Information and Computation, Nov 2010, Sendai, Japan. Institute for Digital Enhancement of Cognitive Development, Waseda University, pp.389-398, 2010

https://halshs.archives-ouvertes.fr/halshs-00549764
Contributor: Serge Heiden
Submitted on: Wednesday, 22 December 2010 - 15:08:57
Last modified on: Tuesday, 21 June 2016 - 09:33:40
Document(s) archived on: Wednesday, 23 March 2011 - 02:33:15

File

paclic24_sheiden.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: halshs-00549764, version 1

Citation

Serge Heiden. The TXM Platform: Building Open-Source Textual Analysis Software Compatible with the TEI Encoding Scheme. Ryo Otoguro, Kiyoshi Ishikawa, Hiroshi Umemoto, Kei Yoshimoto and Yasunari Harada. 24th Pacific Asia Conference on Language, Information and Computation, Nov 2010, Sendai, Japan. Institute for Digital Enhancement of Cognitive Development, Waseda University, pp.389-398, 2010. <halshs-00549764>
