A new set of automatic text processing tools for the TXM platform and its testing on a corpus for the analysis of extremist texts

Abstract : The TXM platform provides a wide range of corpus analysis tools, including correspondence analysis, clustering, lexical table construction, and parametrized subcorpus selection. The default structural unit of analysis in TXM is the token. The only TXM extension available by default is TreeTagger, which performs automated morphological analysis and lemmatization during corpus import. However, each token can be supplied with a number of additional features that enable more advanced text analysis. In this work we present a set of tools developed for more extensive, complex, and flexible corpus analysis with TXM, relying both on tools previously developed by our team and on publicly available software libraries. We focus in particular on a stemming technique based on word structural patterns and on noun phrase recognition, which together make it possible to run more sophisticated and powerful queries and analyses that are not limited to word forms. The structural pattern stemming method is based on a set of language-specific rules that separate a word stem from all of its affixes. Noun phrase recognition is based on rules that detect subordination and coordination relations among nouns. These extensions improve the performance of the statistical tools used by TXM, such as specificity scores and correspondence analysis.
The new set of tools has been tested on a corpus that includes texts marked as "extremist" by experts along with "neutral" texts from similar domains. The corpus of approximately 900,000 words is divided into eight subcorpora: the neutral texts are contrasted with seven thematic subcorpora considered extremist (namely aggressive, fascist, ideological, nationalistic, religious, separatist, and terroristic). The specificity analysis detects the words (or other structural units) that are significantly more or less frequent in a given subcorpus than in the entire corpus. The specificity scores of selected units can be compared across all the subcorpora in order to assess how different or similar the subcorpora are. The correspondence analysis produces a chart in which the subcorpora are represented as points in a two-dimensional space according to how similar they are in the frequency of the selected units. All tests showed a significant difference between the neutral texts on the one hand and the texts marked as extremist on the other. Two "extremist" subcorpora, religious and ideological, produced similar results and can probably be merged. These findings encourage further research on fully automatic or computer-aided expert recognition of extremist texts.
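The idea behind a specificity score can be illustrated with a minimal Python sketch. The abstract does not give the formula, so the hypergeometric model used here is an assumption, as are the function names and the signed log10 convention; it should not be read as TXM's actual implementation. The score is positive when a unit is over-represented in a subcorpus and negative when it is under-represented.

```python
from math import lgamma, exp, log10


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k), computed via lgamma."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def hypergeom_pmf(k, T, t, F):
    """Probability of seeing exactly k occurrences of a unit in a subcorpus
    of t tokens, given F occurrences in a corpus of T tokens."""
    if k < max(0, t - (T - F)) or k > min(t, F):
        return 0.0
    return exp(log_comb(F, k) + log_comb(T - F, t - k) - log_comb(T, t))


def specificity(f, T, t, F):
    """Signed log10 specificity score (hypothetical convention):
    positive = over-represented, negative = under-represented."""
    lo, hi = max(0, t - (T - F)), min(t, F)
    p_upper = sum(hypergeom_pmf(k, T, t, F) for k in range(f, hi + 1))
    p_lower = sum(hypergeom_pmf(k, T, t, F) for k in range(lo, f + 1))
    floor = 1e-300  # avoid log10(0) for extremely small tail probabilities
    if p_upper < p_lower:
        return -log10(max(p_upper, floor))
    return log10(max(p_lower, floor))


if __name__ == "__main__":
    # Toy numbers (not from the paper): corpus of 1000 tokens, subcorpus of 100,
    # unit occurs 50 times overall, so about 5 occurrences are expected.
    for observed in (0, 5, 20):
        print(observed, round(specificity(observed, 1000, 100, 50), 2))
```

Computing the same score for one unit in each of the eight subcorpora gives a profile that can be compared across subcorpora, which is the kind of comparison the abstract describes.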
Document type :
Journal articles

Contributor : Alexei Lavrentiev
Submitted on : Monday, September 24, 2018 - 3:56:04 PM
Last modification on : Thursday, February 7, 2019 - 3:45:21 PM



Alexei Lavrentiev, Fedor Solovyev, Margarita Suvorova, Alina Fokina, Andrey Chepovskiy. Новый комплекс инструментов автоматической обработки текста для платформы TXM и его апробация на корпусе для анализа экстремистских текстов [A new set of automatic text processing tools for the TXM platform and its testing on a corpus for the analysis of extremist texts]. Vestnik NSU. Series: Linguistics and Intercultural Communication, Novosibirsk State University, 2018, 16 (3), pp.19-31. ⟨https://nsu.ru/archive⟩. ⟨10.25202/1818-7935-2018-16-3-19-31⟩. ⟨halshs-01880207⟩


