Новый комплекс инструментов автоматической обработки текста для платформы TXM и его апробация на корпусе для анализа экстремистских текстов

Alexei Lavrentiev; Fedor Solovyev; Margarita Suvorova; Alina Fokina; Andrey Chepovskiy

doi:10.25205/1818-7935-2018-16-3-19-31

Journal article. Vestnik NSU. Series: Linguistics and Intercultural Communication. Year: 2018

A New Toolkit for Natural Text Processing with the TXM Platform and its Application to a Corpus for Analysis of Texts Propagating Extremist Views


Alexei Lavrentiev

Role: Author
PersonId: 2718
IdHAL: alavrent
ORCID: 0000-0001-8306-3653
IdRef: 117944688

Institut d’Histoire des Représentations et des Idées dans les Modernités

Fedor Solovyev

Role: Author

Institute of Physical and Technical Informatics

Margarita Suvorova

Role: Author

Federal Research Center – Computer Science and Control, RAS

Alina Fokina

Role: Author

Vysšaja škola èkonomiki = National Research University Higher School of Economics [Moscow]

Andrey Chepovskiy

Role: Author
PersonId: 960564

Vysšaja škola èkonomiki = National Research University Higher School of Economics [Moscow]

Abstract

The TXM platform provides a wide range of corpus analysis tools, including correspondence analysis, clustering, lexical table construction, and parametrized subcorpus selection. The default structural unit of analysis in TXM is the token. The only TXM extension available by default is TreeTagger, which performs automated morphological analysis and lemmatization during corpus import. However, each token can be supplied with additional features that enable more advanced text analysis. In this work we present a set of tools developed for more extensive, complex and flexible corpus analysis with TXM, relying both on tools previously developed by our team and on publicly available software libraries. We focus in particular on a stemming technique that uses a word structural pattern method and on noun phrase recognition; together they make it possible to run more sophisticated and powerful queries and analyses of the corpus that are not limited to word forms. The structural pattern stemming method is based on a set of specific language rules that separate a word stem from all affixes. Noun phrase recognition is based on rules that detect subordination and coordination relations among nouns. These extensions improve the performance of the statistical tools used by TXM, such as specificity scores and correspondence analysis. The new set of tools has been tested on a corpus that includes texts marked as "extremist" by experts along with "neutral" texts in similar domains. The corpus of approximately 900,000 words is divided into eight subcorpora: neutral texts are contrasted with seven thematic subcorpora considered extremist (aggressive, fascist, ideological, nationalistic, religious, separatist, and terroristic). Specificity analysis detects the words (or other structural units) that are significantly more or less frequent in a given subcorpus than in the entire corpus. The specificity score for selected units can be compared across all the subcorpora to verify their difference or similarity. Correspondence analysis produces a chart in which the subcorpora are represented as points in a two-dimensional space according to how similar they are in the frequency of selected units. All tests showed a significant difference between the neutral texts on the one hand and the marked texts on the other. Two "extremist" subcorpora, religious and ideological, showed similar results and can probably be merged. These findings encourage further research on fully automatic or computer-aided expert recognition of extremist texts.
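The abstract describes specificity analysis only in words and gives no formula. As a rough illustration, the sketch below assumes a hypergeometric model, a common choice for this kind of score; it is not taken from the paper or from the TXM code base. The function names and the toy counts are hypothetical.

```python
import math

def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def hypergeom_pmf(k, corpus_size, part_size, total_freq):
    """P(X = k): the unit occurs k times in a subcorpus of part_size tokens,
    given total_freq occurrences in a corpus of corpus_size tokens."""
    return math.exp(
        log_comb(total_freq, k)
        + log_comb(corpus_size - total_freq, part_size - k)
        - log_comb(corpus_size, part_size)
    )

def specificity(part_freq, total_freq, part_size, corpus_size):
    """Signed score (assumed definition): -log10 of the upper tail when the
    unit is over-represented, +log10 of the lower tail when under-represented."""
    lowest = max(0, part_size - (corpus_size - total_freq))
    highest = min(total_freq, part_size)
    expected = total_freq * part_size / corpus_size
    if part_freq >= expected:
        ks, sign = range(part_freq, highest + 1), 1.0
    else:
        ks, sign = range(lowest, part_freq + 1), -1.0
    tail = sum(hypergeom_pmf(k, corpus_size, part_size, total_freq) for k in ks)
    tail = min(max(tail, 1e-300), 1.0)  # guard against underflow and rounding
    return sign * -math.log10(tail)

# Invented toy counts: a unit with 50 occurrences in a 1,000-token corpus,
# observed 20 times or 0 times in a 100-token subcorpus (5 expected).
over_represented = specificity(20, 50, 100, 1000)
under_represented = specificity(0, 50, 100, 1000)
```

With these toy counts the first score is strongly positive and the second negative, which is the kind of contrast one would compare across subcorpora.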

The TXM platform offers extensive corpus analysis capabilities, such as correspondence analysis, clustering, construction of lexical tables, search for complex lexical constructions, and selection of subcorpora by various parameters. By default the platform works with tokens as the structural units of analysis. It is integrated with a single extension, TreeTagger, which performs only morphological analysis and lemmatization of tokens. However, the user can attach to each token a set of additional features that make the analysis considerably more sophisticated and flexible. This paper describes a set of utilities we have developed that, drawing both on our own software and on off-the-shelf analysis tools, extends and deepens corpus analysis in the TXM platform. Of particular interest are the extraction of a pseudo-stem from the words of a text using the structural pattern method and the identification of noun phrases in the structure of a text. These extensions improve the effectiveness of methods used by TXM such as specificity analysis and correspondence analysis. As a trial, we report the results of an experiment on a corpus containing texts assessed by experts as extremist and "neutral" texts on similar topics (religion, politics, ideology). All tests show a pronounced opposition between neutral and marked texts, and the results provide a basis for continuing work on automatic and semi-automatic detection of potentially unlawful texts.
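Correspondence analysis, named in the abstract as one of the TXM methods, places the rows of a frequency table (here, subcorpora) as points on a small number of axes. The sketch below is a minimal pure-Python version under stated assumptions: it uses the textbook formulation (standardized residuals, then principal row coordinates) and a simple power iteration in place of a library SVD routine. It is not the TXM implementation, and the table of counts is invented for illustration.

```python
import math

def correspondence_analysis(table, n_dims=2, n_iter=500):
    """Return principal row coordinates of a contingency table (rows x columns)."""
    total = sum(sum(row) for row in table)
    n_rows, n_cols = len(table), len(table[0])
    r = [sum(row) / total for row in table]  # row masses
    c = [sum(table[i][j] for i in range(n_rows)) / total for j in range(n_cols)]
    # Standardized residuals: (observed - expected) / sqrt(expected), as proportions.
    S = [[(table[i][j] / total - r[i] * c[j]) / math.sqrt(r[i] * c[j])
          for j in range(n_cols)] for i in range(n_rows)]
    # A = S S^T is symmetric; its eigenvectors are the left singular vectors of S.
    A = [[sum(S[i][k] * S[j][k] for k in range(n_cols)) for j in range(n_rows)]
         for i in range(n_rows)]
    axes = []  # (eigenvalue, unit eigenvector) pairs, largest first
    for _axis in range(n_dims):
        v = [float(i + 1) for i in range(n_rows)]  # arbitrary deterministic start
        for _step in range(n_iter):
            w = [sum(A[i][j] * v[j] for j in range(n_rows)) for i in range(n_rows)]
            # Deflation: remove components along axes already found.
            for _ev, u in axes:
                dot = sum(wi * ui for wi, ui in zip(w, u))
                w = [wi - dot * ui for wi, ui in zip(w, u)]
            norm = math.sqrt(sum(wi * wi for wi in w))
            v = [wi / norm for wi in w]
        Av = [sum(A[i][j] * v[j] for j in range(n_rows)) for i in range(n_rows)]
        axes.append((sum(a * b for a, b in zip(Av, v)), v))
    # Principal coordinate of row i on axis k: u_ik * sigma_k / sqrt(r_i).
    return [[u[i] * math.sqrt(max(ev, 0.0)) / math.sqrt(r[i]) for ev, u in axes]
            for i in range(n_rows)]

# Invented counts of three units in four hypothetical subcorpora.
labels = ["neutral", "religious", "ideological", "separatist"]
counts = [[30, 10, 5], [5, 20, 10], [10, 40, 20], [5, 5, 30]]
coords = correspondence_analysis(counts)
for label, point in zip(labels, coords):
    print(label, [round(x, 3) for x in point])
```

In this toy table the second and third rows have proportional counts, so they land on the same point. That is how similar frequency profiles show up as nearby points on the chart.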

Domains

Linguistics

Full metadata list

Deposit format	Record
Deposit type	Journal article
Title	en: A New Toolkit for Natural Text Processing with the TXM Platform and its Application to a Corpus for Analysis of Texts Propagating Extremist Views; ru: Новый комплекс инструментов автоматической обработки текста для платформы TXM и его апробация на корпусе для анализа экстремистских текстов
Author(s)	Alexei Lavrentiev (1), Fedor Solovyev (2), Margarita Suvorova (3), Alina Fokina (4), Andrey Chepovskiy (4)
1: IHRIM - Institut d’Histoire des Représentations et des Idées dans les Modernités (453233), ENS de Lyon, 15 parvis René Descartes, BP 7000, 69342 Lyon Cedex 07, France. École normale supérieure de Lyon (6818); Université Lumière - Lyon 2 (33804); Université Jean Moulin - Lyon 3 (118744); Université de Lyon (301088); Université Blaise Pascal - Clermont-Ferrand 2 (205618); Université Jean Monnet - Saint-Étienne (300284); Université Clermont Auvergne [2017-2020], UMR5317 (422708); Centre National de la Recherche Scientifique, UMR5317 (441569)
2: Institute of Physical and Technical Informatics (543512), Russia
3: Federal Research Center – Computer Science and Control, RAS (543513), Russia
4: HSE - Vysšaja škola èkonomiki = National Research University Higher School of Economics [Moscow] (466917), 20 Myasnitskaya Ulitsa, 101000 Moscow, Russia
Document language	Russian
Journal name	Vestnik NSU. Series: Linguistics and Intercultural Communication (ISSN: 1818-7935), published by Novosibirsk State University
Popular science	No
Peer-reviewed	Yes
Audience	International
Publication date	2018-09-03
Electronic publication date	2018-09-03
Volume	16
Issue	3
Pages/Identifier	19-31
Publisher URL	https://nsu.ru/archive
Domain(s)	Humanities and Social Sciences/Linguistics
DOI	10.25205/1818-7935-2018-16-3-19-31

Contributor: Alexey Lavrentev

https://shs.hal.science/halshs-01880207

Submitted on: Monday, 24 September 2018, 15:56:04

Last modified on: Friday, 12 May 2023, 03:52:10

Dates and versions

halshs-01880207, version 1 (24-09-2018)

Identifiers

HAL Id: halshs-01880207, version 1
DOI: 10.25205/1818-7935-2018-16-3-19-31

Cite

Alexei Lavrentiev, Fedor Solovyev, Margarita Suvorova, Alina Fokina, Andrey Chepovskiy. Новый комплекс инструментов автоматической обработки текста для платформы TXM и его апробация на корпусе для анализа экстремистских текстов. Vestnik NSU. Series: Linguistics and Intercultural Communication, 2018, 16 (3), pp.19-31. ⟨10.25205/1818-7935-2018-16-3-19-31⟩. ⟨halshs-01880207⟩


Collections

UNIV-ST-ETIENNE ENS-LYON UNIV-LYON3 PRES_CLERMONT CNRS UNIV-LYON2 CERHAC IHRIM UDL

57 views

0 downloads

Last updated on 21/04/2024
