Conference papers

Linguistic Processing of Digital Editions of 18th-Century Russian Texts

Abstract : This paper addresses problems of linguistic processing of 18th-century Russian texts that arose during work on digital editions of the printed translation of Al’Quran (1716) and a manuscript translation of La Belle et la Bête (Beauty and the Beast, 1758). The linguistic processing comprises spelling normalization, tokenization, morphological annotation and lemmatization. The work involved manual pre-markup in Microsoft Word, conversion to TEI XML, and further automatic processing on the TXM platform, including annotation with TreeTagger and construction of a multi-layer transcription. In the Al’Quran edition, spelling normalization is fully automated but handles only the simplest cases, whereas in La Belle et la Bête manual pre-markup makes it possible to generate a modern form for every word.
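As a rough illustration of what automated normalization of "the simplest cases" might look like, the sketch below maps a few pre-reform Cyrillic letters to their modern equivalents, drops the word-final hard sign after a consonant, and then splits the text into tokens. The letter table, the function names `normalize_spelling` and `tokenize`, and the rules themselves are assumptions made for this example; they are not the rules used in the editions described above, and they do not reproduce the TXM or TreeTagger processing.

```python
import re

# Hypothetical letter table (an assumption for illustration only):
# pre-reform Cyrillic letters and their usual modern counterparts.
LETTER_MAP = str.maketrans({
    "ѣ": "е", "Ѣ": "Е",   # yat
    "і": "и", "І": "И",   # decimal i
    "ѳ": "ф", "Ѳ": "Ф",   # fita
    "ѵ": "и", "Ѵ": "И",   # izhitsa
})

# Hard sign at the end of a word, directly after a consonant letter.
FINAL_HARD_SIGN = re.compile(
    r"(?<=[бвгджзклмнпрстфхцчшщ])ъ\b", re.IGNORECASE
)


def normalize_spelling(text: str) -> str:
    """Apply the letter table, then drop word-final hard signs."""
    text = text.translate(LETTER_MAP)
    return FINAL_HARD_SIGN.sub("", text)


def tokenize(text: str) -> list[str]:
    """Split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


if __name__ == "__main__":
    sample = "миръ, хлѣбъ."
    print(tokenize(normalize_spelling(sample)))
```

Rules of this kind cover only regular letter substitutions; forms whose modern spelling cannot be derived letter by letter would still need manual markup or a lexicon.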
Document type : Conference papers

Contributor : Alexei Lavrentiev
Submitted on : Tuesday, July 13, 2021 - 3:35:48 PM
Last modification on : Tuesday, January 4, 2022 - 6:10:52 AM
Long-term archiving on : Thursday, October 14, 2021 - 7:09:27 PM


Files produced by the author(s)


Distributed under a Creative Commons Attribution 4.0 International License


  • HAL Id : halshs-03285725, version 1


Alexei Lavrentiev, L Kurysheva. Лингвистическая обработка цифровых изданий русских текстов XVIII века [Linguistic processing of digital editions of 18th-century Russian texts]. Corpora 2021 International Conference, Saint-Petersburg State University, Jul 2021, Saint Petersburg, Russia. ⟨halshs-03285725⟩


