Conference papers

Linguistic processing of digital editions of 18th-century Russian texts (Лингвистическая обработка цифровых изданий русских текстов XVIII века)

Abstract : This paper addresses the problems of linguistic processing of 18th-century Russian texts that arose during work on digital editions of the printed translation of Al’Quran (1716) and a manuscript translation of La Belle et la Bête (The Beauty and the Beast, 1758). The linguistic processing comprises spelling normalization, tokenization, morphological markup and lemmatization. The workflow combined manual pre-markup in Microsoft Word, conversion to TEI XML, and further automatic processing on the TXM platform, including annotation with TreeTagger and the construction of a multi-layer transcription. In the Al’Quran edition, spelling normalization is fully automated but handles only the simplest cases, whereas in La Belle et la Bête manual pre-markup makes it possible to generate a modern form for every word.
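To illustrate what automated normalization of "only the simplest cases" might look like, here is a minimal sketch in Python. It is not the authors' rule set or the TXM implementation: the letter mapping is an assumption based on the well-known 1918 orthographic reform (ѣ to е, і to и, ѳ to ф, ѵ to и, word-final ъ dropped), and the names `LETTER_MAP`, `tokenize`, `normalize_token` and `normalize_text` are hypothetical.

```python
import re

# Hypothetical letter mapping, assumed from the 1918 orthographic reform;
# it is not taken from the paper.
LETTER_MAP = str.maketrans({
    "ѣ": "е", "Ѣ": "Е",
    "і": "и", "І": "И",
    "ѳ": "ф", "Ѳ": "Ф",
    "ѵ": "и", "Ѵ": "И",
})

# A hard sign at the very end of a word, preceded by another letter.
FINAL_HARD_SIGN = re.compile(r"(?<=\w)[ъЪ]\b")


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def normalize_token(token):
    """Apply letter substitutions, then drop a word-final hard sign."""
    token = token.translate(LETTER_MAP)
    return FINAL_HARD_SIGN.sub("", token)


def normalize_text(text):
    """Return (original, normalized) pairs, one per token."""
    return [(tok, normalize_token(tok)) for tok in tokenize(text)]
```

Keeping the original and normalized forms side by side mirrors the idea of a multi-layer transcription. Cases that need lexical knowledge (for example, archaic inflections) fall outside such character-level rules, which is where manual pre-markup would come in.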
Document type :
Conference papers

https://halshs.archives-ouvertes.fr/halshs-03285725
Contributor : Alexei Lavrentiev
Submitted on : Tuesday, July 13, 2021 - 3:35:48 PM
Last modification on : Wednesday, July 21, 2021 - 3:50:42 AM
Long-term archiving on : Thursday, October 14, 2021 - 7:09:27 PM

File

Lavrentiev-Kurysheva-hal.pdf
Files produced by the author(s)

Licence


Distributed under a Creative Commons Attribution 4.0 International License

Identifiers

  • HAL Id : halshs-03285725, version 1

Citation

Alexei Lavrentiev, L Kurysheva. Лингвистическая обработка цифровых изданий русских текстов XVIII века. Corpora 2021 International Conference, Saint-Petersburg State University, Jul 2021, Saint Petersburg, Russia. ⟨halshs-03285725⟩
