Catégorisation d'un corpus hétérogène de français médiéval

Abstract: We have undertaken the morpho-syntactic tagging of the 2 million words in our corpora of medieval texts. The external and internal heterogeneity of the texts makes this a difficult task. As a result, we had to adopt a two-part strategy.
Since there is currently no tool adapted to our corpora, we first had to rely on a programmable tagger to categorize a first text. In a second step, building on the results obtained with this first text, we produced a tagger based on contextual rule learning. Using the latter tool, we then tagged a second text that was quite "similar" in terms of external criteria. The success rate was 95%. This two-step process was then applied again to tag additional texts.
The next phase will be to evaluate the heterogeneity of the texts according to internal criteria. This task involves measuring morpho-syntactic and semantic variation with statistical methods. It will enable us to correlate internal and external heterogeneity in order to develop a "fine-grained" typology of texts.
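As an illustration of the second step described above (a tagger based on contextual rule learning), the following minimal sketch learns correction rules from an already tagged text, in the spirit of transformation-based learning. The abstract does not say which learner, rule templates, or tagset were used, so everything here is an assumption: the single rule template (change tag A to B when the preceding tag is C), the toy sentences, the tag names DET, NOM, VER and PRO, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy "first text", already tagged and checked (step one).
TRAIN = [
    [("le", "DET"), ("roi", "NOM"), ("voit", "VER")],
    [("le", "DET"), ("chevalier", "NOM"), ("vient", "VER")],
    [("il", "PRO"), ("le", "PRO"), ("voit", "VER")],
]

def baseline_lexicon(tagged_sents):
    """Map each word form to its most frequent tag in the tagged text."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def apply_rule(tags, rule):
    """Rule (a, b, prev): change tag a to b when the preceding tag is prev.
    Contexts are read from the input tags, so changes apply simultaneously."""
    a, b, prev = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == a and tags[i - 1] == prev:
            out[i] = b
    return out

def count_errors(current, gold):
    """Number of positions where the current tags differ from the reference."""
    return sum(c != g for cur, ref in zip(current, gold) for c, g in zip(cur, ref))

def learn_rules(tagged_sents, lexicon, default="NOM", max_rules=10):
    """Greedily pick, at each round, the rule that removes the most errors."""
    gold = [[t for _, t in s] for s in tagged_sents]
    current = [[lexicon.get(w, default) for w, _ in s] for s in tagged_sents]
    rules = []
    for _ in range(max_rules):
        # Candidate rules are proposed only at positions that are still wrong.
        candidates = {
            (cur[i], ref[i], cur[i - 1])
            for cur, ref in zip(current, gold)
            for i in range(1, len(cur))
            if cur[i] != ref[i]
        }
        best_rule, best_errors = None, count_errors(current, gold)
        for rule in sorted(candidates):
            trial = [apply_rule(cur, rule) for cur in current]
            e = count_errors(trial, gold)
            if e < best_errors:
                best_rule, best_errors = rule, e
        if best_rule is None:  # no candidate improves the tagging: stop
            break
        rules.append(best_rule)
        current = [apply_rule(cur, best_rule) for cur in current]
    return rules

def tag(words, lexicon, rules, default="NOM"):
    """Tag a new sentence: lexicon lookup first, then learned rules in order."""
    tags = [lexicon.get(w, default) for w in words]
    for rule in rules:
        tags = apply_rule(tags, rule)
    return tags
```

On this toy data, the lexicon alone tags every "le" as a determiner, and the learner picks up a rule that retags it as a pronoun after another pronoun. A real setting would need richer rule templates, handling of unknown word forms, and evaluation on a held-out text, none of which is attempted here.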
Sophie Prévost, Serge Heiden, Fernande Dupuis. Catégorisation d'un corpus hétérogène de français médiéval. Actes du colloque ‘JADT 2000 : 5es Journées Internationales d'Analyse Statistique des Données Textuelles’, Lausanne, 2000, p. 485-492. ⟨halshs-00087770⟩



