Gavrilova Tat'iana

Взялъ, възялъ, вьзял: Processing Orthographic Variation in Lexico-Grammatical Annotation of the Middle Russian Corpus of 15th–17th Centuries

Gavrilova Tat'iana, Shalganova Tat'iana, Liashevskaia Ol'ga (2017) "Vziala, vaziala, vyzial: Processing Orthographic Variation in Lexico-Grammatical Annotation of the Middle Russian Corpus of 15th–17th Centuries", Vestnik Pravoslavnogo Sviato-Tikhonovskogo gumanitarnogo universiteta. Seriia III: Filologiia, 2017, vol. 51, pp. 11-20 (in Russian).

DOI of the paper: 10.15382/sturIII201751.11-20


This paper discusses the problem that heterogeneous orthography in Middle Russian texts poses for their automatic processing. The Middle Russian subcorpus of the Russian National Corpus contains documents written mainly between 1400 and 1700, when spelling variation was still widespread. The task of lexico-grammatical analysis is to assign a dictionary form (lemma), a part-of-speech label and grammatical tags to each word form in the corpus. Traditional methods of grammatical tagging rely on the assumption that there is usually only one string of characters that represents the stem and the ending of each grammatical word form. Because of this, heterogeneous orthography causes errors in automatic morphological analysers (taggers) unless they are provided with a module that supports orthographic variation. In this project, both relative and absolute normalisation are used. Relative normalisation involves multiplying the orthographic representations of stems and endings in the grammatical dictionary according to standard rules. This is carried out on the level of (a) word endings; (b) nominative stems with regular variation, e.g. russk(ij) / russt(ij), keli(ja) / kel’(ja); (c) nominative stems of Church Slavonic origin, e.g. odin- / edin-; (d) verb stems with prefixes, etc. Absolute normalisation matches characters (or character combinations) that alternate regularly in the corpus (e.g. о/ѡ ‘omega’, е/ѣ, шт/щ, жю/жу). Absolute normalisation is applied both to orthographic representations in the grammatical dictionary and to word forms in the text.
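The two normalisation steps described in the abstract can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the project's actual tagger module: the function names, the `ENTRIES` dictionary layout and the modern-spelling lemma headwords are hypothetical, the Cyrillic stems are back-transliterations of the examples given above, and the choice of which character in each alternating pair is treated as canonical is arbitrary.

```python
from itertools import product

# Character alternations listed in the abstract. Which member of each pair
# counts as canonical is an assumption made for this sketch.
ABSOLUTE_RULES = [("ѡ", "о"), ("ѣ", "е"), ("шт", "щ"), ("жю", "жу")]


def absolute_normalise(form):
    """Collapse regularly alternating characters to one representative."""
    form = form.lower()
    for variant, canonical in ABSOLUTE_RULES:
        form = form.replace(variant, canonical)
    return form


def relative_expand(stem_variants, ending_variants):
    """Multiply stem and ending spellings into every combined form."""
    return {stem + ending for stem, ending in product(stem_variants, ending_variants)}


def build_index(entries):
    """Map each normalised spelling to its lemma.

    `entries` is a hypothetical dictionary layout:
    lemma -> (stem variants, ending variants).
    """
    index = {}
    for lemma, (stems, endings) in entries.items():
        for form in relative_expand(stems, endings):
            # Absolute normalisation is applied on the dictionary side ...
            index[absolute_normalise(form)] = lemma
    return index


def lookup(index, word_form):
    """... and on the text side, so both meet in the same spelling."""
    return index.get(absolute_normalise(word_form))


# Stem pairs taken from the abstract's examples (back-transliterated);
# the lemma headwords and single endings are illustrative only.
ENTRIES = {
    "русский": (["русск", "русст"], ["ий"]),
    "келья": (["кели", "кель"], ["я"]),
    "один": (["один", "един"], [""]),
}
INDEX = build_index(ENTRIES)
```

With this index, `lookup(INDEX, "русстий")` and `lookup(INDEX, "ѡдин")` resolve to the same lemmas as the standard spellings: the first through the multiplied stem variant, the second through the character mapping. Keeping the two steps separate mirrors the division in the abstract: relative rules enlarge the dictionary, while absolute rules shrink the alphabet on both sides.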


Middle Russian, Old Russian, Russian National Corpus, lexico-grammatical tagging, morphological analysis, spelling variation, unstable orthography, orthographic normalisation


