This paper discusses the problem of heterogeneous orthography in Middle Russian texts from the standpoint of their automatic processing. The Middle Russian subcorpus of the Russian National Corpus contains documents written mainly between 1400 and 1700, when spelling variation was still widespread. The task of lexico-grammatical analysis is to assign a dictionary form (lemma), a part-of-speech label and grammatical tags to each word form in the corpus. Traditional methods of grammatical tagging rely on the assumption that there is usually only one string of characters that represents the stem and the ending of each grammatical word form. Consequently, heterogeneous orthography causes errors in automatic morphological analysers (taggers) unless they are equipped with a module that supports orthographic variation. In this project, both relative and absolute normalisation are used. Relative normalisation involves multiplying the orthographic representations of stems and endings in the grammatical dictionary according to standard rules. This is carried out at the level of (a) word endings; (b) nominative stems with regular variation, e.g. russk(ij) / russt(ij), keli(ja) / kel’(ja); (c) nominative stems of Church Slavonic origin, e.g. odin- / edin-; (d) verb stems with prefixes, etc. Absolute normalisation matches characters (or character combinations) that alternate regularly in the corpus (e.g. о/ѡ ‘omega’, е/ѣ, шт/щ, жю/жу). Absolute normalisation is applied both to orthographic representations in the grammatical dictionary and to word forms in the text.
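The interplay of the two kinds of normalisation can be illustrated with a minimal Python sketch. It is not the project's actual implementation: the function names, the toy dictionary entries, the endings, and the direction of each character mapping (which variant counts as canonical) are all assumptions made for illustration. Only the alternating pairs and the stem variants russk-/russt- and odin-/edin- (written here in Cyrillic) come from the description above.

```python
from itertools import product

# Absolute normalisation: character pairs that alternate in the corpus.
# ASSUMPTION: the second member of each pair is treated as canonical.
ABSOLUTE_RULES = [("шт", "щ"), ("жю", "жу"), ("ѡ", "о"), ("ѣ", "е")]


def normalise_absolute(form):
    """Collapse regularly alternating characters to one representation."""
    form = form.lower()
    for variant, canonical in ABSOLUTE_RULES:
        form = form.replace(variant, canonical)
    return form


def expand_relative(stems, endings):
    """Relative normalisation: multiply stem variants by ending variants."""
    return {stem + ending for stem, ending in product(stems, endings)}


def build_index(entries):
    """Map every normalised dictionary form to the lemmas it may belong to."""
    index = {}
    for lemma, stems, endings in entries:
        for form in expand_relative(stems, endings):
            index.setdefault(normalise_absolute(form), set()).add(lemma)
    return index


# Hypothetical toy dictionary: (lemma, stem variants, endings).
ENTRIES = [
    ("русский", ["русск", "русст"], ["ий", "ого"]),
    ("один", ["один", "един"], ["", "ого"]),
]
INDEX = build_index(ENTRIES)


def lookup(token):
    """Normalise a corpus token the same way as the dictionary, then look it up."""
    return INDEX.get(normalise_absolute(token), set())
```

Because the same absolute normalisation is applied to the dictionary and to the text, a token such as `ѡдиного` and a dictionary form such as `одиного` meet on a shared key, while relative normalisation ensures that a form built on the variant stem `русст-` is still attributed to the lemma `русский`.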
Middle Russian, Old Russian, Russian National Corpus, lexico-grammatical tagging, morphological analysis, spelling variation, unstable orthography, orthographic normalisation