The paper discusses two approaches to the automatic lexico-grammatical tagging of Middle Russian texts (1400–1700) included in the Russian National Corpus (RNC). The task is to assign each token a part-of-speech label, a tuple of grammatical features, and a lemma (without disambiguation). Middle Russian combines features of an earlier state of the grammatical system, including aorist and imperfect verb forms, the dual number, and a number of archaic inflectional paradigms, with features of Modern Russian inflectional morphology. The lexicon shows the same mix of Old Russian and Modern Russian lemmas. Moreover, the texts can contain Church Slavonic and dialectal forms. The absence of a standardised orthography and of a standard language variant makes processing Middle Russian texts even more challenging.
The first approach is based on compiling an electronic dictionary of Old Russian and building a module to handle spelling inconsistency. In the absence of open electronic resources for Middle Russian morphology, an electronic dictionary of Church Slavonic was expanded and adapted to Middle Russian. The paper describes the steps required to change nominal and verbal entries in this dictionary. We follow the principle of «a wider expansion», under which the analyser is allowed to generate as many annotations as possible so that at least one of them is correct.
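The interplay of spelling normalisation and «a wider expansion» can be illustrated with a minimal sketch. It is not the analyser described in the paper: the letter correspondences in `LETTER_MAP`, the entries in `TOY_DICT`, and the function names `normalise` and `analyse` are all assumptions introduced for illustration only.

```python
import re

# Hypothetical letter correspondences; the actual rule set of the
# spelling module is not given in this abstract.
LETTER_MAP = str.maketrans({"ѣ": "е", "і": "и", "ѡ": "о", "ѳ": "ф", "ѕ": "з"})


def normalise(token: str) -> str:
    """Reduce spelling variation: lowercase, unify variant letters,
    and drop a word-final hard sign."""
    form = token.lower().translate(LETTER_MAP)
    return re.sub(r"ъ$", "", form)


# Toy dictionary (invented entries): normalised form -> every analysis
# as a (lemma, part of speech, grammatical features) triple.
TOY_DICT = {
    "град": [
        ("градъ", "S", ("m", "sg", "nom")),
        ("градъ", "S", ("m", "sg", "acc")),
    ],
    "бысть": [("быти", "V", ("aor", "sg", "3p"))],
}


def analyse(token: str) -> list:
    """Return all stored analyses without disambiguation; an unknown
    form yields an empty list."""
    return TOY_DICT.get(normalise(token), [])


if __name__ == "__main__":
    for word in ["Градъ", "бысть", "вѣра"]:
        print(word, normalise(word), analyse(word))
```

Returning every candidate analysis rather than a single best one mirrors the stated principle: precision is traded for the chance that the correct annotation is among the outputs.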
The second approach uses, firstly, an existing Modern Russian tagger supplemented by a module that reduces spelling variation, and secondly, a database of lexico-grammatical annotations retrieved from the Diachronic corpus of the RNC.
We evaluate the output of both analysers against manually annotated test data. We also discuss the benchmark scores and outline future prospects for the development of Middle Russian taggers.
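One way such an evaluation could be set up for an analyser that does not disambiguate is sketched below. The abstract does not state which scores the paper reports, so the two measures here (the share of tokens whose gold annotation is among the candidates, and the mean number of candidates per analysed token) and the function name `evaluate` are assumptions.

```python
def evaluate(predicted, gold):
    """Compare candidate sets with gold annotations.

    predicted: one set of candidate analyses per token
    gold:      one manually assigned analysis per token
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    # A token counts as covered if the gold analysis is among the candidates.
    covered = sum(1 for cands, ref in zip(predicted, gold) if ref in cands)
    analysed = [cands for cands in predicted if cands]
    coverage = covered / len(gold) if gold else 0.0
    ambiguity = (
        sum(len(cands) for cands in analysed) / len(analysed) if analysed else 0.0
    )
    return {"coverage": coverage, "ambiguity": ambiguity}


if __name__ == "__main__":
    # Invented toy data: (lemma, part of speech, features) triples.
    nom = ("градъ", "S", "sg,nom")
    acc = ("градъ", "S", "sg,acc")
    aor = ("быти", "V", "aor,sg,3p")
    scores = evaluate([{nom, acc}, {aor}, set()], [nom, aor, acc])
    print(scores)
```

Reporting both numbers together shows the cost of a wide expansion: coverage can rise while the number of competing annotations per token grows.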
Middle Russian, Russian National Corpus, lexico-grammatical tagging, morphological analysis, grammatical dictionary, spelling variation, nominal inflection, verb inflection.
Gavrilova Tat'iana