Building a text corpus for automatic biographical facts extraction from Russian texts

A.V. Glazkova


Tasks in computational linguistics and machine learning related to natural language processing (NLP) often require text corpora. A text corpus is a specially prepared collection of documents annotated with morphological, syntactic, semantic or other information. Data drawn from text corpora is used in supervised machine learning to build classifiers for natural-language texts and in other NLP and computational linguistics tasks. The kind of information recorded in a corpus, as well as the type of texts it contains, is determined by the aims and tasks of the particular study. This article presents a tool for building a corpus of biographical texts in Russian. Building the corpus involves two stages: collecting texts and marking them up. In the first stage we collected texts suitable for markup; specifically, we included biographical articles that are freely available on Wikipedia. For this purpose, we developed an automatic parser based on open Python libraries. The second stage is the semantic markup of sentences and the selection of biographical facts, which was carried out in a semi-automatic mode. The article describes the specifics of building a corpus of biographical facts, the taxonomy of biographical facts used in our work, the software implementation for text collection and markup, the representation of texts in the corpus, and the characteristics of the prepared corpus.
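The two stages described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the use of the `wikipedia` package from PyPI, the function names, the fact categories (`birth`, `death`, `education`) and the trigger patterns are all assumptions made for illustration, since the abstract does not specify the libraries or the taxonomy. The sketch fetches article text, splits it into sentences naively, and attaches candidate labels that a human annotator would then confirm or reject, which is one possible reading of "semi-automatic" markup.

```python
import re
from typing import Callable, Dict, List

# Hypothetical fact categories and trigger patterns (assumption: the real
# taxonomy is described in the full article, not in the abstract).
FACT_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "birth": re.compile(r"\bродил(ся|ась)\b", re.IGNORECASE),
    "death": re.compile(r"\bумер(ла)?\b|\bскончал(ся|ась)\b", re.IGNORECASE),
    "education": re.compile(r"\bокончил(а)?\b|\bучил(ся|ась)\b", re.IGNORECASE),
}


def fetch_article(title: str, lang: str = "ru") -> str:
    """Stage 1: download the plain text of one Wikipedia article.

    Assumes the third-party `wikipedia` package (pip install wikipedia);
    it is imported lazily so the rest of the sketch works offline.
    """
    import wikipedia

    wikipedia.set_lang(lang)
    return wikipedia.page(title).content


def split_sentences(text: str) -> List[str]:
    """Naive splitter: breaks after ., ! or ? followed by whitespace.

    Abbreviations such as "г." will be split incorrectly; a real pipeline
    would need a proper Russian sentence tokenizer.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def suggest_labels(sentence: str) -> List[str]:
    """Stage 2 (automatic part): propose candidate fact categories."""
    return [label for label, pattern in FACT_PATTERNS.items()
            if pattern.search(sentence)]


def build_corpus(titles: List[str],
                 fetch: Callable[[str], str] = fetch_article) -> List[dict]:
    """Collect articles and pre-annotate each sentence.

    The "confirmed" field is left as None for a human annotator to fill in,
    which is the manual half of the semi-automatic markup.
    """
    corpus = []
    for title in titles:
        sentences = split_sentences(fetch(title))
        corpus.append({
            "title": title,
            "sentences": [
                {"text": s, "candidates": suggest_labels(s), "confirmed": None}
                for s in sentences
            ],
        })
    return corpus
```

Passing the fetch function as a parameter keeps the collection stage separate from the markup stage, so the pre-annotation step can be tried on locally stored texts without network access, for example `build_corpus(["title"], fetch=lambda t: local_text)`.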






ISSN: 2307-8162