Text Normalization for Social Media Corpus

Grigory Feoktistov, Dmitry Morozov

Abstract


Text markup is the process of enriching text with metalinguistic information such as lemmata, morphological tags, and syntactic relations. It is a fundamental task in computational linguistics and plays a key role in both theoretical and applied natural language processing. The scale of modern text corpora, which exceed a billion word tokens, has made automated tagging tools a necessity. Most such tools have been developed and tested on texts with standard orthography, so their performance can degrade significantly on social media texts, which are rich in non-standard usage. One way to address this problem is to normalize the text before tagging. This task is related to, but not identical with, automatic spelling correction, and it remains far less studied: normalization requires not only correcting typos and spelling errors but also expanding abbreviations, especially those common in online communication. In this article, we present a corpus of Russian-language sentences from social media paired with their normalized variants, available at https://huggingface.co/datasets/ruscorpora/normalization. Using this corpus, we compiled a list of typical speech distortions and compared the effectiveness of several text normalization methods.
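The distinction drawn above, that normalization subsumes both typo correction and abbreviation expansion, can be illustrated with a minimal sketch. The two-stage pipeline below (a lookup table for internet shorthand, then Levenshtein-distance correction against a vocabulary) is a hypothetical toy baseline for illustration only, not the method evaluated in the article; the dictionaries are invented examples.

```python
# Toy lexical-normalization sketch: (1) expand common internet abbreviations
# via a lookup table, (2) correct remaining out-of-vocabulary tokens by
# edit distance against a small vocabulary. Both tables are illustrative.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Hypothetical abbreviation table (common Russian online shorthand).
ABBREVIATIONS = {"спс": "спасибо", "пж": "пожалуйста", "др": "день рождения"}

# Hypothetical in-vocabulary word list used for typo correction.
VOCAB = ["привет", "спасибо", "пожалуйста", "сегодня"]

def normalize_token(token: str, max_dist: int = 2) -> str:
    """Expand an abbreviation, or snap a typo to the closest vocabulary word."""
    if token in ABBREVIATIONS:
        return ABBREVIATIONS[token]
    if token in VOCAB:
        return token
    best = min(VOCAB, key=lambda w: levenshtein(token, w))
    return best if levenshtein(token, best) <= max_dist else token

def normalize(sentence: str) -> str:
    """Normalize a whitespace-tokenized sentence token by token."""
    return " ".join(normalize_token(t) for t in sentence.lower().split())
```

Note that pure edit distance cannot recover "спс" → "спасибо" (the distance is too large), which is exactly why normalization needs the abbreviation stage that a conventional spellchecker lacks.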


Full Text:

PDF (Russian)






ISSN: 2307-8162