Automatic Detection of Adjectival Vagueness in Russian Legal Texts: Dataset, Models, and Results

Alena E. Berlin, Olga V. Blinova

Abstract


This paper addresses the automatic classification of Russian sentences from legal documents (laws) into those with and without legal vagueness. A training dataset of 6,000 sentences annotated with vagueness loci was developed through collaboration between linguists and legal experts. The study focuses exclusively on vagueness introduced by gradable adjectives. We evaluate both classical machine learning models and transformer-based architectures. Data augmentation applied to RuBERT mitigates the class imbalance, achieving an F1-score of 0.89. Analysis of linguistic features reveals that adjectives with the negative prefix “ne-” predominantly occur in sentences without vagueness.
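The class-imbalance handling mentioned above can be illustrated with a minimal random-oversampling sketch in pure Python. This is an assumption-laden toy: the function name `oversample` and the example sentences are invented for illustration, and the paper's actual augmentation procedure for RuBERT is not reproduced here.

```python
import random
from collections import Counter

def oversample(sentences, labels, seed=0):
    # Duplicate randomly chosen minority-class examples until every
    # class has as many examples as the largest class.
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(sentences, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(v) for v in by_class.values())
    pairs = []
    for y, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        pairs.extend((s, y) for s in items + extra)
    rng.shuffle(pairs)
    xs, ys = zip(*pairs)
    return list(xs), list(ys)

# Toy imbalanced sample: 1 = vague (contains a gradable adjective), 0 = not vague.
sents = ["reasonable period", "significant harm",
         "within 30 days", "a fee of 100 rubles",
         "written notice", "under the federal law"]
labels = [1, 1, 0, 0, 0, 0]

X, y = oversample(sents, labels)
print(Counter(y))  # both classes now have 4 examples each
```

After balancing, the resampled sentence/label pairs would be fed to whatever classifier is being trained; only the training split should be oversampled, never the test data.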


Full Text: PDF (Russian)

References


F. Devos, “Semantic vagueness and lexical polyvalence,” Studia Linguistica, vol. 57, no. 3, pp. 121–141, 2003. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/j.0039-3193.2003.00101.x

F. Devos, “Still fuzzy after all these years: a linguistic evaluation of the fuzzy set approach to semantic vagueness,” Quaderni di Semantica, vol. 16, no. 1, pp. 47–82, 1995.

S. Ramotowska, J. Haaf, L. Van Maanen, and J. Szymanik, “Most quantifiers have many meanings,” Psychonomic Bulletin and Review, vol. 31, no. 6, pp. 2692–2703, Dec. 2024. [Online]. Available: https://link.springer.com/article/10.3758/s13423-024-02502-7

C. Kennedy, “Ambiguity and vagueness: An overview,” in Semantics - Lexical Structures and Adjectives, C. Maienborn, K. von Heusinger, and P. Portner, Eds. Berlin, Boston: De Gruyter Mouton, 2019, pp. 236–271, doi: 10.1515/9783110626391-008.

O. Blinova and S. Belov, “Linguistic ambiguity and vagueness in Russian legal texts,” Vestnik of Saint Petersburg University. Law, vol. 11, no. 4, pp. 774–812, Jan. 2020, doi: 10.21638/spbu14.2020.401.

C. Kennedy, “Vagueness and grammar: the semantics of relative and absolute gradable adjectives,” Linguistics and Philosophy, vol. 30, no. 1, pp. 1–45, Mar. 2007. [Online]. Available: https://link.springer.com/article/10.1007/s10988-006-9008-0

G. I. Kustova, “Adjective,” Materials for the project of corpus description of Russian grammar. [Online]. Available: http://rusgram.ru/Прилагательное#111

O. Blinova and A. Berlin, “Creating a Dataset for Automatic Detection of Vague Expressions in Russian Legal Texts,” in Internet and Modern Society, vol. 2671, Communications in Computer and Information Science. Cham: Springer Nature Switzerland, 2026, pp. 177–195, doi: 10.1007/978-3-032-04958-2_14.

U. May, K. Zaczynska, J. Moreno-Schneider, and G. Rehm, “Extraction and Normalization of Vague Time Expressions in German,” in Proc. 17th Conf. Natural Language Processing (KONVENS 2021), Düsseldorf, Germany, Sep. 2021, pp. 114–126. [Online]. Available: https://aclanthology.org/2021.konvens-1.10/

B. D. Cruz, B. Jayaraman, A. Dwarakanath, and C. McMillan, “Detecting Vague Words & Phrases in Requirements Documents in a Multilingual Environment,” in 2017 IEEE 25th Int. Requirements Engineering Conf. (RE), 2017, pp. 233–242, doi: 10.1109/RE.2017.24.

S. V. Chepovetskaya, “Linguistic vagueness in oral speech of Russian officials: A corpus study,” Bachelor's thesis, Philology Program, National Research University Higher School of Economics, St. Petersburg, Russia, 2025.

P.-H. Paris, S. E. Aoud, and F. M. Suchanek, “The Vagueness of Vagueness in Noun Phrases,” in Conference on Automated Knowledge Base Construction, 2021. doi: 10.24432/C5T884.

A. Debnath and M. Roth, “A Computational Analysis of Vagueness in Revisions of Instructional Texts,” arXiv:2309.12107, Sep. 2023. [Online]. Available: http://arxiv.org/abs/2309.12107

G. Malik, S. Yildirim, M. Cevik, and A. Bener, “An Empirical Study on Vagueness Detection in Privacy Policy Texts,” in Proc. Canadian Conf. Artificial Intelligence, 2023, doi: 10.21428/594757db.2728303d.

J. Bhatia, T. D. Breaux, J. R. Reidenberg, and T. B. Norton, “A Theory of Vagueness and Privacy Risk Perception,” in 2016 IEEE 24th International Requirements Engineering Conference (RE), 2016, pp. 26–35. doi: 10.1109/RE.2016.20.

P. Alexopoulos and J. Pavlopoulos, “A Vague Sense Classifier for Detecting Vague Definitions in Ontologies,” in Proc. 14th Conf. European Chapter of the Association for Computational Linguistics, vol. 2: Short Papers, Gothenburg, Sweden, Apr. 2014, pp. 33–37. [Online]. Available: https://aclanthology.org/E14-4007/

D. I. Kulagin, “Publicly available sentiment dictionary for the Russian language KartaSlovSent,” in Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialog” [Komp’yuternaia Lingvistika i Intellektual’nye Tekhnologii: Trudy Mezhdunarodnoj Konferentsii “Dialog”], 2021, pp. 1106–1119.

L. Lebanoff and F. Liu, “Automatic Detection of Vague Words and Sentences in Privacy Policies,” in Proc. 2018 Conf. Empirical Methods in Natural Language Processing, Brussels, Belgium, Oct.-Nov. 2018, pp. 3508–3517. [Online]. Available: https://aclanthology.org/D18-1387/

S. Wang and C. D. Manning, “Baselines and Bigrams: Simple, Good Sentiment and Topic Classification,” in Proc. 50th Annual Meeting of the Association for Computational Linguistics, vol. 2: Short Papers, Jeju Island, Korea, 2012, pp. 90–94. [Online]. Available: https://aclanthology.org/P12-2018/

C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, pp. 273–297, 1995. [Online]. Available: https://link.springer.com/article/10.1007/BF00994018

L. Breiman, “Random Forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001, doi: 10.1023/A:1010933404324.

T. Chen and C. Guestrin, “XGBoost: A Scalable Tree Boosting System,” in Proc. 22nd ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, San Francisco, CA, USA, 2016, pp. 785–794, doi: 10.1145/2939672.2939785.

L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin, “CatBoost: unbiased boosting with categorical features,” in Advances in Neural Information Processing Systems 31, S. Bengio et al., Eds. Curran Associates, Inc., 2018, pp. 6638–6648. [Online]. Available: https://papers.nips.cc/paper/2018/hash/14491b756b3a51daac41c24863285549-Abstract.html

D. H. Wolpert, “Stacked generalization,” Neural Networks, vol. 5, no. 2, pp. 241–259, 1992, doi: 10.1016/S0893-6080(05)80023-1.

Y. Kuratov and M. Arkhipov, “Adaptation of Deep Bidirectional Multilingual Transformers for Russian Language,” arXiv:1905.07213, 2019. [Online]. Available: https://arxiv.org/abs/1905.07213

G. Lemaitre, F. Nogueira, and C. K. Aridas, “imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning,” J. Mach. Learn. Res., vol. 18, no. 17, pp. 1–5, 2017.

T. Kiss and J. Strunk, “Unsupervised Multilingual Sentence Boundary Detection,” Computational Linguistics, vol. 32, no. 4, pp. 485–525, 2006, doi: 10.1162/coli.2006.32.4.485.

S. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. Harshman, “Indexing by Latent Semantic Analysis,” J. American Society for Information Science, vol. 41, no. 6, pp. 391–407, 1990, doi: 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9.

N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique,” J. Artif. Intell. Res., vol. 16, pp. 321–357, Jun. 2002, doi: 10.1613/jair.953.

T. Shavrina, A. Fenogenova, A. Emelyanov, et al., “RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark,” in Proc. 2020 Conf. Empirical Methods in Natural Language Processing (EMNLP), Online, Nov. 2020, pp. 4717–4726. [Online]. Available: https://aclanthology.org/2020.emnlp-main.381/

T. Wolf, L. Debut, V. Sanh, et al., “Transformers: State-of-the-Art Natural Language Processing,” in Proc. 2020 Conf. Empirical Methods in Natural Language Processing: System Demonstrations, Online, Nov. 2020, pp. 38–45. [Online]. Available: https://aclanthology.org/2020.emnlp-demos.6/

H. He and E. A. Garcia, “Learning from Imbalanced Data,” IEEE Trans. Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284, Sep. 2009, doi: 10.1109/TKDE.2008.239.

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proc. 2019 Conf. North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (Long and Short Papers), Minneapolis, Minnesota, 2019, pp. 4171–4186, doi: 10.18653/v1/N19-1423.

M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic Attribution for Deep Networks,” in Proc. 34th Int. Conf. Machine Learning, vol. 70, Sydney, Australia, 2017, pp. 3319–3328. [Online]. Available: https://arxiv.org/abs/1703.01365

M. Korobov, “Morphological Analyzer and Generator for Russian and Ukrainian Languages,” in Analysis of Images, Social Networks and Texts, vol. 542, D. I. Ignatov et al., Eds. Cham: Springer International Publishing, 2015, pp. 320–332, doi: 10.1007/978-3-319-26123-2_31.



ISSN: 2307-8162