Accuracy analysis of machine learning models using vectorization methods for heterogeneous text data classification tasks

A.N. Alpatov; K.S. Popov; A.N. Chesalin

Abstract

This paper addresses natural language processing with machine learning, specifically the classification of unstructured, heterogeneous text data sets. It presents a comparative analysis of several widely used supervised machine learning methods and models for multi-class classification of heterogeneous text sources, combined with different feature extraction methods. The paper examines how the class prediction accuracy of the classifiers depends on the quality of the text corpora used when different vectorization methods are applied to the preprocessed source data. Based on this analysis, a generalized scheme of the software is proposed: it implements the algorithm for building a classification model for unstructured texts as a pipeline that processes the text corpus and manages the machine learning models. The experiment showed that classifier prediction accuracy differed between corpora whose source text data differed in quality: the classifiers performed worse on the corpus of song lyrics and better on the corpus of news summaries. It is also shown that, under certain conditions, techniques intended to improve classification quality, such as stacking and adding extra classification features, can degrade class prediction rather than improve it, which ultimately lowers the accuracy of the resulting model.
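The pipeline idea described in the abstract (preprocess the corpus, vectorize it, train a classifier, compare accuracy across vectorization methods) can be illustrated with a minimal sketch. This is not the authors' implementation: every function name, the two toy corpora, and the nearest-centroid classifier are assumptions made for illustration, and `zlib.crc32` stands in for whatever hash function a production hashing vectorizer would use.

```python
import math
import re
import zlib
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def count_vector(tokens, vocabulary):
    """Bag of words: one dimension per vocabulary word, unknown words ignored."""
    counts = Counter(tokens)
    return [float(counts[word]) for word in vocabulary]


def hashed_vector(tokens, n_features=32):
    """Hashing trick: each token increments a bucket; no vocabulary is stored."""
    vec = [0.0] * n_features
    for tok in tokens:
        vec[zlib.crc32(tok.encode("utf-8")) % n_features] += 1.0
    return vec


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class CentroidClassifier:
    """Hypothetical stand-in classifier: picks the class whose summed
    training vector is closest by cosine similarity. Sums are used instead
    of means because cosine similarity ignores vector length."""

    def fit(self, vectors, labels):
        self.centroids = {}
        for vec, label in zip(vectors, labels):
            acc = self.centroids.setdefault(label, [0.0] * len(vec))
            for i, value in enumerate(vec):
                acc[i] += value
        return self

    def predict(self, vec):
        return max(self.centroids, key=lambda c: cosine(vec, self.centroids[c]))


def evaluate(train, test, vectorize):
    """Run tokenize -> vectorize -> fit -> predict and return test accuracy."""
    model = CentroidClassifier().fit(
        [vectorize(tokenize(text)) for text, _ in train],
        [label for _, label in train],
    )
    hits = sum(model.predict(vectorize(tokenize(text))) == label
               for text, label in test)
    return hits / len(test)


# Toy data invented for this sketch; it is not taken from the paper's corpora.
TRAIN = [
    ("stocks fell as markets closed lower", "news"),
    ("government announces new budget plan", "news"),
    ("baby I love you forever tonight", "lyrics"),
    ("dance with me all night long", "lyrics"),
]
TEST = [
    ("markets react to budget plan", "news"),
    ("love me tonight baby", "lyrics"),
]

vocabulary = sorted({tok for text, _ in TRAIN for tok in tokenize(text)})
count_acc = evaluate(TRAIN, TEST, lambda toks: count_vector(toks, vocabulary))
hash_acc = evaluate(TRAIN, TEST, hashed_vector)
print(f"count vectorizer accuracy: {count_acc:.2f}")
print(f"hashing vectorizer accuracy: {hash_acc:.2f}")
```

Because `evaluate` takes the vectorizer as a parameter, the same corpus and classifier can be rerun with each vectorization method, which is the kind of comparison the abstract describes. With a small `n_features`, bucket collisions in `hashed_vector` may merge unrelated words, so its accuracy can differ from the vocabulary-based variant.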

Full Text:

PDF (Russian)

ISSN: 2307-8162

International Journal of Open Information Technologies