Methods to improve the accuracy of machine learning algorithms while reducing the dimensionality of the data set

A.V. Vorobyev


Limited opportunities for data collection hinder the application of high-performance machine learning algorithms. Methods that improve model accuracy as the observation period is shortened can therefore be an effective tool for prediction in understudied areas. The paper examines the relationship between the dimensionality of the data set and the predictive capability of machine learning models, and determines how the number of observations affects the accuracy and robustness of models built on ensemble algorithms and regularized regression algorithms. The experiments track the change in the weighted average absolute error as the dimensionality of the data set decreases and identify the algorithms most resistant to this reduction. For regression tasks in which the target variable depends non-linearly on the predictors and the data are not strongly affected by anomalies and noise, the lower limit of applicability of ensemble algorithms for detecting regularities and building a stable model is identified. The effect of automated Bayesian hyperparameter optimization on model accuracy under a reduced data set is also considered, and the models that benefit most from preliminary hyperparameter optimization with the tree-structured Parzen estimator are identified.
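The kind of experiment described above can be illustrated with a minimal sketch. It is not the paper's code: the synthetic sine-shaped target, the noise level, the sample sizes, the toy gradient-boosted ensemble of regression stumps, the one-feature closed-form ridge regression, and the error definition (sum of absolute errors divided by the sum of absolute targets) are all assumptions chosen so that the example runs on the Python standard library alone. All function names are hypothetical.

```python
import math
import random


def weighted_abs_error(y_true, y_pred):
    """Assumed metric: sum of absolute errors over sum of absolute targets."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / sum(abs(t) for t in y_true)


def fit_ridge(xs, ys, alpha=1.0):
    """One-feature ridge regression with an unpenalized intercept (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / (sxx + alpha)
    return lambda x: my + slope * (x - mx)


def fit_stump(xs, residuals):
    """Find the single split that minimizes squared error on the residuals."""
    pairs = sorted(zip(xs, residuals))
    n = len(pairs)
    total = sum(r for _, r in pairs)
    best = (float("-inf"), None, 0.0, 0.0)  # (score, threshold, left mean, right mean)
    left_sum = 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no valid split between identical x values
        right_sum = total - left_sum
        # Maximizing this score is equivalent to minimizing the squared error.
        score = left_sum ** 2 / i + right_sum ** 2 / (n - i)
        if score > best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (score, threshold, left_sum / i, right_sum / (n - i))
    return best[1:]


def fit_boosting(xs, ys, rounds=100, lr=0.1):
    """Toy gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        thr, left, right = fit_stump(xs, residuals)
        if thr is None:  # no split available (for example, a single point)
            break
        stumps.append((thr, left, right))
        preds = [p + lr * (left if x <= thr else right) for x, p in zip(xs, preds)]
    return lambda x: base + lr * sum(l if x <= t else r for t, l, r in stumps)


def make_data(n, rng):
    """Synthetic non-linear target with mild noise (an assumption, not the paper's data)."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


def run_experiment(sizes=(400, 200, 100, 50, 25), seed=0):
    """Refit both models on progressively smaller training sets; score on a fixed test set."""
    rng = random.Random(seed)
    train_x, train_y = make_data(max(sizes), rng)
    test_x, test_y = make_data(500, rng)
    results = []
    for n in sizes:
        xs, ys = train_x[:n], train_y[:n]  # keep only the first n observations
        row = {"n": n}
        for name, fit in (("ridge", fit_ridge), ("boosting", fit_boosting)):
            model = fit(xs, ys)
            row[name] = weighted_abs_error(test_y, [model(x) for x in test_x])
        results.append(row)
    return results


if __name__ == "__main__":
    for row in run_experiment():
        print(row)
```

Each printed row gives the training-set size and the held-out error of both models, so the degradation of each algorithm can be compared as observations are removed. A hyperparameter search, such as one based on the tree-structured Parzen estimator, could be wrapped around `fit_boosting` (for instance over `rounds` and `lr`) at each size; that step is omitted here.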







ISSN: 2307-8162