Anomaly detection in real-time streaming data processing

D.E. Savitsky, M.E. Dunaev, K.S. Zaytsev

Abstract


The purpose of this work is to study methods for detecting anomalies in the real-time processing of distributed data streams. To this end, the authors modified the K-Means algorithm, calling the result real-time K-Means, and compared its effectiveness with that of K-Means from the MLlib library of the Apache Spark framework. The comparison confirmed the effectiveness of the proposed modification. For the experiments, a dedicated dataset was built from about 1000 measurements of log metrics from an Apache Kafka server with one topic, two producers, and one consumer. Anomalous fragments were added to this dataset, characterized by an abnormally large number of messages in a short time and/or an abnormally large message size. The dataset values were pre-processed to bring the metrics to a common scale and to remove correlations. The results demonstrate that the K-Means algorithm developed by the authors is effective for anomaly detection when detection time is taken into account.
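
The abstract does not specify how the real-time modification works, so the following is only a minimal sketch of one common way to apply K-Means to a stream: sequential (MacQueen-style) centroid updates combined with a distance threshold for flagging anomalies. It is not the authors' algorithm. The class name `OnlineKMeans`, the threshold value, the initial centroids, and the sample metric values are all hypothetical and chosen purely for illustration.

```python
import math


class OnlineKMeans:
    """Sequential k-means with a distance-based anomaly flag (illustrative only)."""

    def __init__(self, initial_centroids, threshold):
        # Assumption: centroids are seeded in advance, e.g. from a batch run.
        self.centroids = [list(c) for c in initial_centroids]
        self.counts = [1] * len(self.centroids)
        self.threshold = threshold  # hypothetical cut-off distance

    def _nearest(self, point):
        """Return the index of the closest centroid and the distance to it."""
        distances = [math.dist(point, c) for c in self.centroids]
        index = min(range(len(distances)), key=distances.__getitem__)
        return index, distances[index]

    def process(self, point):
        """Handle one streamed point; return True if it is flagged as anomalous."""
        index, distance = self._nearest(point)
        if distance > self.threshold:
            return True  # flagged points do not update the model
        # Incremental mean update of the winning centroid.
        self.counts[index] += 1
        rate = 1.0 / self.counts[index]
        centroid = self.centroids[index]
        for d, value in enumerate(point):
            centroid[d] += rate * (value - centroid[d])
        return False


# Hypothetical normalized metrics: (messages per second, mean message size).
stream = [(0.10, 0.20), (0.12, 0.22), (0.80, 0.75), (0.11, 0.19), (5.00, 4.00)]
detector = OnlineKMeans(initial_centroids=[(0.1, 0.2), (0.8, 0.8)], threshold=0.5)
flags = [detector.process(p) for p in stream]
print(flags)  # only the last point lies far from both centroids
```

Because each point is processed once and only the nearest centroid is updated, the per-point cost stays constant, which is the property that makes this family of updates attractive when detection time matters. Leaving flagged points out of the update is a design choice of this sketch, intended to keep bursts from dragging centroids toward anomalous regions.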






ISSN: 2307-8162