A Prediction Model for Lung Cancer Levels Based on Machine Learning

Huu-Huy Ngo, Hung Linh Le


Among cancers, lung cancer is one of the most dreaded conditions, and it is the leading cause of cancer-related deaths worldwide. Early identification and prediction help prevent and treat cancer effectively, especially when the disease is still at an early stage. Therefore, this study presents a machine-learning-based prediction model for lung cancer levels. First, a dataset is collected, and feature selection algorithms are used to identify the essential features. Second, the proposed model applies machine learning algorithms to two datasets: the full dataset and a dataset restricted to the essential features. Finally, experimental results show that the proposed system achieves 100% accuracy on the full dataset and 98.7% accuracy on the dataset of the top three essential features.
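The workflow in the abstract (select essential features, then train and compare a classifier on the full and the reduced dataset) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic stand-ins for a lung cancer dataset, the variance-ratio score and the k-nearest-neighbour classifier are simple substitutes chosen for brevity, and every name (`make_data`, `f_score`, `top_k_features`, `knn_predict`, `accuracy`, `project`) is hypothetical.

```python
import random
from collections import Counter


def make_data(n=150, seed=0):
    """Synthetic stand-in data: 3 informative and 3 noise features, 3 levels."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        level = i % 3  # hypothetical labels: 0 = low, 1 = medium, 2 = high
        informative = [rng.gauss(3.0 * level, 0.5) for _ in range(3)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(3)]
        X.append(informative + noise)
        y.append(level)
    return X, y


def f_score(column, y):
    """Univariate score: between-class over within-class sum of squares."""
    overall = sum(column) / len(column)
    groups = {}
    for value, label in zip(column, y):
        groups.setdefault(label, []).append(value)
    between, within = 0.0, 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        between += len(values) * (mean - overall) ** 2
        within += sum((v - mean) ** 2 for v in values)
    return between / within if within > 0 else float("inf")


def top_k_features(X, y, k):
    """Return the indices of the k highest-scoring feature columns."""
    scores = [f_score([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def knn_predict(train_X, train_y, x, k=5):
    """Majority vote among the k nearest training points (squared Euclidean)."""
    def dist(i):
        return sum((a - b) ** 2 for a, b in zip(train_X[i], x))
    nearest = sorted(range(len(train_X)), key=dist)[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def accuracy(train_X, train_y, test_X, test_y):
    """Fraction of test points whose predicted level matches the true level."""
    hits = sum(knn_predict(train_X, train_y, x) == t
               for x, t in zip(test_X, test_y))
    return hits / len(test_y)


def project(X, cols):
    """Keep only the selected feature columns."""
    return [[row[j] for j in cols] for row in X]


X, y = make_data()
split = 105  # 70/30 split; levels cycle 0, 1, 2, so both parts stay balanced
train_X, test_X = X[:split], X[split:]
train_y, test_y = y[:split], y[split:]

# Select features on the training part only, so the test part stays unseen.
top3 = top_k_features(train_X, train_y, 3)

acc_full = accuracy(train_X, train_y, test_X, test_y)
acc_top3 = accuracy(project(train_X, top3), train_y,
                    project(test_X, top3), test_y)

print("selected feature indices:", sorted(top3))
print("accuracy, full dataset:", acc_full)
print("accuracy, top-3 features:", acc_top3)
```

Scoring features on the training split alone is a design choice of this sketch; it keeps the held-out accuracy from being inflated by information leaking out of the test rows. The accuracies printed here describe only the synthetic data and say nothing about the figures reported in the abstract.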




ISSN: 2307-8162