Volume 11, Issue 1, PP: 44-54, 2024 | Full Length Article
Firuz Kamalov 1*, Said Elnaffar 2, Aswani Cherukuri 3, Annapurna Jonnalagadda 4
Doi: https://doi.org/10.54216/JISIoT.110105
Feature selection is an important preprocessing step in many data science and machine learning applications. Although several sophisticated feature selection algorithms exist, their benefits are sometimes overshadowed by their complexity and slow execution, so in many cases a simpler algorithm is better suited. In this paper, we demonstrate that a rudimentary forward selection algorithm can achieve optimal performance with low time complexity. Our study is based on an extensive empirical evaluation of the forward feature selection algorithm in the context of linear regression. Concretely, we compare the forward selection algorithm against the gold-standard exhaustive search algorithm on several datasets. The results show that the forward selection algorithm achieves high performance with relatively fast execution. Given its simplicity, accuracy, and speed, we recommend the forward feature selection algorithm as a primary feature selection method for most regression applications. Our results are particularly pertinent to big data and real-time analysis.
Keywords: data transformation, data mining, standardization
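For illustration, the following is a minimal sketch of a greedy forward selection loop for linear regression of the kind described in the abstract, assuming scikit-learn is available. The synthetic dataset, cross-validation setup, variable names, and stopping rule are assumptions made for this example only and do not reproduce the paper's datasets or evaluation protocol.

```python
# Minimal sketch of greedy forward feature selection for linear regression.
# Assumes scikit-learn; dataset, CV setup, and stopping rule are illustrative
# assumptions, not the authors' implementation.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def forward_selection(X, y, max_features=None):
    """Greedily add the feature that most improves cross-validated R^2."""
    n_features = X.shape[1]
    max_features = max_features or n_features
    selected, remaining = [], list(range(n_features))
    best_score = -np.inf

    while remaining and len(selected) < max_features:
        # Score every candidate feature when added to the current subset.
        scores = [
            (np.mean(cross_val_score(LinearRegression(),
                                     X[:, selected + [j]], y, cv=5)), j)
            for j in remaining
        ]
        score, j = max(scores)
        if score <= best_score:  # stop when no candidate improves the score
            break
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return selected, best_score


X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
subset, score = forward_selection(X, y)
print(f"Selected features: {subset}, CV R^2: {score:.3f}")
```

At each step the loop fits one model per remaining feature, so on the order of d^2 models are evaluated for d features, compared with the 2^d candidate subsets examined by an exhaustive search; this gap is the source of the speed advantage noted in the abstract.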