Volume 7, Issue 1, pp. 38-52, 2024 | Full Length Article
Faris H. Rizk 1*, Ahmed Saleh 2, Abdulrhman Elgaml 3, Ahmed Elsakaan 4, Ahmed Mohamed Zaki 5
DOI: https://doi.org/10.54216/JAIM.070103
Student-centered analysis of academic performance is central to improving education, as it helps identify which measures work best, which individualized learning approaches are effective, and where intervention programs are needed. In this study, we performed a detailed analysis of the "Students Performance in Exams" dataset and applied different regression methods to estimate students' grades. We sought to assess model performance across several metrics and determine an optimal model for this task. Our descriptive analysis identified meaningful trends within the dataset, which includes central factors such as gender, race/ethnicity, parental level of education, lunch type, and test preparation course, alongside math, reading, and writing scores. We evaluated a wide range of regression models, including XGBoost, CatBoost, and GradientBoostingRegressor. Metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Average Marginal Loss were used to assess each model rigorously. Notably, the XGBoost model achieved an MSE of 0.028, the lowest among all models considered, and its superiority was supported by strong performance across the other metrics. This work can inform educational practitioners and policymakers about accurate and realistic models for predicting student outcomes. Educational data analytics incorporating the XGBoost model can be used to customize interventions and guide resource allocation while promoting a data-driven, results-oriented approach in education. This study contributes to the growing body of knowledge on educational data analytics and can serve as a foundation for further research aimed at improving predictive models of student performance.
Predictive Modeling, Education Analytics, Regression Models, Students Performance, Descriptive Analysis, XGBoost
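As a rough illustration of the modeling workflow described in the abstract, the sketch below trains an XGBoost regressor on the categorical features of the public "Students Performance in Exams" dataset and reports MSE and RMSE. The file name, target construction (mean of the three exam scores scaled to [0, 1]), and hyperparameters are assumptions for illustration, not the authors' exact pipeline.

```python
# Minimal sketch (assumptions: a local StudentsPerformance.csv with the public
# Kaggle column names; the paper's exact preprocessing and target are not specified).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBRegressor

df = pd.read_csv("StudentsPerformance.csv")

# Assumed target: mean of the three exam scores, scaled to [0, 1].
df["avg_score"] = df[["math score", "reading score", "writing score"]].mean(axis=1) / 100.0

categorical = ["gender", "race/ethnicity", "parental level of education",
               "lunch", "test preparation course"]
X, y = df[categorical], df["avg_score"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# One-hot encode the categorical inputs, then fit a gradient-boosted regressor.
model = Pipeline([
    ("encode", ColumnTransformer([("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)])),
    ("xgb", XGBRegressor(n_estimators=300, learning_rate=0.05, max_depth=4)),
])
model.fit(X_train, y_train)

pred = model.predict(X_test)
mse = mean_squared_error(y_test, pred)
print(f"MSE:  {mse:.4f}")
print(f"RMSE: {mse ** 0.5:.4f}")
```

Other regressors mentioned in the abstract, such as CatBoost or GradientBoostingRegressor, can be swapped into the same pipeline to reproduce the kind of multi-model comparison described.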