Over-Under Sampling Approach with Adaptive Synthetic and Tomek Links Methods to Handle Data Imbalance in Sentence Classification on Halal Assurance Certificate Documents

Dadang; Rahmat; R. Rizal

doi:https://doi.org/10.54216/FPA.190215

Full Length Article

Volume 19 • Issue 2 • PP: 194-210 • 2025

Dadang Heksaputra¹,*, Rahmat Gernowo², R. Rizal Isnanto²

¹Doctoral Program of Information System School of Postgraduate Studies, Diponegoro University, Semarang, Indonesia; Faculty of Computer and Engineering, Department of Information System, Alma

²Doctoral Program of Information System School of Postgraduate Studies, Diponegoro University, Semarang, Indonesia

* Corresponding Author.

Open Access & Copyright

Received: December 23, 2024 Revised: February 15, 2025 Accepted: March 05, 2025

Abstract

Data imbalance is a common problem in machine learning, particularly in classification, where examples of a dominant class outnumber examples of a minority class many times over. This imbalance prevents a model from learning meaningful patterns for the minority class and therefore reduces model performance, especially in terms of Recall and F1-Score. This work addresses data imbalance in a sentence classification model for halal assurance certificate documents by combining an over-sampling technique, Adaptive Synthetic (ADASYN), with an under-sampling technique, Tomek Links. Text classification is used to identify sentences concerning halal assurance in these documents; because misclassifying such sentences is undesirable, it is important that no information about a halal product is missed. The over-sampling techniques considered are SMOTE, Borderline SMOTE, ADASYN, and SMOTENC, and the under-sampling techniques are Random Under-Sampler, Near Miss, and Tomek Links. In the comparison, ADASYN yields the best performance in terms of Accuracy (0.759), F1-Score (0.748), Recall (0.759), and Precision (0.768). In our use case, ADASYN + Tomek Links is effective: recall matters when classifying halal assurance certificate documents because relevant sentences must not be missed. The proposed approach improves the accuracy of halal-related sentence identification and can be adopted in halal product checking systems in industries with halal requirements.

Keywords: Data Imbalance; Halal Assurance Documents; Adaptive Synthetic (ADASYN); Tomek Links; Text Classification; Halal Information Systems
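
To make the over-under sampling idea in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a binary problem, numeric feature vectors (for example, produced by some text vectorizer), and that ADASYN is applied first and Tomek Links second; the abstract does not state the order. The toy data, the labels `1` (minority) and `0` (majority), and the helper names `nearest`, `adasyn`, and `remove_tomek_links` are all hypothetical.

```python
import math
import random


def nearest(idx, points, k, candidates=None):
    """Indices of the k points closest to points[idx], excluding idx itself."""
    pool = candidates if candidates is not None else range(len(points))
    others = [j for j in pool if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]


def adasyn(X, y, minority, k=3, seed=0):
    """ADASYN-style over-sampling for two classes (assumes >= 2 minority points)."""
    rng = random.Random(seed)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    # Number of synthetic points needed to reach a 1:1 balance.
    n_needed = (len(y) - len(min_idx)) - len(min_idx)
    # For each minority point: share of majority points among its k neighbours.
    ratios = [sum(y[j] != minority for j in nearest(i, X, k)) / k for i in min_idx]
    total = sum(ratios)
    X_new, y_new = list(X), list(y)
    if total == 0 or n_needed <= 0:
        return X_new, y_new
    for i, r in zip(min_idx, ratios):
        # Harder points (more majority neighbours) receive more synthetic samples.
        for _ in range(round(r / total * n_needed)):
            j = rng.choice(nearest(i, X, k, candidates=min_idx))
            lam = rng.random()
            # Interpolate between the point and one of its minority neighbours.
            X_new.append(tuple(a + lam * (b - a) for a, b in zip(X[i], X[j])))
            y_new.append(minority)
    return X_new, y_new


def remove_tomek_links(X, y, minority):
    """Drop the majority member of every Tomek link (mutual nearest
    neighbours that carry different labels)."""
    nn = [nearest(i, X, 1)[0] for i in range(len(X))]
    drop = set()
    for i, j in enumerate(nn):
        if nn[j] == i and y[i] != y[j]:
            drop.add(i if y[i] != minority else j)
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: 8 majority (0) and 3 minority (1) feature vectors.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1), (0, 2), (1, 2),
     (3, 3), (3.5, 3), (2, 2)]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

X_over, y_over = adasyn(X, y, minority=1)
X_bal, y_bal = remove_tomek_links(X_over, y_over, minority=1)
print("original:", y.count(0), "majority /", y.count(1), "minority")
print("after ADASYN:", y_over.count(0), "majority /", y_over.count(1), "minority")
print("after Tomek Links:", y_bal.count(0), "majority /", y_bal.count(1), "minority")
```

In this sketch the over-sampling step only adds minority points and the Tomek Links step only removes majority points that sit right next to a minority point, which is one common way to combine the two. A library such as imbalanced-learn would normally be used in practice instead of hand-written helpers.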

Dadang Heksaputra^{1, 2, 3,*}, Rahmat Gernowo¹, R. Rizal Isnanto¹

¹Doctoral Program of Information System School of Postgraduate Studies, Diponegoro University, Semarang, Indonesia

²Faculty of Computer and Engineering, Department of Information System, Alma Ata University, Yogyakarta, Indonesia

³Alma Ata Center for Medical Informatics, Alma Ata University, Yogyakarta, Indonesia

Emails: dadang@almaata.ac.id; rahmatgernowo@lecturer.undip.ac.id; rizal_isnanto@yahoo.com

Cite This Article

Heksaputra, Dadang, Gernowo, Rahmat, Isnanto, R. Rizal. "Over-Under Sampling Approach with Adaptive Synthetic and Tomek Links Methods to Handle Data Imbalance in Sentence Classification on Halal Assurance Certificate Documents." Fusion: Practice and Applications, vol. 19, no. 2, 2025, pp. 194-210. DOI: https://doi.org/10.54216/FPA.190215

Heksaputra, D., Gernowo, R., Isnanto, R. (2025). Over-Under Sampling Approach with Adaptive Synthetic and Tomek Links Methods to Handle Data Imbalance in Sentence Classification on Halal Assurance Certificate Documents. Fusion: Practice and Applications, 19(2), 194-210. DOI: https://doi.org/10.54216/FPA.190215

Heksaputra, Dadang, Gernowo, Rahmat, Isnanto, R. Rizal. "Over-Under Sampling Approach with Adaptive Synthetic and Tomek Links Methods to Handle Data Imbalance in Sentence Classification on Halal Assurance Certificate Documents." Fusion: Practice and Applications 19, no. 2 (2025): 194-210. DOI: https://doi.org/10.54216/FPA.190215

Heksaputra, D., Gernowo, R., Isnanto, R. (2025) 'Over-Under Sampling Approach with Adaptive Synthetic and Tomek Links Methods to Handle Data Imbalance in Sentence Classification on Halal Assurance Certificate Documents', Fusion: Practice and Applications, 19(2), pp. 194-210. DOI: https://doi.org/10.54216/FPA.190215

Heksaputra D, Gernowo R, Isnanto R. Over-Under Sampling Approach with Adaptive Synthetic and Tomek Links Methods to Handle Data Imbalance in Sentence Classification on Halal Assurance Certificate Documents. Fusion: Practice and Applications. 2025;19(2):194-210. DOI: https://doi.org/10.54216/FPA.190215

D. Heksaputra, R. Gernowo, R. Isnanto, "Over-Under Sampling Approach with Adaptive Synthetic and Tomek Links Methods to Handle Data Imbalance in Sentence Classification on Halal Assurance Certificate Documents," Fusion: Practice and Applications, vol. 19, no. 2, pp. 194-210, 2025. DOI: https://doi.org/10.54216/FPA.190215

Publisher's Note

The statements, opinions, and data presented in this article are solely those of the author(s) and do not necessarily represent those of ASPG, the journal, or its editors. ASPG and the editors disclaim responsibility for any harm arising from the use of any ideas, methods, instructions, or products described in this article, to the fullest extent permitted by applicable law.