Fusion: Practice and Applications

Journal DOI

https://doi.org/10.54216/FPA

ISSN (Online): 2692-4048 | ISSN (Print): 2770-0070

Volume 19, Issue 2, pp. 194-210, 2025 | Full Length Article

Over-Under Sampling Approach with Adaptive Synthetic and Tomek Links Methods to Handle Data Imbalance in Sentence Classification on Halal Assurance Certificate Documents

Dadang Heksaputra 1,*, Rahmat Gernowo 2, R. Rizal Isnanto 3

  • 1 Doctoral Program of Information System School of Postgraduate Studies, Diponegoro University, Semarang, Indonesia; Faculty of Computer and Engineering, Department of Information System, Alma Ata University, Yogyakarta, Indonesia; Alma Ata Center for Medical Informatics, Alma Ata University, Yogyakarta, Indonesia - (dadang@almaata.ac.id)
  • 2 Doctoral Program of Information System School of Postgraduate Studies, Diponegoro University, Semarang, Indonesia - (rahmatgernowo@lecturer.undip.ac.id)
  • 3 Doctoral Program of Information System School of Postgraduate Studies, Diponegoro University, Semarang, Indonesia - (rizal_isnanto@yahoo.com)
  • Doi: https://doi.org/10.54216/FPA.190215

    Received: December 23, 2024 Revised: February 15, 2025 Accepted: March 05, 2025
    Abstract

    Data imbalance is a common problem in machine learning classification: examples of a dominant class outnumber those of a minority class many times over. This prevents a model from learning meaningful patterns for the minority class and reduces performance, particularly Recall and F1-Score. This work addresses data imbalance in a sentence classification model for halal assurance certificate documents by combining over-sampling and under-sampling techniques, namely Adaptive Synthetic (ADASYN) and Tomek Links. Text classification is used to identify sentences concerning halal assurance in these documents; because misclassifying such sentences is undesirable, it is important that no information about a halal product is missed. The over-sampling techniques considered are SMOTE, Borderline SMOTE, ADASYN, and SMOTENC, and the under-sampling techniques are Random Under-Sampler, Near Miss, and Tomek Links. In the comparison, ADASYN yields the best performance: Accuracy 0.759, F1-Score 0.748, Recall 0.759, and Precision 0.768. For this use case, ADASYN + Tomek Links is effective, since recall matters when classifying halal assurance certificate documents and no relevant sentence should be missed. The proposed approach improves the accuracy of halal-related sentence identification and can be adopted in halal product checking systems in industries with halal requirements.
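To make the over-under sampling idea concrete, the following is a minimal sketch of ADASYN followed by Tomek Links removal on a toy two-class dataset. It is not the authors' implementation: the function names (`nearest`, `adasyn`, `tomek_link_removals`, `adasyn_tomek`), the 2-D points, and the parameters `k`, `beta`, and `seed` are all hypothetical, and real sentence classification would operate on text feature vectors rather than hand-picked coordinates. The sketch assumes a binary problem, at least two minority samples, and a Tomek variant that removes only the majority member of each link.

```python
import math
import random


def nearest(idx, points, k, candidates=None):
    """Indices of the k points closest to points[idx], excluding idx itself."""
    pool = range(len(points)) if candidates is None else candidates
    others = [j for j in pool if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]


def adasyn(X, y, minority, k=3, beta=1.0, seed=0):
    """Synthetic minority points, allocated toward hard-to-learn regions."""
    rng = random.Random(seed)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    # Binary assumption: majority count minus minority count, scaled by beta.
    n_needed = int(beta * (len(y) - 2 * len(min_idx)))
    # Difficulty of a minority point: share of its k neighbours from the other class.
    ratios = [sum(y[j] != minority for j in nearest(i, X, k)) / k for i in min_idx]
    total = sum(ratios)
    if total == 0 or n_needed <= 0 or len(min_idx) < 2:
        return []
    synthetic = []
    for i, r in zip(min_idx, ratios):
        for _ in range(round(r / total * n_needed)):
            # Interpolate between the point and a random minority neighbour.
            j = rng.choice(nearest(i, X, k, candidates=min_idx))
            lam = rng.random()
            synthetic.append(tuple(a + lam * (b - a) for a, b in zip(X[i], X[j])))
    return synthetic


def tomek_link_removals(X, y, majority):
    """Indices of majority points that are mutual nearest neighbours of a minority point."""
    nn = [nearest(i, X, 1)[0] for i in range(len(X))]
    return {i for i, j in enumerate(nn)
            if nn[j] == i and y[i] != y[j] and y[i] == majority}


def adasyn_tomek(X, y, minority, majority, k=3, seed=0):
    """Over-sample with ADASYN, then drop majority members of Tomek links."""
    new_pts = adasyn(X, y, minority, k=k, seed=seed)
    X2 = list(X) + new_pts
    y2 = list(y) + [minority] * len(new_pts)
    drop = tomek_link_removals(X2, y2, majority)
    keep = [i for i in range(len(X2)) if i not in drop]
    return [X2[i] for i in keep], [y2[i] for i in keep]


# Toy data: 8 majority points on a grid, 3 minority points,
# one of which (2.2, 1.2) sits right next to the majority point (2, 1).
X = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1), (0, 2), (1, 2),
     (3, 3), (3, 4), (2.2, 1.2)]
y = ["maj"] * 8 + ["min"] * 3

X_res, y_res = adasyn_tomek(X, y, minority="min", majority="maj")
print("before:", y.count("maj"), "maj /", y.count("min"), "min")
print("after: ", y_res.count("maj"), "maj /", y_res.count("min"), "min")
```

In this sketch most synthetic points are generated around the minority point that is surrounded by majority neighbours, which is the adaptive part of ADASYN; the Tomek step then thins majority points that sit on the class boundary. Removing both members of a link instead of only the majority one is an equally valid variant.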

    Keywords:

    Data Imbalance, Halal Assurance Documents, Adaptive Synthetic (ADASYN), Tomek Links, Text Classification, Halal Information Systems

    Cite This Article As:
    Heksaputra, Dadang, Gernowo, Rahmat, Isnanto, R. Rizal. Over-Under Sampling Approach with Adaptive Synthetic and Tomek Links Methods to Handle Data Imbalance in Sentence Classification on Halal Assurance Certificate Documents. Fusion: Practice and Applications, vol. 19, no. 2, 2025, pp. 194-210. DOI: https://doi.org/10.54216/FPA.190215
    Heksaputra, D., Gernowo, R., Isnanto, R. R. (2025). Over-Under Sampling Approach with Adaptive Synthetic and Tomek Links Methods to Handle Data Imbalance in Sentence Classification on Halal Assurance Certificate Documents. Fusion: Practice and Applications, 19(2), 194-210. DOI: https://doi.org/10.54216/FPA.190215
    Heksaputra, Dadang, Gernowo, Rahmat, Isnanto, R. Rizal. Over-Under Sampling Approach with Adaptive Synthetic and Tomek Links Methods to Handle Data Imbalance in Sentence Classification on Halal Assurance Certificate Documents. Fusion: Practice and Applications 19, no. 2 (2025): 194-210. DOI: https://doi.org/10.54216/FPA.190215
    Heksaputra, D., Gernowo, R., Isnanto, R. R. (2025) Over-Under Sampling Approach with Adaptive Synthetic and Tomek Links Methods to Handle Data Imbalance in Sentence Classification on Halal Assurance Certificate Documents. Fusion: Practice and Applications, 19(2), 194-210. DOI: https://doi.org/10.54216/FPA.190215
    Heksaputra D., Gernowo R., Isnanto R. R. [2025]. Over-Under Sampling Approach with Adaptive Synthetic and Tomek Links Methods to Handle Data Imbalance in Sentence Classification on Halal Assurance Certificate Documents. Fusion: Practice and Applications. 19(2): 194-210. DOI: https://doi.org/10.54216/FPA.190215
    Heksaputra, D., Gernowo, R., Isnanto, R. R. "Over-Under Sampling Approach with Adaptive Synthetic and Tomek Links Methods to Handle Data Imbalance in Sentence Classification on Halal Assurance Certificate Documents," Fusion: Practice and Applications, vol. 19, no. 2, pp. 194-210, 2025. DOI: https://doi.org/10.54216/FPA.190215