Volume 7, Issue 1, pp. 53-65, 2022 | Full Length Article
Sara Muslih Mishal 1*, Murtadha M. Hamad 2
Doi: https://doi.org/10.54216/FPA.070105
Most information (more than 80%) is stored as text, so text mining is an important process: it is the initial step in text classification, and this is especially true for the Arabic language. The aim of this study is to classify Arabic texts into specific categories and to evaluate the results using advanced performance indicators. We used Databricks as a platform for managing and organizing Apache Spark to address big data challenges; Apache Spark offers several integrated language APIs. An NLP library was used for text processing. The data are pre-processed in several steps: splitting each text into words on the basis of the spaces between them, cleaning the text of unwanted words, and reducing words to their roots. Feature selection, itself a preprocessing technique, is a critical step in text classification. In this paper, term frequency (TF), which counts how often each feature appears in a document, serves as the first level of feature selection. TF-IDF is then used to determine the significance of each feature in the document; this is the last preprocessing step before text classification. Results were evaluated using advanced performance indicators such as accuracy, precision, and recall, and a high accuracy of 96.94% was achieved. The main objective of this paper is to classify basic texts quickly and accurately. According to the results, as long as the feature size is suitable, the most advanced technique is superior to the other pass-rate methods, owing to its reasonable reliability and ideal pruning level.
Text Mining, Text Classification, CNN, Apache Spark, Databricks.
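The preprocessing chain summarized in the abstract (whitespace tokenization, removal of unwanted words, TF, then TF-IDF) can be illustrated with a minimal plain-Python sketch. It is not the authors' Spark pipeline: the stop-word list, the sample documents, and all function names are hypothetical, root extraction is omitted because it needs a language-specific stemmer, and the smoothed IDF form log((N + 1) / (df + 1)) is an assumption chosen to match the formula documented for Spark MLlib's IDF.

```python
import math
from collections import Counter

# Hypothetical stop-word list, for illustration only (not from the paper).
STOP_WORDS = {"في", "من", "على", "إلى"}


def tokenize(text):
    """Split a text on whitespace and drop unwanted (stop) words."""
    return [word for word in text.split() if word not in STOP_WORDS]


def term_frequency(tokens):
    """Raw count of each term in one document (the TF step)."""
    return Counter(tokens)


def inverse_document_frequency(docs_tokens):
    """Smoothed IDF per term: log((N + 1) / (df + 1)).

    N is the number of documents and df is the number of documents
    that contain the term. The smoothing is an assumption of this sketch.
    """
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log((n_docs + 1) / (df + 1))
            for term, df in doc_freq.items()}


def tf_idf(docs):
    """Return one {term: TF * IDF} dictionary per input document."""
    docs_tokens = [tokenize(doc) for doc in docs]
    idf = inverse_document_frequency(docs_tokens)
    return [
        {term: count * idf[term]
         for term, count in term_frequency(tokens).items()}
        for tokens in docs_tokens
    ]


# Hypothetical sample corpus: two economy texts and one sports text.
SAMPLE_DOCS = ["الاقتصاد في نمو", "الرياضة في الملعب", "نمو الاقتصاد"]
sample_weights = tf_idf(SAMPLE_DOCS)
```

In this toy corpus, a term that occurs in only one document receives a larger weight than a term shared by two documents, which is the sense in which TF-IDF measures the significance of a feature. The resulting dictionaries would then be passed to whichever classifier is being evaluated.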