Journal of Cybersecurity and Information Management
JCIM
2690-6775
2769-7851
10.54216/JCIM
https://www.americaspg.com/journals/show/3971
2019
2019
Identify and Remove Duplicated Records Using Q-gram and Statistical Techniques from the Data Warehouse
University of Anbar, College of Computer Sciences and Information Technology, Anbar, Ramadi, 31001, Iraq
Sura
Sura
University of Anbar, College of Computer Sciences and Information Technology, Anbar, Ramadi, 31001, Iraq
Rihab
Hazim
University of Anbar, College of Computer Sciences and Information Technology, Anbar, Ramadi, 31001, Iraq
Yaqeen
Saad
University of Anbar, College of Islamic Sciences, Anbar, Ramadi, 31001, Iraq
Nadia
Mohammed
Duplicate detection, also known as record linkage, has several real-world uses. It appears across a broad range of tasks, including recognizing similar data, linking online documents on the web, and detecting plagiarism, and it helps the many applications that rely on it make better judgments. Routing is crucial to improving the financial interest and applicability of logistics projects. The problem addressed in this study is as follows: because duplicate records are subject to the same data restrictions and limitations and differ only by minor changes in the data, a duplicate record is difficult to distinguish from other edited records that belong to the same customer. The purpose of this study is to use the Q-gram and statistical techniques to find the best method for detecting and removing duplicate records. To that end, we set the following goals: reduce the size of the data warehouse (DW) by providing a data warehouse free of duplicates, decrease the time spent searching the DW, and improve the decision support system (DSS). The approach is divided into two stages: first, identify similar records based on Q-gram similarity; second, determine whether the classification of records can be improved by statistical methods. A similarity threshold of 0.68 was determined. If the similarity ratio of a record's key exceeds this threshold, the record goes through a statistical process that decides whether it is a duplicate. The accuracy of the proposed work is 79%.
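The first stage described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the value of q, the similarity formula, or how record keys are normalized, so the choices here are assumptions: q = 2, lowercased keys, and the Dice coefficient over q-gram multisets. The function names `qgrams`, `qgram_similarity`, and `candidate_duplicates` are hypothetical. Only the 0.68 threshold comes from the abstract. The second, statistical stage is not described in enough detail to sketch, so the code stops at flagging candidate pairs.

```python
from collections import Counter

THRESHOLD = 0.68  # similarity threshold reported in the abstract


def qgrams(text, q=2):
    """Return the multiset of overlapping q-character substrings of text.

    Lowercasing and q=2 are assumptions, not taken from the paper.
    """
    s = text.lower().strip()
    if len(s) < q:
        # Strings shorter than q are kept as a single gram (empty -> no grams).
        return Counter([s]) if s else Counter()
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_similarity(a, b, q=2):
    """Dice coefficient over q-gram multisets (an assumed choice of measure)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        return 1.0  # two empty keys are treated as identical
    shared = sum((ga & gb).values())  # multiset intersection
    return 2.0 * shared / total


def candidate_duplicates(keys, threshold=THRESHOLD, q=2):
    """Return index pairs whose key similarity exceeds the threshold.

    These pairs would then be passed to the statistical stage.
    """
    pairs = []
    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            if qgram_similarity(keys[i], keys[j], q) > threshold:
                pairs.append((i, j))
    return pairs
```

For example, `candidate_duplicates(["John Smith", "Jon Smith", "Mary Jones"])` flags only the first two keys as a candidate pair. The pairwise loop is quadratic in the number of records, so a real data warehouse would likely need blocking or indexing before this comparison.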
2026
2026
01
09
10.54216/JCIM.170101
https://www.americaspg.com/articleinfo/2/show/3971