Volume 4, Issue 2, pp. 53-61, 2025 | Full Length Article
Bahodir Muminov 1*, Ziyoda Norqulova 2
DOI: https://doi.org/10.54216/MOR.040206
Today, information systems handle large volumes of data obtained from various sources. These data can differ in both form and meaning, and this diversity is one of the main problems in integrating and analyzing network data. This paper analyzes the main data types: numeric, integer, text, categorical, temporal, logical, and spatial. For each data type, a corresponding normalization approach is selected. In particular, we study min-max scaling and Z-score standardization for numeric data, one-hot and label encoding for categorical attributes, and lemmatization and Unicode-based normalization for text data. The analysis shows that choosing the right approach for each data type increases the efficiency of unification, ontological mapping, and visualization. The article analyzes the advantages and limitations of existing normalization methods and provides practical recommendations for selecting optimal methods for processing network data. The proposed approach can be used effectively in the semantic integration of multi-source network data, as well as in its visual analysis.
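The normalization methods named in the abstract for numeric, categorical, and text data can be illustrated with a minimal sketch. It assumes plain Python lists as input and uses only the standard library; the function names are illustrative and are not taken from the article. Lemmatization is omitted because it needs a language-specific dictionary or an external library, and the choice of the NFKC form for Unicode normalization is an assumption, since the abstract does not name a form.

```python
import statistics
import unicodedata


def min_max_scale(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def label_encode(categories):
    """Map each distinct category to an integer code (sorted order)."""
    codes = {c: i for i, c in enumerate(sorted(set(categories)))}
    return [codes[c] for c in categories], codes


def one_hot_encode(categories):
    """Turn each category into a 0/1 indicator vector, one slot per category."""
    labels, codes = label_encode(categories)
    width = len(codes)
    return [[1 if i == code else 0 for i in range(width)] for code in labels]


def normalize_text(text):
    """Apply NFKC normalization, case folding, and whitespace collapsing."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(folded.split())
```

Label encoding imposes an artificial order on the categories, so it tends to suit ordinal attributes, while one-hot encoding avoids that order at the cost of one column per category.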
Data normalization, Network databases, Data types, Semantic consistency, Ontological mapping