Filtering Big Data with Optimized Hybrid Algorithm for IoT-Based Data Selection

Sarvesh Kumar¹, Satyajee Srivastava², Surendra Kumar³*, Arun Kumar Saini⁴, Neeraj Verma⁵, Dhiraj Kapila⁶

¹,⁴,⁵Department of CSE, IcfaiTech, The ICFAI University, Jaipur, India

²School of Computer Science and Engineering, Galgotias University, Uttar Pradesh, India

³*Department of Computer Engineering and Applications, GLA University, Mathura, India

⁶Department of Computer Science & Engineering, Lovely Professional University, Punjab, India

Emails: skumarcse4@gmail.com; drsatyajee@gmail.com; kumar.surendra1989@gmail.com; arunsaini1@gmail.com; er.neerajkumar@gmail.com; dhiraj.23509@lpu.co.in

Abstract

Data management across servers has become increasingly difficult as technological advances expand data processing and storage capacities. Data that is neither organized nor labelled further complicates storage and retrieval, and such untagged data requires analytic techniques that are both more powerful and more time-efficient. Clustering has long been regarded as one of the most effective methods for managing large amounts of data; nonetheless, conventional clustering methods can yield unexpectedly poor accuracy as data volumes grow. In this study, we propose a novel framework for clustering large amounts of data. Because preprocessing is one of the most important parts of data cleansing, a global stop-word list is used to filter the contents of the files before they are passed to the cluster distribution stage. A meta-heuristic Genetic Algorithm (GA) is used to eliminate redundant information in the datasets. In addition to the generalized attributable fitness function, a novel attribute-based fitness function (f) is developed. To assess its performance, the proposed method is compared with a variety of alternative clustering approaches. The cluster distributions are evaluated by computing the Standard Error (SE), root mean squared error (RMSE), and corrected R-squared.

Keywords: Meta-heuristic; Internet of Things; Data selection; K-Means Clustering; K-Medoid; Genetic Algorithm.
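
The stop-word filtering step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word list, the tokenization rule, the function name `filter_stop_words`, and the sample file names and contents are all hypothetical, since the abstract does not specify them.

```python
import re

# Hypothetical global stop-word list; the abstract does not enumerate the one actually used.
GLOBAL_STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}
)


def filter_stop_words(text, stop_words=GLOBAL_STOP_WORDS):
    """Lowercase the text, split it into alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in stop_words]


# Hypothetical file contents standing in for unlabelled IoT text data.
documents = {
    "sensor_log.txt": "The temperature of the sensor is rising in the lab",
    "gateway_note.txt": "A gateway forwards readings to the server",
}

# Filter every file before it would be handed to a clustering stage.
filtered = {name: filter_stop_words(body) for name, body in documents.items()}

for name, tokens in filtered.items():
    print(name, tokens)
```

Under these assumptions, each file is reduced to its content-bearing tokens, which is the form a later redundancy-removal or clustering stage would consume.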