Advanced Deep Learning Model for Image Captioning Using Customized Vision Transformer with Global Optimization Algorithm

Suleman Alnatheer; Mohammed Altaf Ahmed

doi:https://doi.org/10.54216/JISIoT.180219

Suleman Alnatheer ¹*, Mohammed Altaf Ahmed ²

1 Department of Computer Engineering, College of Computer Engineering & Sciences, Prince Sattam bin Abdulaziz University, Alkharj-11942, Saudi Arabia - (Suleman Alnatheer )

2 Department of Computer Engineering, College of Computer Engineering & Sciences, Prince Sattam bin Abdulaziz University, Alkharj-11942, Saudi Arabia - (m.altaf@psau.edu.sa)

Received: March 18, 2025 Revised: June 10, 2025 Accepted: August 03, 2025

Abstract

In image captioning, the quality of the generated captions is vital for communicating visual content effectively. Image captioning is a central task that unites computer vision (CV) and natural language processing (NLP), with the goal of producing descriptive captions for images. It is a two-fold procedure that depends on accurate image perception and on language understanding at both the semantic and syntactic levels. Keeping up with current research and results in image captioning is increasingly difficult owing to the growing volume of work on the topic. This study applies deep learning (DL) to the challenges faced by individuals with visual impairments, aiming to improve their access to visual content through advanced technologies. Traditionally, visually impaired people have relied on physical assistance and adaptive aids to understand and navigate visual content. The advent of DL offers an opportunity to change this landscape. In this paper, we propose an Advanced Deep Learning Method for Image Captioning Using a Customized Transformer with a Global Optimization Algorithm (ADLIC-CTGOA). The main aim of the ADLIC-CTGOA model is to generate effective textual captions for an input image. First, the ADLIC-CTGOA method applies a preprocessing phase to both image and text data: images undergo noise removal and contrast enhancement to improve quality, while text is processed by removing numbers, converting to lowercase, and vectorizing. Next, a customized Swin Transformer is employed for feature extraction to capture fine-grained visual features from the images. The BERT Transformer model is then deployed for the caption generation process. Finally, the chaotic Aquila optimization (CAO) technique is applied for parameter tuning to enhance the performance of the proposed technique.
A wide range of simulation studies is carried out to verify the improved performance of the ADLIC-CTGOA system. The comparative analysis shows that the ADLIC-CTGOA model improves on recent approaches across several evaluation measures.
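
The text preprocessing steps named in the abstract (number removal, lowercasing, vectorization) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `<pad>` and `<unk>` tokens, whitespace tokenization, and the integer-index vectorization scheme are all assumptions chosen for illustration, and the abstract does not say which vectorizer the paper uses.

```python
import re


def clean_caption(text: str) -> str:
    """Remove digits, convert to lowercase, and collapse extra whitespace."""
    text = re.sub(r"\d+", " ", text)   # drop numbers
    text = text.lower()                # lowercase
    return " ".join(text.split())      # normalize spacing


def build_vocab(captions):
    """Assign an integer id to each distinct token (hypothetical scheme).

    Ids 0 and 1 are reserved for padding and unknown tokens.
    """
    vocab = {"<pad>": 0, "<unk>": 1}
    for caption in captions:
        for token in clean_caption(caption).split():
            vocab.setdefault(token, len(vocab))
    return vocab


def vectorize(caption, vocab, max_len=8):
    """Turn a caption into a fixed-length list of token ids."""
    tokens = clean_caption(caption).split()
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


if __name__ == "__main__":
    corpus = ["A Dog runs 2 miles", "Two dogs run"]
    vocab = build_vocab(corpus)
    print(clean_caption(corpus[0]))
    print(vectorize("A dog sleeps", vocab, max_len=5))
```

Padding to a fixed length is one common way to batch captions of different lengths. Punctuation handling and subword tokenization (as a BERT-style model would typically require) are left out of this sketch.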

Keywords:

Image Captioning, Global Optimization Algorithm, Swin Transformer, BERT, Natural Language Processing, Deep Learning

Cite This Article As:

Alnatheer, Suleman, and Altaf, Mohammed. Advanced Deep Learning Model for Image Captioning Using Customized Vision Transformer with Global Optimization Algorithm. Journal of Intelligent Systems and Internet of Things, vol. , no. , 2026, pp. 273-289. DOI: https://doi.org/10.54216/JISIoT.180219
