Volume 15, Issue 1, pp. 250-261, 2024 | Full Length Article
A. Madhuri 1*, T. Umadevi 2
Doi: https://doi.org/10.54216/FPA.150120
In this work, we rethink the linguistic learning process in scene text recognition and abandon the widely accepted complex language model. We present a Visual Language Modeling Network (VisionLAN), which treats visual and linguistic information as a union by directly endowing the vision model with language capability, in contrast to prior methods that handle visual and linguistic information in two separate structures. Specifically, we introduce text recognition on character-wise occluded feature maps in the training stage. When visual cues are confused (for example by occlusion or noise), this operation guides the vision model to use both the visual texture of the characters and the linguistic information in the visual context for recognition. This work also investigates the critical role that context plays in improving the performance of visual language models for object detection and recognition in irregular scene images. Irregular scenes, characterized by intricate and ever-changing visual components, pose distinct difficulties for conventional computer vision systems.
Visual Language Models, Recognition, Detection, Irregular.