Volume 4, Issue 2, PP: 42-55, 2021 | Full Length Article
Shitiz Gupta 1*, Shubham Agnihotri 2, Deepasha Birla 3, Achin Jain 4, Thavavel Vaiyapuri 5, Puneet Singh Lamba 6
DOI: https://doi.org/10.54216/FPA.040202
Abstract: Image caption generation is a challenging multimodal task. Substantial advancements have been made in the field of deep learning, notably in computer vision and natural language processing, yet human-generated captions are still considered superior, which makes image captioning a demanding application for interactive machine learning. In this paper, we compare different transfer learning techniques and develop a novel architecture to improve image captioning accuracy. We compute image feature vectors using different state-of-the-art transfer learning models and feed them, together with embedded caption text, into an encoder-decoder network based on stacked LSTMs with soft attention to generate high-accuracy captions. We compare these models on several benchmark datasets using evaluation metrics such as BLEU and METEOR.
Keywords: Image Captioning, Transfer Learning, CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), and LSTM (Long Short-Term Memory).
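The pipeline summarised in the abstract can be pictured with a short sketch. The Keras/TensorFlow code below is a minimal illustration rather than the authors' implementation: InceptionV3 is assumed as the pretrained feature extractor (VGG16, ResNet50 or EfficientNet plug in the same way), the vocabulary size, caption length, embedding and LSTM widths are placeholder values, and dot-product attention stands in for the soft-attention module described in the paper.

from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.applications.inception_v3 import preprocess_input
from tensorflow.keras.layers import (Input, Dense, Embedding, LSTM, Reshape,
                                     Attention, Concatenate, TimeDistributed)
from tensorflow.keras.models import Model

VOCAB_SIZE = 10000   # placeholder vocabulary size
MAX_LEN = 34         # placeholder maximum caption length
EMB_DIM = 256        # placeholder embedding dimension
UNITS = 512          # placeholder LSTM width

# Transfer learning: a pretrained CNN with its classification head removed
# acts as a frozen image feature extractor.
cnn = InceptionV3(weights="imagenet", include_top=False, pooling="avg")
cnn.trainable = False

def extract_features(images):
    """images: float32 array of shape (n, 299, 299, 3) with values in [0, 255]."""
    return cnn.predict(preprocess_input(images))   # shape (n, 2048)

# Decoder: embedded caption tokens pass through stacked LSTMs; each decoding
# step attends over the projected image representation (soft attention).
img_input = Input(shape=(2048,), name="image_features")
img_proj = Dense(UNITS, activation="relu")(img_input)
img_seq = Reshape((1, UNITS))(img_proj)              # image as a one-step attention memory

cap_input = Input(shape=(MAX_LEN,), name="caption_tokens")
emb = Embedding(VOCAB_SIZE, EMB_DIM)(cap_input)
lstm1 = LSTM(UNITS, return_sequences=True)(emb)
lstm2 = LSTM(UNITS, return_sequences=True)(lstm1)     # stacked LSTMs

context = Attention()([lstm2, img_seq])               # soft (dot-product) attention
merged = Concatenate()([lstm2, context])
outputs = TimeDistributed(Dense(VOCAB_SIZE, activation="softmax"))(merged)

# Targets are the caption tokens shifted one step to the left (next-word prediction).
model = Model([img_input, cap_input], outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()

Generated captions are then scored against reference captions with BLEU and METEOR, as noted in the abstract. The snippet below is an illustrative evaluation sketch assuming NLTK is installed (METEOR additionally needs the WordNet corpus); the reference and candidate captions are made-up examples, not results from the paper.

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score   # requires nltk.download("wordnet")

# One hypothesis caption with two tokenised reference captions (illustrative data).
references = [
    [["a", "dog", "runs", "across", "the", "grass"],
     ["a", "brown", "dog", "is", "running", "on", "grass"]],
]
hypotheses = [["a", "dog", "is", "running", "on", "the", "grass"]]

smooth = SmoothingFunction().method4
bleu1 = corpus_bleu(references, hypotheses, weights=(1, 0, 0, 0), smoothing_function=smooth)
bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)

# NLTK >= 3.6.6 expects pre-tokenised input for METEOR; earlier versions took raw strings.
meteor = sum(meteor_score(refs, hyp) for refs, hyp in zip(references, hypotheses)) / len(hypotheses)

print(f"BLEU-1: {bleu1:.3f}  BLEU-4: {bleu4:.3f}  METEOR: {meteor:.3f}")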