Full Length Article
Journal of Intelligent Systems and Internet of Things
Volume 10, Issue 2, pp. 90-101, 2023

Title

3D Hand Pose and Shape Estimation from Single RGB Image for Augmented Reality

  Enas Kh. Hassan 1*, Jamila Harbi S. 2

1  Remote Sensing and GIS Department, College of Science, University of Baghdad, Baghdad, Iraq
    (enas.mkhazal@gmail.com)

2  Computer Science Department, College of Science, Mustansiriyah University, Baghdad, Iraq
    (dr.jameelahharbi@gmail.com)


DOI: https://doi.org/10.54216/JISIoT.100208

Received: April 22, 2023; Revised: July 13, 2023; Accepted: October 11, 2023

Abstract:

In the realm of Human-Computer Interaction (HCI), the importance of hands cannot be overstated. Hands serve as a fundamental means of communication, expression, and interaction in the physical world. In recent years, Augmented Reality (AR) has emerged as a next-generation technology that seamlessly merges the digital and physical worlds, providing transformative experiences across various domains. In this context, accurate hand pose and shape estimation plays a crucial role in enabling natural and intuitive interactions within AR environments. By overlaying digital information onto the real world, AR has the potential to revolutionize how we interact with technology; from gaming and education to healthcare and industrial training, it has opened up new possibilities for enhancing user experiences. This study proposes an innovative approach for hand pose and shape estimation in AR applications. The methodology begins with a pre-trained Single Shot MultiBox Detector (SSD) model for hand detection and cropping. The cropped hand image is then converted to the HSV color space, and histogram equalization is applied to the value band. To precisely isolate the hand, specific bounds are set for each band of the HSV color space, generating a mask. The mask is refined with contouring and gap-filling techniques to diminish noise, and the refined mask is then combined with the original cropped image through a logical AND operation to accurately delineate the hand boundaries. This meticulous approach ensures robust hand detection even in complex scenes. To extract pertinent features, the detected hand undergoes two concurrent processes. First, the Scale-Invariant Feature Transform (SIFT) algorithm identifies distinctive keypoints on the hand's outer surface.
Simultaneously, a pre-trained lightweight Convolutional Neural Network (CNN), namely MobileNet, extracts 3D hand landmarks, the hand's center (the middle-finger metacarpophalangeal joint), and handedness information. These extracted features, encompassing hand keypoints, landmarks, center, and handedness, are aggregated into a CSV file for further analysis. A Gated Recurrent Unit (GRU) then processes the features, capturing intricate dependencies between them. The GRU model successfully predicts the 3D hand pose, achieving high accuracy even in dynamic scenarios. The evaluation results for the proposed model are very promising: the Mean Per Joint Position Error in 3D (MPJPE) between the predicted pose and the ground-truth hand landmarks is 0.0596, while the Percentage of Correct Keypoints (PCK) is 95%. Upon predicting the hand pose, a mesh representation is employed to reconstruct the 3D shape of the hand. This mesh provides a tangible representation of the hand's structure and orientation, enhancing the realism and usability of the AR application. By integrating sophisticated detection, feature extraction, and predictive modeling techniques, this method contributes to creating more immersive and intuitive AR experiences, fostering the seamless fusion of the digital and physical worlds.
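The segmentation stage described in the abstract (HSV conversion, per-band thresholding, and a logical AND of the mask with the cropped image) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the skin-tone bounds are hypothetical placeholders, the conversion uses the standard-library `colorsys` for clarity rather than an image library, and the contour refinement and gap-filling steps are omitted.

```python
import colorsys
import numpy as np

def hsv_mask(rgb, lo, hi):
    """Threshold an RGB image in HSV space; lo/hi are (h, s, v) bounds in [0, 1]."""
    h, w, _ = rgb.shape
    # Per-pixel RGB -> HSV conversion on normalized values.
    hsv = np.array([colorsys.rgb_to_hsv(*px) for px in rgb.reshape(-1, 3) / 255.0])
    hsv = hsv.reshape(h, w, 3)
    # A pixel is kept only if all three bands fall inside the bounds.
    return np.all((hsv >= lo) & (hsv <= hi), axis=-1)

def apply_mask(rgb, mask):
    """Logical AND of the binary mask with the original image (per channel)."""
    return rgb * mask[..., None].astype(rgb.dtype)

# Tiny synthetic "cropped hand": one skin-like pixel, one blue background pixel.
img = np.array([[[200, 150, 120], [30, 60, 200]]], dtype=np.uint8)
mask = hsv_mask(img, lo=(0.0, 0.2, 0.3), hi=(0.14, 0.7, 1.0))  # illustrative bounds
segmented = apply_mask(img, mask)
```

In a real pipeline, OpenCV's `cv2.cvtColor`, `cv2.inRange`, and `cv2.findContours` would perform the conversion, thresholding, and contour-based refinement efficiently on full-resolution crops.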

Keywords:

Pose estimation; Shape estimation; Hand; 3D space; HSV; SIFT; GRU; MANO
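For reference, the two evaluation metrics quoted in the abstract can be defined as below. This is a generic sketch: it assumes landmarks are stored as a (joints, 3) array, and since the paper's joint count and PCK distance threshold are not stated here, the 21-joint layout and 0.05 threshold in the example are illustrative.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per Joint Position Error: mean Euclidean distance over all joints."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def pck(pred, gt, thresh):
    """Percentage of Correct Keypoints: share of joints within `thresh` of ground truth."""
    return float((np.linalg.norm(pred - gt, axis=-1) < thresh).mean() * 100.0)

# Illustrative check: a uniform 0.03 offset on 21 joints.
gt = np.zeros((21, 3))
pred = gt.copy()
pred[:, 0] = 0.03
error = mpjpe(pred, gt)          # 0.03
accuracy = pck(pred, gt, 0.05)   # 100.0
```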

References:

[1] Liang, H., Yuan, J., Lee, J., Ge, L. and Thalmann, D., 2017. Hough forest with optimized leaves for global hand pose estimation with arbitrary postures. IEEE Transactions on Cybernetics, 49(2), pp.527-541.

[2] Obeid, N. (2023). On The Product and Ratio of Pareto and Erlang Random Variables. International Journal of Mathematics, Statistics, and Computer Science, 1, 33–47. https://doi.org/10.59543/ijmscs.v1i.7737

[3] Spurr, A., Song, J., Park, S. and Hilliges, O., 2018. Cross-modal deep variational hand pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 89-98).

[4] Zimmermann, C. and Brox, T., 2017. Learning to estimate 3d hand pose from single rgb images. In Proceedings of the IEEE international conference on computer vision (pp. 4903-4911).

[5] Ge, L., Cai, Y., Weng, J. and Yuan, J., 2018. Hand pointnet: 3d hand pose estimation using point sets. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 8417-8426).

[6] Ge, L., Liang, H., Yuan, J. and Thalmann, D., 2018. Robust 3D hand pose estimation from single depth images using multi-view CNNs. IEEE Transactions on Image Processing, 27(9), pp.4422-4436.

[7] Ge, L., Ren, Z. and Yuan, J., 2018. Point-to-point regression pointnet for 3d hand pose estimation. In Proceedings of the European conference on computer vision (ECCV) (pp. 475-491).

[8] Malik, J., Elhayek, A., Nunnari, F., Varanasi, K., Tamaddon, K., Heloir, A. and Stricker, D., 2018, September. Deephps: End-to-end estimation of 3d hand pose and shape by learning from synthetic depth. In 2018 International Conference on 3D Vision (3DV) (pp. 110-119). IEEE.

[9] Panteleris, P., Oikonomidis, I. and Argyros, A., 2018, March. Using a single rgb frame for real time 3d hand pose estimation in the wild. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV) (pp. 436-445). IEEE.

[10] Ge, L., Liang, H., Yuan, J. and Thalmann, D., 2018. Real-time 3D hand pose estimation with 3D convolutional neural networks. IEEE transactions on pattern analysis and machine intelligence, 41(4), pp.956-970.

[11] Taylor, J., Stebbing, R., Ramakrishna, V., Keskin, C., Shotton, J., Izadi, S., Hertzmann, A. and Fitzgibbon, A., 2014. User-specific hand modeling from monocular depth sequences. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 644-651).

[12] Boukhayma, A., Bem, R.D. and Torr, P.H., 2019. 3d hand shape and pose from images in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10843-10852).

[13] Ge, L., Ren, Z., Li, Y., Xue, Z., Wang, Y., Cai, J. and Yuan, J., 2019. 3d hand shape and pose estimation from a single rgb image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10833-10842).

[14] Guo, S., Rigall, E., Qi, L., Dong, X., Li, H. and Dong, J., 2020. Graph-based CNNs with self-supervised module for 3D hand pose estimation from monocular RGB. IEEE Transactions on Circuits and Systems for Video Technology, 31(4), pp.1514-1525.

[15] Cai, Y., Ge, L., Cai, J., Thalmann, N.M. and Yuan, J., 2020. 3D hand pose estimation using synthetic data and weakly labeled RGB images. IEEE transactions on pattern analysis and machine intelligence, 43(11), pp.3739-3753.

[16] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y. and Berg, A.C., 2016. SSD: Single shot multibox detector. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14 (pp. 21-37). Springer International Publishing.

[17] Hema, D. and Kannan, D.S., 2019. Interactive color image segmentation using HSV color space. Sci. Technol. J, 7(1), pp.37-41.

[18] Hassan, E.K. and Saud, J.H., 2023, February. HSV color model and logical filter for human skin detection. In AIP Conference Proceedings (Vol. 2457, No. 1). AIP Publishing.

[19] Arbelaez, P., Maire, M., Fowlkes, C. and Malik, J., 2010. Contour detection and hierarchical image segmentation. IEEE transactions on pattern analysis and machine intelligence, 33(5), pp.898-916.

[20] Bond, C., 2011. An efficient and versatile flood fill algorithm for raster scan displays.

[21] Wu, J., Cui, Z., Sheng, V.S., Zhao, P., Su, D. and Gong, S., 2013. A Comparative Study of SIFT and its Variants. Measurement science review, 13(3), pp.122-131.

[22] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M. and Adam, H., 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.

[23] Abbas, S.K. and George, L.E., 2020. The Performance Differences between Using Recurrent Neural Networks and Feedforward Neural Network in Sentiment Analysis Problem. Iraqi Journal of Science, 61(6).

[24] Romero, J., Tzionas, D. and Black, M.J., 2022. Embodied hands: Modeling and capturing hands and bodies together. arXiv preprint arXiv:2201.02610.

[25] Habibie, I., Xu, W., Mehta, D., Pons-Moll, G. and Theobalt, C., 2019. In the wild human pose estimation using explicit 2d features and intermediate 3d representations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10905-10914).

[26] Papandreou, G., Zhu, T., Kanazawa, N., Toshev, A., Tompson, J., Bregler, C. and Murphy, K., 2017. Towards accurate multi-person pose estimation in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4903-4911).

[27] Sharma, R.P. and Verma, G.K., 2015. Human computer interaction using hand gesture. Procedia Computer Science, 54, pp.721-727.

[28] Aliprantis, J., Konstantakis, M., Nikopoulou, R., Mylonas, P. and Caridakis, G., 2019, January. Natural Interaction in Augmented Reality Context. In VIPERC@ IRCDL (pp. 50-61).

[29] Mueller, F., Bernard, F., Sotnychenko, O., Mehta, D., Sridhar, S., Casas, D. and Theobalt, C., 2018. Ganerated hands for real-time 3d hand tracking from monocular rgb. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 49-59).

[30] Abbas, A.H., Arab, A. and Harbi, J., 2018. Image compression using principal component analysis. Mustansiriyah Journal of Science, 29(2), p.01854.

[31] Abbas, A.H., 2011. Mathematical Morphology Operations on Grayscale Image. Journal of the College of Basic Education, 17(67), pp.105-115.

[32] Abdullah, R.M., Alazawi, S.A.H. and Ehkan, P., 2023. SAS-HRM: Secure Authentication System for Human Resource Management. Al-Mustansiriyah Journal of Science, 34(3), pp.64-71.

