Real-Time Gesture Recognition Using Attention-Based CNN-RNN Framework for Human-Robot Interaction

R. Poorni1,*, Chinnathambi Kamatchi2, Y. Dharshan3, K. Kowsalya4, R. Vijay5, M. Balakrishnan6

 

1Assistant Professor, School of Computer Science Engineering, SRM Institute of Science and Technology, Ramapuram, Chennai, Tamil Nadu, India

 

2Assistant Professor, Department of Computer Science and Engineering, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Avadi, Chennai, Tamil Nadu, India

 

3Assistant Professor, Department of Electronics and Instrumentation Engineering, Sri Ramakrishna Engineering College, Coimbatore, Tamil Nadu, India

 

4Assistant Professor, Department of Electronics and Communication Engineering, Hindusthan Institute of Technology, Coimbatore, Tamil Nadu, India

 

5Assistant Professor, Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation (Deemed to be University), Andhra Pradesh, India

 

6Professor, Department of Artificial Intelligence and Data Science, Dr. Mahalingam College of Engineering and Technology, Pollachi, Coimbatore, Tamil Nadu, India

 

Emails: Poorniram21@gmail.com; k.chinnathambimku@gmail.com; dharshan.y@srec.ac.in; kowsalya.k@hit.edu.in; vijayraja4398@gmail.com; balakrishnanme@gmail.com

Abstract

Gesture recognition is a key enabler of natural and intuitive human–robot interaction (HRI) in smart automation and assistive systems. However, achieving real-time performance with high recognition accuracy remains a significant challenge because of dynamic background variations, occlusion, and complex spatio-temporal dependencies in gesture sequences. This paper presents a real-time attention-based CNN-RNN framework for robust gesture recognition and adaptive HRI in dynamic environments. The proposed system uses Convolutional Neural Networks (CNNs) to extract spatial features from sequential video frames, and Bidirectional Recurrent Neural Networks (BiRNNs) integrated with an attention mechanism to model temporal dependencies and focus on discriminative motion cues. The attention layer improves interpretability by assigning higher weight to salient gesture frames and suppressing background noise. A hybrid optimization strategy that combines adaptive learning rate scheduling with regularized dropout supports stable training and generalization across gesture datasets. Experiments on benchmark datasets, including NVIDIA Dynamic Gesture (NvGesture) and ChaLearn IsoGD, show an accuracy of 97.8% and a real-time inference speed of 34 FPS, outperforming baseline CNN, 3D-CNN, and LSTM architectures. The proposed framework balances accuracy, latency, and interpretability, making it suitable for real-world HRI applications such as service robotics, industrial automation, and assistive technologies.
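As an illustration of the attention step described in the abstract, the sketch below pools per-frame hidden states into a single context vector using additive (tanh-scored) attention. This is only a minimal sketch under stated assumptions: the abstract does not specify the scoring function, so the additive form, the function name attention_pool, the parameters W, b, and v, and the toy dimensions are all hypothetical, and random numbers stand in for real BiRNN outputs.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, W, b, v):
    """Additive attention pooling over per-frame hidden states (assumed form).

    hidden_states: T vectors of length D, e.g. concatenated forward and
                   backward RNN outputs for each video frame.
    W (A x D), b (length A), v (length A): hypothetical learned parameters.
    Returns (context vector of length D, attention weights of length T).
    """
    scores = []
    for h in hidden_states:
        # Project the frame state, squash with tanh, then score against v.
        proj = [
            math.tanh(sum(w * x for w, x in zip(row, h)) + b_i)
            for row, b_i in zip(W, b)
        ]
        scores.append(sum(v_i * p_i for v_i, p_i in zip(v, proj)))

    # Normalize scores so the weights over frames sum to 1.
    weights = softmax(scores)

    # Context vector: attention-weighted sum of the frame states.
    dim = len(hidden_states[0])
    context = [
        sum(a * h[d] for a, h in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return context, weights


# Toy example: T frames, D-dimensional states, A-dimensional attention space.
rng = random.Random(0)
T, D, A = 5, 4, 3
H = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(T)]
W = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(A)]
b = [0.0] * A
v = [rng.uniform(-1, 1) for _ in range(A)]

context, weights = attention_pool(H, W, b, v)
print("attention weights per frame:", [round(w, 3) for w in weights])
```

The per-frame weights are what make such a layer inspectable: frames with larger weights contribute more to the context vector that a classifier would consume.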

Received: January 10, 2025; Revised: February 24, 2025; Accepted: March 30, 2025

 

Keywords: Gesture recognition; human–robot interaction (HRI); convolutional neural network (CNN); recurrent neural network (RNN); attention mechanism; bidirectional RNN; spatio-temporal modeling; real-time processing; deep learning; intelligent robotics