QIAO, TANQIU (2025) Geometry-Informed Graph Neural Networks for Multi-Person Human-Object Interaction Recognition in Videos. Doctoral thesis, Durham University.
| PDF - Accepted Version. Available under License Creative Commons Attribution Non-commercial 3.0 (CC BY-NC). 4Mb |
Abstract
Human-Object Interaction (HOI) recognition in videos is a fundamental task in computer vision with wide-ranging applications, including robotics, surveillance, and autonomous systems. Accurately modeling the complex interactions between multiple humans and objects in dynamic environments is crucial for developing intelligent systems that can understand and recognize human behavior.
HOI recognition in multi-person scenarios presents challenges beyond those of traditional action recognition and single-person HOI tasks. With multiple individuals interacting simultaneously with various objects, complexities such as occlusions and overlapping interactions become prevalent. Video-based analysis is crucial, as static images fail to capture the temporal dynamics necessary for understanding these interactions. To tackle these challenges, integrating geometric cues like human poses and object keypoints with visual features such as appearance and motion is essential. Geometric understanding is inherently more robust to occlusions and can provide additional spatial information that visual features alone may miss. The primary aim of this research is to develop a robust and accurate multi-person HOI recognition framework that effectively fuses geometric and visual features, addressing these complexities through three objectives: (1) designing advanced multimodal feature fusion methods, (2) collecting comprehensive multi-person HOI datasets, and (3) creating a generalizable framework suited for diverse scenarios.
The motivation behind this research direction stems from the limitations of current visual-based approaches, which often fail to generalize in complex real-world scenarios. Extracting geometric features is inspired by skeleton-based action recognition, as such features are less affected by challenges like partial occlusions. Effective fusion of geometric and visual features is critical for creating a holistic representation that enhances the model’s understanding of interactions. Additionally, the success of this framework hinges on the availability of high-quality datasets that reflect the diversity of real-world multi-person HOI (MPHOI) situations. Therefore, we also collect multi-person HOI datasets that not only aid in training and validating the proposed model but also contribute to the broader research community. This comprehensive approach equips our framework to handle the intricate nature of MPHOI recognition in dynamic video environments.
This research introduces a series of novel frameworks designed to enhance the robustness and accuracy of multi-person HOI recognition in videos. We start with the Two-level Geometric feature-informed Graph Convolutional Network (2G-GCN), the first attempt to complement visual features with geometric features learned from geometric understanding via graph-based deep learning methods. We also introduce MPHOI-72, a novel two-person HOI dataset specifically designed to evaluate the effectiveness of 2G-GCN in multi-person HOI scenarios, thereby advancing the field from single-person to multi-person HOI recognition.
Building on the insight from 2G-GCN that geometric cues offer extensive complementary information, we identify the need for a more effective fusion of geometric and visual features. We propose the CATS framework to advance HOI recognition from category-level to scenery-level understanding. This framework fuses geometric and visual features for each human and object category, and subsequently constructs a scenery interactive graph to learn the relationships among these categories, providing a more structured and comprehensive understanding of the interactions within a scene.
Recognizing the need for further improvements in multimodal feature fusion and dynamic interaction modeling, we propose the Geometric Visual Fusion Graph Neural Networks (GeoVis-GNN). It further refines the fusion of geometric and visual features at the entity level via a dual-attention mechanism and enhances HOI modeling by an interdependent entity graph. To better represent realistic multi-person HOI scenarios, we introduce MPHOI-120, a challenging dataset collecting three-person HOI activities with frequent occlusions and exponentially increasing interaction complexity.
We validate the effectiveness of our methods through extensive experiments and qualitative analysis, demonstrating that our approaches outperform state-of-the-art techniques in HOI recognition across both multi-person and single-person scenarios in videos.
Item Type: | Thesis (Doctoral) |
---|---|
Award: | Doctor of Philosophy |
Keywords: | Human-Object Interaction; geometric understanding; multimodal fusion; graph neural networks; deep learning; computer vision |
Faculty and Department: | Faculty of Science > Computer Science, Department of |
Thesis Date: | 2025 |
Copyright: | Copyright of this thesis is held by the author |
Deposited On: | 21 Feb 2025 09:12 |