ALSEHAIM, AISHAH,ABDULRAHMAN (2023) Video Person Re-identification for Future Automated Visual Surveillance Systems. Doctoral thesis, Durham University.
Abstract
Person Re-identification (Re-ID) across a collection of surveillance cameras is becoming an increasingly vital component of smart intelligent surveillance systems. Due to the numerous variations in human pose, occlusion, viewpoint, illumination and background clutter, most contemporary video Re-ID studies use complex CNN-based network architectures with 3D convolution or multi-branch networks in order to extract spatio-temporal video features. In this thesis, we address the significant challenge posed by person Re-ID by encoding person videos into robust, discriminative feature vectors that improve performance under these challenging conditions.

The extraction of strong and discriminative features is fundamental to person Re-ID, and CNN-based approaches have dominated this area. We show that a simple single-stream 2D convolutional network using the ResNet50-IBN architecture to extract frame-level features can achieve superior performance when combined with temporal attention to form clip-level features. By averaging, these clip-level features generalise to entire videos at no added expense. While other recent work uses complicated and memory-intensive 3D convolutions or multi-stream network architectures, our method combines video Re-ID best practice with transfer learning between datasets to achieve superior outcomes for person Re-ID.

Moreover, we consider the task of joint person Re-ID and action recognition within the context of automated surveillance, learning discriminative feature representations that both improve Re-ID performance and are capable of providing viable per-view (clip-wise) action recognition. Weakly labelled actions from the two leading benchmark video Re-ID datasets (MARS, LPW) are used to perform a hybrid Re-ID and action recognition task, combining two task-specific loss terms in a multi-loss objective.
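The temporal-attention pooling of frame-level features described above can be sketched as follows. This is an illustrative sketch, not the thesis code: it assumes per-frame features already extracted by a 2D backbone (e.g. ResNet50-IBN) and uses random tensors in their place; the module name `TemporalAttention` and the single-linear scoring head are assumptions.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Weight per-frame features by a learned score, then sum to a clip feature."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)  # one attention score per frame

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim)
        weights = torch.softmax(self.score(frame_feats), dim=1)  # (B, T, 1)
        return (weights * frame_feats).sum(dim=1)                # (B, feat_dim)

# Stand-in for frame-level backbone features: 4 clips x 8 frames x 2048-D
clip_feat = TemporalAttention(2048)(torch.randn(4, 8, 2048))
print(clip_feat.shape)  # torch.Size([4, 2048])
```

Averaging such clip-level vectors over all clips of a video then yields a video-level descriptor without any extra learned parameters.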
Our multi-branch 2D CNN architecture achieves superior results to previous work in the field precisely because we treat Re-ID and action recognition as a multi-task problem.

Recently, vision transformer (ViT) architectures have been shown to boost fine-grained feature discrimination across a variety of vision tasks. To adapt ViT to video person Re-ID, two novel modules, Temporal Clip Shift and Shuffled (TCSS) and Video Patch Part Feature (VPPF), are proposed so that ViT architectures can effectively meet the challenges of video person Re-ID. Overall, we present three novel deep learning architectures that address the video person Re-ID task, spanning CNN, multi-task learning and ViT approaches.
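The multi-loss combination of the two task-specific terms can be sketched as below. This is a hedged illustration only: the task weights `lambda_reid` and `lambda_action`, the class counts, and the use of plain cross-entropy for both heads are assumptions, not the thesis's actual losses or values (625 is the MARS training identity count, used here for flavour).

```python
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()

# Stand-ins for the two task heads applied to shared clip-level features:
reid_logits = torch.randn(4, 625, requires_grad=True)   # identity head
action_logits = torch.randn(4, 10, requires_grad=True)  # weak action head
id_targets = torch.randint(0, 625, (4,))
action_targets = torch.randint(0, 10, (4,))

# Weighted sum of the two task-specific loss terms (weights are illustrative)
lambda_reid, lambda_action = 1.0, 0.5
loss = (lambda_reid * ce(reid_logits, id_targets)
        + lambda_action * ce(action_logits, action_targets))
loss.backward()  # one backward pass trains both heads and the shared backbone
```

Because both terms share the same feature extractor, gradients from the action branch regularise the Re-ID features and vice versa, which is the mechanism behind the multi-task gain claimed above.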
| Item Type: | Thesis (Doctoral) |
| --- | --- |
| Award: | Doctor of Philosophy |
| Keywords: | Re-ID |
| Faculty and Department: | Faculty of Science > Computer Science, Department of |
| Thesis Date: | 2023 |
| Copyright: | Copyright of this thesis is held by the author |
| Deposited On: | 11 Aug 2023 12:50 |