Conference paper, Year: 2021

Self-Supervised Video Pose Representation Learning for Occlusion-Robust Action Recognition

Abstract

Action recognition based on human pose has attracted increasing attention because of its robustness to changes in appearance, environment, and viewpoint. Despite this progress, one remaining challenge is occlusion in real-world videos, which prevents some joints from being visible. Such occlusion degrades the representations produced by models trained on full-body pose data captured in laboratory conditions with dedicated sensors. To address this, as a first contribution, we introduce OR-VPE, a novel video pose embedding network that is streamlined to learn an occlusion-robust representation of pose sequences in videos. To enable the embedding network to handle partially visible joints, we incorporate a sub-graph data augmentation mechanism, which simulates occlusions during training, into a video pose encoder based on Graph Convolutional Networks (GCNs). As a second contribution, we apply a contrastive learning module to train the video pose representation in a self-supervised manner, without requiring action annotations. This is achieved by maximizing the mutual information between different spatio-temporal subgraphs pruned from the same pose sequence. Experimental analyses show that, compared with training the same encoder from scratch, the proposed OR-VPE, pre-trained on the large-scale NTU-RGB+D 120 dataset, improves downstream action classification performance on the Toyota Smarthome, N-UCLA, and Penn Action datasets.
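The two ideas in the abstract, occlusion-simulating sub-graph augmentation and contrastive training on two pruned views of the same sequence, can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names `prune_subgraph`, `toy_encoder`, and `info_nce`, the keep ratios, the tensor shapes, the linear stand-in for the GCN encoder, and the choice of an InfoNCE-style loss (a common surrogate whose minimization maximizes a lower bound on mutual information). The authors' actual pruning strategy and objective may differ.

```python
import numpy as np


def prune_subgraph(seq, joint_keep=0.7, frame_keep=0.8, rng=None):
    """Simulate occlusion on one pose sequence of shape (T, J, C).

    Whole joints (spatial pruning) and whole frames (temporal pruning)
    are dropped at random by zeroing them. Returns the masked copy and
    the boolean (T, J) visibility mask. Hypothetical helper.
    """
    rng = np.random.default_rng() if rng is None else rng
    T, J, _ = seq.shape
    joints = rng.random(J) < joint_keep
    frames = rng.random(T) < frame_keep
    # Always keep at least one joint and one frame.
    joints[rng.integers(J)] = True
    frames[rng.integers(T)] = True
    mask = frames[:, None] & joints[None, :]
    return seq * mask[:, :, None], mask


def toy_encoder(batch, weights):
    """Stand-in for a GCN encoder: temporal mean pooling + linear map.

    batch: (B, T, J, C), weights: (J * C, D). Returns (B, D).
    """
    B, T, J, C = batch.shape
    pooled = batch.reshape(B, T, J * C).mean(axis=1)
    return pooled @ weights


def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE-style loss between two embedding batches of shape (B, D).

    Row i of z_a and row i of z_b are two views of the same sequence
    (positives); all other row pairs act as negatives.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


# Toy usage: 4 random sequences, 30 frames, 25 joints, 3-D coordinates.
rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 30, 25, 3))
weights = rng.normal(size=(25 * 3, 16))

# Two independently pruned views of every sequence.
view_a = np.stack([prune_subgraph(s, rng=rng)[0] for s in batch])
view_b = np.stack([prune_subgraph(s, rng=rng)[0] for s in batch])

loss = info_nce(toy_encoder(view_a, weights), toy_encoder(view_b, weights))
print(f"contrastive loss on random data: {loss:.3f}")
```

In a real training loop the loss would be back-propagated through a learnable encoder, so that embeddings of differently occluded views of the same sequence are pulled together while embeddings of different sequences are pushed apart.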
Main file
FG2021_0008.pdf (1.82 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-03476564, version 1 (13-12-2021)

Cite

Di Yang, Yaohui Wang, Antitza Dantcheva, Lorenzo Garattoni, Gianpiero Francesca, et al. Self-Supervised Video Pose Representation Learning for Occlusion-Robust Action Recognition. FG 2021 - IEEE International Conference on Automatic Face and Gesture Recognition, Dec 2021, Jodhpur (Virtual), India. ⟨10.1109/FG52635.2021.9667032⟩. ⟨hal-03476564⟩