Cross-Modal Contrastive Pre-training for Few-Shot Skeleton Action Recognition

Computing and Communications

Electronic data

Cross-Modal Contrastive
Accepted author manuscript, 3.64 MB, PDF document
Available under license: CC BY: Creative Commons Attribution 4.0 International License

Text available via DOI:

https://doi.org/10.1109/TCSVT.2024.3402952
Final published version

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Mingqi Lu
Siyuan Yang
Xiaobo Lu
Jun Liu

Journal publication date	31/10/2024
Journal	IEEE Transactions on Circuits and Systems for Video Technology
Issue number	10
Volume	34
Pages (from-to)	9798-9807
Publication Status	Published
Original language	English

Abstract

This paper proposes a novel approach for few-shot skeleton action recognition that comprises two stages: cross-modal pre-training of a skeleton encoder, followed by fine-tuning of a cosine classifier on the support set. The pre-training and fine-tuning approach has been shown to handle few-shot tasks more effectively than more intricate meta-learning methods. However, its success relies on the availability of a large-scale training dataset, which is difficult to obtain. To address this challenge, we introduce a cross-modal pre-training framework based on Bootstrap Your Own Latent (BYOL), which treats skeleton sequences and their corresponding videos as augmented views of the same action in different modalities. Using a simple regression loss, the framework transfers robust, high-quality vision-language representations to the skeleton encoder. This allows the skeleton encoder to gain a comprehensive understanding of action sequences and to benefit from the prior knowledge of a vision-language pre-trained model. The representation transfer enhances the feature extraction capability of the skeleton encoder, compensating for the lack of large-scale skeleton datasets. Extensive experiments on the NTU RGB+D, NTU RGB+D 120, PKU-MMD, NW-UCLA, and MSR Action Pairs datasets demonstrate that the proposed approach achieves state-of-the-art performance for few-shot skeleton action recognition.
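The two stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the standard BYOL regression loss (the squared distance between L2-normalized vectors, which equals 2 minus twice their cosine similarity) and a generic cosine classifier that scores a feature against one weight vector per class. The function names, the plain-list vectors, and the `scale` value are hypothetical choices made for illustration only.

```python
import math


def l2_normalize(v):
    """Return v scaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))


def byol_regression_loss(online_prediction, target_projection):
    """Standard BYOL loss: squared distance between unit vectors.

    Assumption: the online vector comes from the skeleton branch and the
    target vector from the video branch of the same action clip.
    """
    return 2.0 - 2.0 * cosine(online_prediction, target_projection)


def cosine_classifier_logits(feature, class_weights, scale=10.0):
    """Scaled cosine similarity to each class weight (scale is hypothetical)."""
    return [scale * cosine(feature, w) for w in class_weights]


def predict(feature, class_weights):
    """Index of the class whose weight vector is most aligned with feature."""
    logits = cosine_classifier_logits(feature, class_weights)
    return max(range(len(logits)), key=logits.__getitem__)
```

Under these assumptions, the loss is zero when the skeleton and video vectors point in the same direction regardless of their magnitudes, and it grows as they diverge. In the fine-tuning stage, the class weight vectors would be fitted on the support set, and `predict` would then assign a query feature to the closest class by angle.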
