
Electronic data

  • Cross-Modal Contrastive Pre-training for Few-Shot Skeleton Action Recognition

    Accepted author manuscript, 3.64 MB, PDF document

    Available under license: CC BY: Creative Commons Attribution 4.0 International License

Links

Text available via DOI:


Cross-Modal Contrastive Pre-training for Few-Shot Skeleton Action Recognition

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published
  • Mingqi Lu
  • Siyuan Yang
  • Xiaobo Lu
  • Jun Liu
Journal publication date: 31/10/2024
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Issue number: 10
Volume: 34
Pages (from-to): 9798-9807
Publication status: Published
Original language: English

Abstract

This paper proposes a novel approach for few-shot skeleton action recognition that comprises two stages: cross-modal pre-training of a skeleton encoder, followed by fine-tuning of a cosine classifier on the support set. The pre-training and fine-tuning paradigm has been shown to handle few-shot tasks more effectively than more intricate meta-learning methods. However, its success relies on the availability of a large-scale training dataset, which is often difficult to obtain. To address this challenge, we introduce a cross-modal pre-training framework based on Bootstrap Your Own Latent (BYOL), which treats skeleton sequences and their corresponding videos as augmented views of the same action in different modalities. Using a simple regression loss, the framework transfers robust, high-quality vision-language representations to the skeleton encoder. This allows the skeleton encoder to gain a comprehensive understanding of action sequences and to benefit from the prior knowledge of a vision-language pre-trained model. The representation transfer enhances the feature extraction capability of the skeleton encoder, compensating for the lack of large-scale skeleton datasets. Extensive experiments on the NTU RGB+D, NTU RGB+D 120, PKU-MMD, NW-UCLA, and MSR Action Pairs datasets demonstrate that our approach achieves state-of-the-art performance for few-shot skeleton action recognition.
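
The abstract describes two components: a BYOL-style cross-modal regression objective that aligns the skeleton encoder with vision-language features of the paired video, and a cosine classifier fine-tuned on the few-shot support set. The PyTorch sketch below is a minimal illustration of those two pieces under stated assumptions (a placeholder `skeleton_encoder`, pre-extracted `video_feat` from a frozen vision-language model, and assumed feature dimensions); it is not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalBYOL(nn.Module):
    """Sketch: align a trainable skeleton encoder with frozen vision-language
    features of the paired video via a BYOL-style regression loss."""

    def __init__(self, skeleton_encoder, feat_dim=512, hidden_dim=256):
        super().__init__()
        self.skeleton_encoder = skeleton_encoder      # online branch (trainable, placeholder module)
        self.predictor = nn.Sequential(               # BYOL-style predictor head
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, skeleton_seq, video_feat):
        # video_feat: features of the paired video from a frozen
        # vision-language pre-trained model (target branch, no gradient)
        z_skel = self.skeleton_encoder(skeleton_seq)  # (B, feat_dim)
        p_skel = self.predictor(z_skel)
        p = F.normalize(p_skel, dim=-1)
        t = F.normalize(video_feat.detach(), dim=-1)
        # normalized regression loss, equivalent to 2 - 2 * cosine similarity
        return (2 - 2 * (p * t).sum(dim=-1)).mean()


class CosineClassifier(nn.Module):
    """Cosine classifier fine-tuned on the few-shot support set."""

    def __init__(self, feat_dim, num_classes, scale=10.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.scale = scale

    def forward(self, feat):
        # logits are scaled cosine similarities between features and class weights
        logits = F.linear(F.normalize(feat, dim=-1),
                          F.normalize(self.weight, dim=-1))
        return self.scale * logits
```

In this reading, pre-training optimizes only the regression loss between skeleton features and the frozen video-side features; the cosine classifier is then trained on top of the frozen (or lightly fine-tuned) skeleton encoder using the labeled support samples of each few-shot task.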