TY - JOUR
T1 - Cross-Modal Contrastive Pre-training for Few-Shot Skeleton Action Recognition
AU - Lu, Mingqi
AU - Yang, Siyuan
AU - Lu, Xiaobo
AU - Liu, Jun
PY - 2024/10/31
Y1 - 2024/10/31
N2 - This paper proposes a novel approach for few-shot skeleton action recognition that comprises two stages: cross-modal pre-training of a skeleton encoder, followed by fine-tuning of a cosine classifier on the support set. The pre-training and fine-tuning approach has been demonstrated to be more effective for handling few-shot tasks than more intricate meta-learning methods. However, its success relies on the availability of a large-scale training dataset, which is difficult to obtain. To address this challenge, we introduce a cross-modal pre-training framework based on Bootstrap Your Own Latent (BYOL), which treats skeleton sequences and their corresponding videos as augmented views of the same action in different modalities. Using a simple regression loss, the framework transfers robust, high-quality vision-language representations to the skeleton encoder. This allows the skeleton encoder to gain a comprehensive understanding of action sequences and to benefit from the prior knowledge of a vision-language pre-trained model. The representation transfer enhances the feature extraction capability of the skeleton encoder, compensating for the lack of large-scale skeleton datasets. Extensive experiments on the NTU RGB+D, NTU RGB+D 120, PKU-MMD, NW-UCLA, and MSR Action Pairs datasets demonstrate that our proposed approach achieves state-of-the-art performance for few-shot skeleton action recognition.
AB - This paper proposes a novel approach for few-shot skeleton action recognition that comprises two stages: cross-modal pre-training of a skeleton encoder, followed by fine-tuning of a cosine classifier on the support set. The pre-training and fine-tuning approach has been demonstrated to be more effective for handling few-shot tasks than more intricate meta-learning methods. However, its success relies on the availability of a large-scale training dataset, which is difficult to obtain. To address this challenge, we introduce a cross-modal pre-training framework based on Bootstrap Your Own Latent (BYOL), which treats skeleton sequences and their corresponding videos as augmented views of the same action in different modalities. Using a simple regression loss, the framework transfers robust, high-quality vision-language representations to the skeleton encoder. This allows the skeleton encoder to gain a comprehensive understanding of action sequences and to benefit from the prior knowledge of a vision-language pre-trained model. The representation transfer enhances the feature extraction capability of the skeleton encoder, compensating for the lack of large-scale skeleton datasets. Extensive experiments on the NTU RGB+D, NTU RGB+D 120, PKU-MMD, NW-UCLA, and MSR Action Pairs datasets demonstrate that our proposed approach achieves state-of-the-art performance for few-shot skeleton action recognition.
U2 - 10.1109/TCSVT.2024.3402952
DO - 10.1109/TCSVT.2024.3402952
M3 - Journal article
VL - 34
SP - 9798
EP - 9807
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
SN - 1051-8215
IS - 10
ER -
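
Note: the abstract above outlines a two-stage recipe, BYOL-style cross-modal pre-training of a skeleton encoder against features of the corresponding videos using a simple regression loss, followed by fitting a cosine classifier on the few-shot support set. The following is a minimal PyTorch sketch of that idea only, not the authors' implementation; the GRU encoder, the feature dimension (256), the input shapes, and the random tensors standing in for video features are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SkeletonEncoder(nn.Module):
    """Placeholder skeleton encoder (assumed architecture): (B, T, J*C) -> (B, D)."""
    def __init__(self, in_dim=75, dim=256):
        super().__init__()
        self.net = nn.GRU(in_dim, dim, batch_first=True)

    def forward(self, x):
        _, h = self.net(x)          # h: (1, B, D)
        return h.squeeze(0)         # (B, D)

def byol_regression_loss(online_pred, target_feat):
    """Negative-cosine regression loss in the style of BYOL; the target branch gets no gradient."""
    p = F.normalize(online_pred, dim=-1)
    z = F.normalize(target_feat.detach(), dim=-1)
    return 2.0 - 2.0 * (p * z).sum(dim=-1).mean()

class CosineClassifier(nn.Module):
    """Cosine classifier fine-tuned on the support set at test time."""
    def __init__(self, dim, num_classes, scale=10.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, dim))
        self.scale = scale

    def forward(self, feat):
        return self.scale * F.normalize(feat, dim=-1) @ F.normalize(self.weight, dim=-1).t()

if __name__ == "__main__":
    # Stage 1 (illustrative step): regress the skeleton feature onto the frozen
    # video-encoder feature of the same clip (random tensor here as a stand-in).
    skel_enc = SkeletonEncoder()
    predictor = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
    opt = torch.optim.Adam(list(skel_enc.parameters()) + list(predictor.parameters()), lr=1e-3)

    skel = torch.randn(8, 64, 75)        # (batch, frames, joints*coords)
    video_feat = torch.randn(8, 256)     # assumed frozen vision-language feature

    loss = byol_regression_loss(predictor(skel_enc(skel)), video_feat)
    opt.zero_grad(); loss.backward(); opt.step()
    print(f"pre-training loss: {loss.item():.4f}")

    # Stage 2 (illustrative): fit the cosine classifier on support-set features.
    clf = CosineClassifier(dim=256, num_classes=5)
    support_feat = skel_enc(torch.randn(5, 64, 75)).detach()
    print("support logits shape:", tuple(clf(support_feat).shape))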