Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review
}
TY - GEN
T1 - M3Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition
AU - Tang, Hao
AU - Liu, Jun
AU - Yan, Shuanglin
AU - Yan, Rui
AU - Li, Zechao
AU - Tang, Jinhui
PY - 2023/10/26
Y1 - 2023/10/26
N2 - Due to the scarcity of manually annotated data required for fine-grained video understanding, few-shot fine-grained (FS-FG) action recognition has gained significant attention, with the aim of classifying novel fine-grained action categories with only a few labeled instances. Despite the progress made in FS coarse-grained action recognition, current approaches encounter two challenges when dealing with the fine-grained action categories: the inability to capture subtle action details and the insufficiency of learning from limited data that exhibit high intra-class variance and inter-class similarity. To address these limitations, we propose M3Net, a matching-based framework for FS-FG action recognition, which incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making across multiple viewpoints. Multi-view encoding captures rich contextual details from the intra-frame, intra-video, and intra-episode perspectives, generating customized higher-order embeddings for fine-grained data. Multi-view matching integrates various matching functions enabling flexible relation modeling within limited samples to handle multi-scale spatio-temporal variations by leveraging the instance-specific, category-specific, and task-specific perspectives. Multi-view fusion consists of matching-predictions fusion and matching-losses fusion over the above views, where the former promotes mutual complementarity and the latter enhances embedding generalizability by employing multi-task collaborative learning. Explainable visualizations and experimental results on three challenging benchmarks demonstrate the superiority of M3Net in capturing fine-grained action details and achieving state-of-the-art performance for FS-FG action recognition.
AB - Due to the scarcity of manually annotated data required for fine-grained video understanding, few-shot fine-grained (FS-FG) action recognition has gained significant attention, with the aim of classifying novel fine-grained action categories with only a few labeled instances. Despite the progress made in FS coarse-grained action recognition, current approaches encounter two challenges when dealing with the fine-grained action categories: the inability to capture subtle action details and the insufficiency of learning from limited data that exhibit high intra-class variance and inter-class similarity. To address these limitations, we propose M3Net, a matching-based framework for FS-FG action recognition, which incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making across multiple viewpoints. Multi-view encoding captures rich contextual details from the intra-frame, intra-video, and intra-episode perspectives, generating customized higher-order embeddings for fine-grained data. Multi-view matching integrates various matching functions enabling flexible relation modeling within limited samples to handle multi-scale spatio-temporal variations by leveraging the instance-specific, category-specific, and task-specific perspectives. Multi-view fusion consists of matching-predictions fusion and matching-losses fusion over the above views, where the former promotes mutual complementarity and the latter enhances embedding generalizability by employing multi-task collaborative learning. Explainable visualizations and experimental results on three challenging benchmarks demonstrate the superiority of M3Net in capturing fine-grained action details and achieving state-of-the-art performance for FS-FG action recognition.
U2 - 10.1145/3581783.3612221
DO - 10.1145/3581783.3612221
M3 - Conference contribution/Paper
SP - 1719
EP - 1728
BT - MM '23: Proceedings of the 31st ACM International Conference on Multimedia
PB - ACM
CY - New York
ER -