
M3Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published

Standard

M3Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition. / Tang, Hao; Liu, Jun; Yan, Shuanglin et al.
MM '23: Proceedings of the 31st ACM International Conference on Multimedia. New York: ACM, 2023. p. 1719-1728.

Harvard

Tang, H, Liu, J, Yan, S, Yan, R, Li, Z & Tang, J 2023, M3Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition. in MM '23: Proceedings of the 31st ACM International Conference on Multimedia. ACM, New York, pp. 1719-1728. https://doi.org/10.1145/3581783.3612221

APA

Tang, H., Liu, J., Yan, S., Yan, R., Li, Z., & Tang, J. (2023). M3Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition. In MM '23: Proceedings of the 31st ACM International Conference on Multimedia (pp. 1719-1728). ACM. https://doi.org/10.1145/3581783.3612221

Vancouver

Tang H, Liu J, Yan S, Yan R, Li Z, Tang J. M3Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition. In MM '23: Proceedings of the 31st ACM International Conference on Multimedia. New York: ACM. 2023. p. 1719-1728. doi: 10.1145/3581783.3612221

Author

Tang, Hao ; Liu, Jun ; Yan, Shuanglin et al. / M3Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition. MM '23: Proceedings of the 31st ACM International Conference on Multimedia. New York : ACM, 2023. pp. 1719-1728

Bibtex

@inproceedings{b792fc8171b548e4b1b088a226508093,
title = "M3Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition",
abstract = "Due to the scarcity of manually annotated data required for fine-grained video understanding, few-shot fine-grained (FS-FG) action recognition has gained significant attention, with the aim of classifying novel fine-grained action categories with only a few labeled instances. Despite the progress made in FS coarse-grained action recognition, current approaches encounter two challenges when dealing with the fine-grained action categories: the inability to capture subtle action details and the insufficiency of learning from limited data that exhibit high intra-class variance and inter-class similarity. To address these limitations, we propose M3Net, a matching-based framework for FS-FG action recognition, which incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making across multiple viewpoints.Multi-view encoding captures rich contextual details from the intra-frame, intra-video, and intra-episode perspectives, generating customized higher-order embeddings for fine-grained data.Multi-view matching integrates various matching functions enabling flexible relation modeling within limited samples to handle multi-scale spatio-temporal variations by leveraging the instance-specific, category-specific, and task-specific perspectives. Multi-view fusion consists of matching-predictions fusion and matching-losses fusion over the above views, where the former promotes mutual complementarity and the latter enhances embedding generalizability by employing multi-task collaborative learning. Explainable visualizations and experimental results on three challenging benchmarks demonstrate the superiority of M3Net in capturing fine-grained action details and achieving state-of-the-art performance for FS-FG action recognition.",
author = "Hao Tang and Jun Liu and Shuanglin Yan and Rui Yan and Zechao Li and Jinhui Tang",
year = "2023",
month = oct,
day = "26",
doi = "10.1145/3581783.3612221",
language = "English",
pages = "1719--1728",
booktitle = "MM '23: Proceedings of the 31st ACM International Conference on Multimedia Action Recognition",
publisher = "ACM",

}
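
The abstract describes two fusion steps in prose: matching-predictions fusion (combining the logits of several matching functions) and matching-losses fusion (training every view jointly, multi-task style). The following Python/PyTorch sketch is a minimal, hypothetical rendering of those two steps only; the function names, the cosine-prototype matcher, and the uniform fusion weights are our assumptions for illustration, not the authors' released implementation.

import torch
import torch.nn.functional as F

def cosine_matcher(support, query):
    # Stand-in for one of the paper's instance-/category-/task-specific
    # matching functions: compare class prototypes to queries by cosine
    # similarity. support: (n_class, n_shot, dim), query: (n_query, dim).
    protos = support.mean(dim=1)                                  # (n_class, dim)
    return F.normalize(query, dim=-1) @ F.normalize(protos, dim=-1).T

def fuse_predictions(support, query, matchers, weights):
    # Matching-predictions fusion: a weighted sum of each view's logits.
    per_view = [m(support, query) for m in matchers]              # each (n_query, n_class)
    fused = sum(w * p for w, p in zip(weights, per_view))
    return fused, per_view

def fuse_losses(fused, per_view, labels):
    # Matching-losses fusion: every view's cross-entropy is optimized
    # alongside the fused prediction (multi-task collaborative learning).
    loss = F.cross_entropy(fused, labels)
    for logits in per_view:
        loss = loss + F.cross_entropy(logits, labels)
    return loss

# Example 5-way 1-shot episode with 10 queries and 64-dim embeddings.
support = torch.randn(5, 1, 64)
query = torch.randn(10, 64)
labels = torch.randint(0, 5, (10,))
matchers = [cosine_matcher] * 3        # placeholder for three distinct matching views
fused, per_view = fuse_predictions(support, query, matchers, [1.0, 1.0, 1.0])
loss = fuse_losses(fused, per_view, labels)

Summing each view's loss with the fused loss mirrors the multi-task collaboration the abstract credits with improving embedding generalizability, while the weighted logit sum mirrors the mutual complementarity attributed to prediction fusion.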

RIS

TY - GEN

T1 - M3Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition

AU - Tang, Hao

AU - Liu, Jun

AU - Yan, Shuanglin

AU - Yan, Rui

AU - Li, Zechao

AU - Tang, Jinhui

PY - 2023/10/26

Y1 - 2023/10/26

N2 - Due to the scarcity of manually annotated data required for fine-grained video understanding, few-shot fine-grained (FS-FG) action recognition has gained significant attention, with the aim of classifying novel fine-grained action categories with only a few labeled instances. Despite the progress made in FS coarse-grained action recognition, current approaches encounter two challenges when dealing with the fine-grained action categories: the inability to capture subtle action details and the insufficiency of learning from limited data that exhibit high intra-class variance and inter-class similarity. To address these limitations, we propose M3Net, a matching-based framework for FS-FG action recognition, which incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making across multiple viewpoints. Multi-view encoding captures rich contextual details from the intra-frame, intra-video, and intra-episode perspectives, generating customized higher-order embeddings for fine-grained data. Multi-view matching integrates various matching functions enabling flexible relation modeling within limited samples to handle multi-scale spatio-temporal variations by leveraging the instance-specific, category-specific, and task-specific perspectives. Multi-view fusion consists of matching-predictions fusion and matching-losses fusion over the above views, where the former promotes mutual complementarity and the latter enhances embedding generalizability by employing multi-task collaborative learning. Explainable visualizations and experimental results on three challenging benchmarks demonstrate the superiority of M3Net in capturing fine-grained action details and achieving state-of-the-art performance for FS-FG action recognition.

AB - Due to the scarcity of manually annotated data required for fine-grained video understanding, few-shot fine-grained (FS-FG) action recognition has gained significant attention, with the aim of classifying novel fine-grained action categories with only a few labeled instances. Despite the progress made in FS coarse-grained action recognition, current approaches encounter two challenges when dealing with the fine-grained action categories: the inability to capture subtle action details and the insufficiency of learning from limited data that exhibit high intra-class variance and inter-class similarity. To address these limitations, we propose M3Net, a matching-based framework for FS-FG action recognition, which incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making across multiple viewpoints. Multi-view encoding captures rich contextual details from the intra-frame, intra-video, and intra-episode perspectives, generating customized higher-order embeddings for fine-grained data. Multi-view matching integrates various matching functions enabling flexible relation modeling within limited samples to handle multi-scale spatio-temporal variations by leveraging the instance-specific, category-specific, and task-specific perspectives. Multi-view fusion consists of matching-predictions fusion and matching-losses fusion over the above views, where the former promotes mutual complementarity and the latter enhances embedding generalizability by employing multi-task collaborative learning. Explainable visualizations and experimental results on three challenging benchmarks demonstrate the superiority of M3Net in capturing fine-grained action details and achieving state-of-the-art performance for FS-FG action recognition.

U2 - 10.1145/3581783.3612221

DO - 10.1145/3581783.3612221

M3 - Conference contribution/Paper

SP - 1719

EP - 1728

BT - MM '23: Proceedings of the 31st ACM International Conference on Multimedia

PB - ACM

CY - New York

ER -