
FewVS: A Vision-Semantics Integration Framework for Few-Shot Image Classification

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published

Standard

FewVS: A Vision-Semantics Integration Framework for Few-Shot Image Classification. / Li, Zhuoling; Wang, Yong; Li, Kaitong.
MM '24: Proceedings of the 32nd ACM International Conference on Multimedia. ed. / Jianfei Cai; Mohan Kankanhalli. New York: ACM, 2024. p. 1341-1350.


Harvard

Li, Z, Wang, Y & Li, K 2024, FewVS: A Vision-Semantics Integration Framework for Few-Shot Image Classification. in J Cai & M Kankanhalli (eds), MM '24: Proceedings of the 32nd ACM International Conference on Multimedia. ACM, New York, pp. 1341-1350. https://doi.org/10.1145/3664647.3681427

APA

Li, Z., Wang, Y., & Li, K. (2024). FewVS: A Vision-Semantics Integration Framework for Few-Shot Image Classification. In J. Cai & M. Kankanhalli (Eds.), MM '24: Proceedings of the 32nd ACM International Conference on Multimedia (pp. 1341-1350). ACM. https://doi.org/10.1145/3664647.3681427

Vancouver

Li Z, Wang Y, Li K. FewVS: A Vision-Semantics Integration Framework for Few-Shot Image Classification. In Cai J, Kankanhalli M, editors, MM '24: Proceedings of the 32nd ACM International Conference on Multimedia. New York: ACM. 2024. p. 1341-1350. doi: 10.1145/3664647.3681427

Author

Li, Zhuoling ; Wang, Yong ; Li, Kaitong. / FewVS : A Vision-Semantics Integration Framework for Few-Shot Image Classification. MM '24: Proceedings of the 32nd ACM International Conference on Multimedia. editor / Jianfei Cai ; Mohan Kankanhalli. New York : ACM, 2024. pp. 1341-1350

Bibtex

@inproceedings{b087a6e5d3eb4d839d384fa2767805e9,
title = "FewVS: A Vision-Semantics Integration Framework for Few-Shot Image Classification",
abstract = "Some recent methods address few-shot image classification by extracting semantic information from class names and devising mechanisms for aligning vision and semantics to integrate information from both modalities. However, class names provide only limited information, which is insufficient to capture the visual details in images. As a result, such vision-semantics alignment is inherently biased, leading to suboptimal integration outcomes. In this paper, we avoid such biased vision-semantics alignment by introducing CLIP, a natural bridge between vision and semantics, and enforcing unbiased vision-vision alignment as a proxy task. Specifically, we align features encoded from the few-shot encoder and CLIP's vision encoder on the same image. This alignment is accomplished through a linear projection layer, with a training objective formulated using optimal transport-based assignment prediction. Thanks to the inherent alignment between CLIP's vision and text encoders, the few-shot encoder is indirectly aligned to CLIP's text encoder, which serves as the foundation for better vision-semantics integration. In addition, to further improve vision-semantics integration at the testing stage, we mine potential fine-grained semantic attributes of class names from large language models. Correspondingly, an online optimization module is designed to adaptively integrate the semantic attributes and visual information extracted from images. Extensive results on four datasets demonstrate that our method outperforms state-of-the-art methods. The code is available at https://github.com/zhuolingli/FewVS.",
author = "Zhuoling Li and Yong Wang and Kaitong Li",
year = "2024",
month = oct,
day = "28",
doi = "10.1145/3664647.3681427",
language = "English",
pages = "1341--1350",
editor = "Jianfei Cai and Mohan Kankanhalli",
booktitle = "MM '24: Proceedings of the 32nd ACM International Conference on Multimedia",
publisher = "ACM",
address = "New York",

}

RIS

TY - GEN

T1 - FewVS

T2 - A Vision-Semantics Integration Framework for Few-Shot Image Classification

AU - Li, Zhuoling

AU - Wang, Yong

AU - Li, Kaitong

PY - 2024/10/28

Y1 - 2024/10/28

N2 - Some recent methods address few-shot image classification by extracting semantic information from class names and devising mechanisms for aligning vision and semantics to integrate information from both modalities. However, class names provide only limited information, which is insufficient to capture the visual details in images. As a result, such vision-semantics alignment is inherently biased, leading to suboptimal integration outcomes. In this paper, we avoid such biased vision-semantics alignment by introducing CLIP, a natural bridge between vision and semantics, and enforcing unbiased vision-vision alignment as a proxy task. Specifically, we align features encoded from the few-shot encoder and CLIP's vision encoder on the same image. This alignment is accomplished through a linear projection layer, with a training objective formulated using optimal transport-based assignment prediction. Thanks to the inherent alignment between CLIP's vision and text encoders, the few-shot encoder is indirectly aligned to CLIP's text encoder, which serves as the foundation for better vision-semantics integration. In addition, to further improve vision-semantics integration at the testing stage, we mine potential fine-grained semantic attributes of class names from large language models. Correspondingly, an online optimization module is designed to adaptively integrate the semantic attributes and visual information extracted from images. Extensive results on four datasets demonstrate that our method outperforms state-of-the-art methods. The code is available at https://github.com/zhuolingli/FewVS.

AB - Some recent methods address few-shot image classification by extracting semantic information from class names and devising mechanisms for aligning vision and semantics to integrate information from both modalities. However, class names provide only limited information, which is insufficient to capture the visual details in images. As a result, such vision-semantics alignment is inherently biased, leading to suboptimal integration outcomes. In this paper, we avoid such biased vision-semantics alignment by introducing CLIP, a natural bridge between vision and semantics, and enforcing unbiased vision-vision alignment as a proxy task. Specifically, we align features encoded from the few-shot encoder and CLIP's vision encoder on the same image. This alignment is accomplished through a linear projection layer, with a training objective formulated using optimal transport-based assignment prediction. Thanks to the inherent alignment between CLIP's vision and text encoders, the few-shot encoder is indirectly aligned to CLIP's text encoder, which serves as the foundation for better vision-semantics integration. In addition, to further improve vision-semantics integration at the testing stage, we mine potential fine-grained semantic attributes of class names from large language models. Correspondingly, an online optimization module is designed to adaptively integrate the semantic attributes and visual information extracted from images. Extensive results on four datasets demonstrate that our method outperforms state-of-the-art methods. The code is available at https://github.com/zhuolingli/FewVS.

U2 - 10.1145/3664647.3681427

DO - 10.1145/3664647.3681427

M3 - Conference contribution/Paper

SP - 1341

EP - 1350

BT - MM '24: Proceedings of the 32nd ACM International Conference on Multimedia

A2 - Cai, Jianfei

A2 - Kankanhalli, Mohan

PB - ACM

CY - New York

ER -
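
For readers skimming the abstract, the core training mechanism (a linear projection that maps few-shot encoder features into CLIP's visual space, supervised by predicting optimal-transport assignments computed from CLIP's features of the same image) can be sketched as follows. This is a minimal illustrative sketch in PyTorch, assuming a SwAV-style Sinkhorn assignment objective; the module names, prototype layer, and hyperparameters are placeholders, not the authors' released implementation (see https://github.com/zhuolingli/FewVS for the official code).

# Illustrative sketch only (not the authors' code): align few-shot encoder
# features with frozen CLIP visual features via a linear projection, using a
# Sinkhorn-based optimal-transport assignment as the prediction target.
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn(scores: torch.Tensor, eps: float = 0.05, iters: int = 3) -> torch.Tensor:
    # Turn a (batch x prototypes) score matrix into a soft assignment whose
    # rows and columns are approximately balanced (optimal-transport style).
    q = torch.exp(scores / eps).t()              # prototypes x batch
    q /= q.sum()
    num_protos, batch = q.shape
    for _ in range(iters):
        q /= q.sum(dim=1, keepdim=True); q /= num_protos   # balance prototype mass
        q /= q.sum(dim=0, keepdim=True); q /= batch        # balance sample mass
    return (q * batch).t()                       # batch x prototypes, rows sum to 1

class VisionVisionAlignment(nn.Module):
    # Proxy task described in the abstract: project few-shot features into
    # CLIP's visual space and predict the assignment computed from CLIP features.
    def __init__(self, few_shot_dim: int, clip_dim: int, num_prototypes: int = 1024):
        super().__init__()
        self.proj = nn.Linear(few_shot_dim, clip_dim)              # the linear projection layer
        self.prototypes = nn.Linear(clip_dim, num_prototypes, bias=False)

    def forward(self, few_shot_feat: torch.Tensor, clip_feat: torch.Tensor) -> torch.Tensor:
        z = F.normalize(self.proj(few_shot_feat), dim=-1)          # projected few-shot view
        c = F.normalize(clip_feat, dim=-1)                         # frozen CLIP view of the same image
        with torch.no_grad():
            target = sinkhorn(self.prototypes(c))                  # OT-based soft assignment target
        pred = F.log_softmax(self.prototypes(z) / 0.1, dim=-1)     # temperature-scaled prediction
        return -(target * pred).sum(dim=-1).mean()                 # cross-entropy against OT targets

Because CLIP's vision and text encoders are already aligned, training against CLIP's visual features in this way indirectly aligns the few-shot encoder with CLIP's text encoder, which is the property the paper exploits for vision-semantics integration at test time.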