
Learning Comprehensive Representations with Richer Self for Text-to-Image Person Re-Identification

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published

Standard

Learning Comprehensive Representations with Richer Self for Text-to-Image Person Re-Identification. / Yan, Shuanglin; Dong, Neng; Liu, Jun et al.
MM '23: Proceedings of the 31st ACM International Conference on Multimedia. ed. / Abdulmotaleb El Saddik; Tao Mei; Rita Cucchiara. New York: ACM, 2023. p. 6202-6211.


Harvard

Yan, S, Dong, N, Liu, J, Zhang, L & Tang, J 2023, Learning Comprehensive Representations with Richer Self for Text-to-Image Person Re-Identification. in A El Saddik, T Mei & R Cucchiara (eds), MM '23: Proceedings of the 31st ACM International Conference on Multimedia. ACM, New York, pp. 6202-6211. https://doi.org/10.1145/3581783.3611832

APA

Yan, S., Dong, N., Liu, J., Zhang, L., & Tang, J. (2023). Learning Comprehensive Representations with Richer Self for Text-to-Image Person Re-Identification. In A. El Saddik, T. Mei, & R. Cucchiara (Eds.), MM '23: Proceedings of the 31st ACM International Conference on Multimedia (pp. 6202-6211). ACM. https://doi.org/10.1145/3581783.3611832

Vancouver

Yan S, Dong N, Liu J, Zhang L, Tang J. Learning Comprehensive Representations with Richer Self for Text-to-Image Person Re-Identification. In El Saddik A, Mei T, Cucchiara R, editors, MM '23: Proceedings of the 31st ACM International Conference on Multimedia. New York: ACM. 2023. p. 6202-6211. doi: 10.1145/3581783.3611832

Author

Yan, Shuanglin ; Dong, Neng ; Liu, Jun et al. / Learning Comprehensive Representations with Richer Self for Text-to-Image Person Re-Identification. MM '23: Proceedings of the 31st ACM International Conference on Multimedia. editor / Abdulmotaleb El Saddik ; Tao Mei ; Rita Cucchiara. New York : ACM, 2023. pp. 6202-6211

Bibtex

@inproceedings{00a5fe5e27754bbeb7d014438a0cc4a5,
title = "Learning Comprehensive Representations with Richer Self for Text-to-Image Person Re-Identification",
abstract = "Text-to-image person re-identification (TIReID) retrieves pedestrian images of the same identity based on a query text. However, existing methods typically treat it as a one-to-one image-text matching problem, only focusing on the relationship between image-text pairs within a view. The many-to-many matching between image-text pairs across views under the same identity is not taken into account, which is one of the main reasons for the poor performance of existing methods. To this end, we propose a simple yet effective framework, called LCR2S, for modeling many-to-many correspondences of the same identity by learning comprehensive representations for both modalities from a novel perspective. We construct a support set for each image (text) by using other images (texts) under the same identity and design a multi-head attentional fusion module to fuse the image (text) and its support set. The resulting enriched image and text features are aligned to train a {"}richer{"} TIReID model with many-to-many correspondences. Since the support set is unavailable during inference, we propose to distill the knowledge learned by the {"}richer{"} model into a lightweight model for inference with a single image/text as input. The lightweight model focus on semantic association and reasoning of multi-view information, which can generate a comprehensive representation containing multi-view information with only a single-view input to perform accurate text-to-image retrieval during inference. In particular, we use the intra-modal features and inter-modal semantic relations of the {"}richer{"} model to supervise the lightweight model to inherit its powerful capability. Extensive experiments demonstrate the effectiveness of LCR2S, and it also achieves new state-of-the-art performance on three popular TIReID datasets.",
author = "Shuanglin Yan and Neng Dong and Jun Liu and Liyan Zhang and Jinhui Tang",
year = "2023",
month = oct,
day = "26",
doi = "10.1145/3581783.3611832",
language = "English",
pages = "6202--6211",
editor = "{El Saddik}, Abdulmotaleb and Tao Mei and Rita Cucchiara",
booktitle = "MM '23: Proceedings of the 31st ACM International Conference on Multimedia",
publisher = "ACM",

}

RIS

TY - GEN

T1 - Learning Comprehensive Representations with Richer Self for Text-to-Image Person Re-Identification

AU - Yan, Shuanglin

AU - Dong, Neng

AU - Liu, Jun

AU - Zhang, Liyan

AU - Tang, Jinhui

PY - 2023/10/26

Y1 - 2023/10/26

N2 - Text-to-image person re-identification (TIReID) retrieves pedestrian images of the same identity based on a query text. However, existing methods typically treat it as a one-to-one image-text matching problem, only focusing on the relationship between image-text pairs within a view. The many-to-many matching between image-text pairs across views under the same identity is not taken into account, which is one of the main reasons for the poor performance of existing methods. To this end, we propose a simple yet effective framework, called LCR2S, for modeling many-to-many correspondences of the same identity by learning comprehensive representations for both modalities from a novel perspective. We construct a support set for each image (text) by using other images (texts) under the same identity and design a multi-head attentional fusion module to fuse the image (text) and its support set. The resulting enriched image and text features are aligned to train a "richer" TIReID model with many-to-many correspondences. Since the support set is unavailable during inference, we propose to distill the knowledge learned by the "richer" model into a lightweight model for inference with a single image/text as input. The lightweight model focuses on semantic association and reasoning of multi-view information, so it can generate a comprehensive representation containing multi-view information from only a single-view input and perform accurate text-to-image retrieval during inference. In particular, we use the intra-modal features and inter-modal semantic relations of the "richer" model to supervise the lightweight model so that it inherits the richer model's capability. Extensive experiments demonstrate the effectiveness of LCR2S, and it also achieves new state-of-the-art performance on three popular TIReID datasets.

AB - Text-to-image person re-identification (TIReID) retrieves pedestrian images of the same identity based on a query text. However, existing methods typically treat it as a one-to-one image-text matching problem, only focusing on the relationship between image-text pairs within a view. The many-to-many matching between image-text pairs across views under the same identity is not taken into account, which is one of the main reasons for the poor performance of existing methods. To this end, we propose a simple yet effective framework, called LCR2S, for modeling many-to-many correspondences of the same identity by learning comprehensive representations for both modalities from a novel perspective. We construct a support set for each image (text) by using other images (texts) under the same identity and design a multi-head attentional fusion module to fuse the image (text) and its support set. The resulting enriched image and text features are aligned to train a "richer" TIReID model with many-to-many correspondences. Since the support set is unavailable during inference, we propose to distill the knowledge learned by the "richer" model into a lightweight model for inference with a single image/text as input. The lightweight model focuses on semantic association and reasoning of multi-view information, so it can generate a comprehensive representation containing multi-view information from only a single-view input and perform accurate text-to-image retrieval during inference. In particular, we use the intra-modal features and inter-modal semantic relations of the "richer" model to supervise the lightweight model so that it inherits the richer model's capability. Extensive experiments demonstrate the effectiveness of LCR2S, and it also achieves new state-of-the-art performance on three popular TIReID datasets.

U2 - 10.1145/3581783.3611832

DO - 10.1145/3581783.3611832

M3 - Conference contribution/Paper

SP - 6202

EP - 6211

BT - MM '23: Proceedings of the 31st ACM International Conference on Multimedia

A2 - El Saddik, Abdulmotaleb

A2 - Mei, Tao

A2 - Cucchiara, Rita

PB - ACM

CY - New York

ER -
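
Illustrative sketch

The abstract describes two mechanisms: fusing each image (text) with a same-identity support set via multi-head attention to train a "richer" teacher model, and distilling that teacher into a lightweight single-view model using intra-modal features and inter-modal semantic relations. The sketch below is a minimal PyTorch-style illustration of both ideas; all module and function names (SupportSetFusion, relation_distill_loss) and hyperparameters are hypothetical assumptions for exposition, not taken from the authors' code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SupportSetFusion(nn.Module):
    """Fuse a single-view feature with its support set (other samples of the
    same identity) via multi-head attention, yielding an enriched feature."""
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, anchor: torch.Tensor, support: torch.Tensor) -> torch.Tensor:
        # anchor:  (B, D)    feature of the image or text itself
        # support: (B, K, D) features of K other samples sharing the identity
        query = anchor.unsqueeze(1)                    # (B, 1, D)
        fused, _ = self.attn(query, support, support)  # attend over the support set
        return self.norm(anchor + fused.squeeze(1))    # residual "richer" feature


def relation_distill_loss(t_img, t_txt, s_img, s_txt):
    # Hypothetical relation-level distillation: push the lightweight (student)
    # model's image-text similarity matrix toward the richer (teacher) model's.
    t_sim = F.normalize(t_img, dim=-1) @ F.normalize(t_txt, dim=-1).t()
    s_sim = F.normalize(s_img, dim=-1) @ F.normalize(s_txt, dim=-1).t()
    return F.kl_div(F.log_softmax(s_sim, dim=-1),
                    F.softmax(t_sim.detach(), dim=-1),
                    reduction="batchmean")


if __name__ == "__main__":
    fusion = SupportSetFusion(dim=512, num_heads=8)
    img = torch.randn(16, 512)           # single-view image features
    support = torch.randn(16, 4, 512)    # 4 same-identity support features each
    richer = fusion(img, support)        # (16, 512) enriched teacher-side features
    txt = torch.randn(16, 512)
    loss = relation_distill_loss(richer, txt, img, txt)
    print(richer.shape, loss.item())

As the abstract notes, the support set is unavailable at inference, so in this sketch the fusion module would only be used during training; at test time only the lightweight single-view branch would produce features for retrieval.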