
Electronic data

  • Accepted author manuscript ("paper"), 3.54 MB, PDF document
    Available under license: CC BY 4.0 (Creative Commons Attribution 4.0 International)

Links

Text available via DOI:


CPAL: Cross-prompting Adapter with LoRAs for RGB+X Semantic Segmentation

Research output: Contribution to Journal/Magazine › Journal article › peer-review

  • Ye Liu
  • Pengfei Wu
  • Miaohui Wang
  • Jun Liu
Journal publication date: 29/01/2025
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Publication status: E-pub ahead of print
Early online date: 29/01/2025
Original language: English

Abstract

As sensor technology evolves, RGB+X systems combine traditional RGB cameras with an auxiliary sensor of another type, which enhances perception capabilities and provides richer information for important tasks such as semantic segmentation. However, acquiring large-scale RGB+X data is difficult because it requires specialized acquisition equipment, so traditional RGB+X segmentation methods often pretrain on the relatively abundant RGB data. These methods, however, lack mechanisms to fully exploit the pretrained model, and the scope of the RGB pretraining dataset remains limited. Recent works have employed prompt learning to tap the potential of pretrained foundation models, but they adopt a unidirectional prompting approach, i.e., using the X or RGB+X modality to prompt foundation models pretrained on RGB data, neglecting the potential of non-RGB modalities. In this paper, we are dedicated to developing the potential of pretrained foundation models in both RGB and non-RGB modalities simultaneously, which is non-trivial due to the semantic gap between modalities. Specifically, we present CPAL (Cross-prompting Adapter with LoRAs), a framework featuring a novel bi-directional adapter that simultaneously exploits the complementarity between modalities and bridges the semantic gap between them. Additionally, CPAL introduces low-rank adaptation (LoRA) to fine-tune the foundation model of each modality. With these components, we successfully unleash the potential of RGB foundation models in both RGB and non-RGB modalities simultaneously. Our method achieves state-of-the-art (SOTA) performance on five multi-modal benchmarks, including RGB+Depth, RGB+Thermal, RGB+Event, and a multi-modal video object segmentation benchmark, as well as four multi-modal salient object detection benchmarks. The code and results are available at: https://github.com/abelny56/CPAL.
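To make the two ideas named in the abstract concrete, the following PyTorch sketch illustrates (a) a LoRA-wrapped linear layer, where a frozen pretrained weight receives a trainable low-rank update, and (b) a bi-directional cross-prompting adapter in which RGB tokens attend to the auxiliary-modality tokens and vice versa. This is a minimal sketch under assumed interfaces: the class names (LoRALinear, CrossPromptAdapter), the attention-based prompting, and the token shapes are illustrative choices, not the authors' implementation, which is available at the GitHub link above.

```python
# Illustrative sketch only; not the released CPAL code.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = Wx + (alpha/r) * B(Ax)."""

    def __init__(self, in_dim: int, out_dim: int, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        self.base.weight.requires_grad_(False)  # pretrained weight stays frozen
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(in_dim, rank, bias=False)   # trainable down-projection A
        self.lora_b = nn.Linear(rank, out_dim, bias=False)  # trainable up-projection B
        nn.init.zeros_(self.lora_b.weight)  # zero-init so the LoRA path is a no-op at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


class CrossPromptAdapter(nn.Module):
    """Bi-directional prompting: each modality's tokens are refined by
    cross-attending to the other modality before re-entering its backbone."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.rgb_from_x = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.x_from_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_x = nn.LayerNorm(dim)

    def forward(self, rgb_tokens: torch.Tensor, x_tokens: torch.Tensor):
        # RGB tokens query the auxiliary modality, and vice versa (residual updates).
        rgb_prompted, _ = self.rgb_from_x(self.norm_rgb(rgb_tokens), x_tokens, x_tokens)
        x_prompted, _ = self.x_from_rgb(self.norm_x(x_tokens), rgb_tokens, rgb_tokens)
        return rgb_tokens + rgb_prompted, x_tokens + x_prompted


if __name__ == "__main__":
    # Toy token sequences (batch, tokens, dim), e.g. patch embeddings from two backbones.
    rgb = torch.randn(2, 196, 256)
    aux = torch.randn(2, 196, 256)
    adapter = CrossPromptAdapter(dim=256)
    rgb_out, aux_out = adapter(rgb, aux)
    print(rgb_out.shape, aux_out.shape)  # torch.Size([2, 196, 256]) for both
```

In this reading of the abstract, only the LoRA projections and the adapter are trained, while both modality backbones remain frozen; the symmetric attention paths are what make the prompting bi-directional rather than RGB-only.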