Electronic data

  • UNetFormer_accepted

    Rights statement: This is the author’s version of a work that was accepted for publication in ISPRS Journal of Photogrammetry and Remote Sensing. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in ISPRS Journal of Photogrammetry and Remote Sensing, 190, 2022 DOI: 10.1016/j.isprsjprs.2022.06.008

    Accepted author manuscript, 8.64 MB, PDF document

    Available under license: CC BY-NC-ND: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License

Links

Text available via DOI: https://doi.org/10.1016/j.isprsjprs.2022.06.008

UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. / Wang, Libo; Li, Rui; Zhang, Ce et al.
In: ISPRS Journal of Photogrammetry and Remote Sensing, Vol. 190, 31.08.2022, p. 196-214.

Harvard

Wang, L, Li, R, Zhang, C, Fang, S, Duan, C, Meng, X & Atkinson, P 2022, 'UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery', ISPRS Journal of Photogrammetry and Remote Sensing, vol. 190, pp. 196-214. https://doi.org/10.1016/j.isprsjprs.2022.06.008

APA

Wang, L., Li, R., Zhang, C., Fang, S., Duan, C., Meng, X., & Atkinson, P. (2022). UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS Journal of Photogrammetry and Remote Sensing, 190, 196-214. https://doi.org/10.1016/j.isprsjprs.2022.06.008

Vancouver

Wang L, Li R, Zhang C, Fang S, Duan C, Meng X et al. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS Journal of Photogrammetry and Remote Sensing. 2022 Aug 31;190:196-214. Epub 2022 Jun 24. doi: 10.1016/j.isprsjprs.2022.06.008

Author

Wang, Libo; Li, Rui; Zhang, Ce et al. / UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. In: ISPRS Journal of Photogrammetry and Remote Sensing. 2022; Vol. 190. pp. 196-214.

Bibtex

@article{70f7bd2897da425da94e8050ad5f0872,
title = "UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery",
abstract = "Semantic segmentation of remotely sensed urban scene images is required in a wide range of practical applications, such as land cover mapping, urban change detection, environmental protection, and economic assessment. Driven by rapid developments in deep learning technologies, the convolutional neural network (CNN) has dominated semantic segmentation for many years. CNN adopts hierarchical feature representation, demonstrating strong capabilities for information extraction. However, the local property of the convolution layer limits the network from capturing the global context. Recently, as a hot topic in the domain of computer vision, Transformer has demonstrated its great potential in global information modelling, boosting many vision-related tasks such as image classification, object detection, and particularly semantic segmentation. In this paper, we propose a Transformer-based decoder and construct a UNet-like Transformer (UNetFormer) for real-time urban scene segmentation. For efficient segmentation, the UNetFormer selects the lightweight ResNet18 as the encoder and develops an efficient global–local attention mechanism to model both global and local information in the decoder. Extensive experiments reveal that our method not only runs faster but also produces higher accuracy compared with state-of-the-art lightweight models. Specifically, the proposed UNetFormer achieved 67.8% and 52.4% mIoU on the UAVid and LoveDA datasets, respectively, while the inference speed can achieve up to 322.4 FPS with a 512 × 512 input on a single NVIDIA RTX 3090 GPU. In further exploration, the proposed Transformer-based decoder combined with a Swin Transformer encoder also achieves the state-of-the-art result (91.3% F1 and 84.1% mIoU) on the Vaihingen dataset. The source code will be freely available at https://github.com/WangLibo1995/GeoSeg.",
keywords = "Semantic Segmentation, Remote Sensing, Vision Transformer, Fully Transformer Network, Global-local Context, Urban Scene",
author = "Libo Wang and Rui Li and Ce Zhang and Shenghui Fang and Chenxi Duan and Xiaoliang Meng and Peter Atkinson",
note = "This is the author{\textquoteright}s version of a work that was accepted for publication in ISPRS Journal of Photogrammetry and Remote Sensing. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in ISPRS Journal of Photogrammetry and Remote Sensing, 190, 2022 DOI: 10.1016/j.isprsjprs.2022.06.008",
year = "2022",
month = aug,
day = "31",
doi = "10.1016/j.isprsjprs.2022.06.008",
language = "English",
volume = "190",
pages = "196--214",
journal = "ISPRS Journal of Photogrammetry and Remote Sensing",
issn = "0924-2716",
publisher = "Elsevier Science B.V.",
}
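
The abstract describes the architecture only at a high level: a lightweight ResNet18 encoder paired with a Transformer-based decoder whose blocks model global and local context jointly. The PyTorch sketch below is a minimal illustration of that layout, not the authors' implementation (their code is at https://github.com/WangLibo1995/GeoSeg); the module names, channel widths, window size, and the exact fusion of the global and local branches are all assumptions made for illustration.

# Minimal sketch, assuming PyTorch and torchvision. NOT the authors'
# implementation (see https://github.com/WangLibo1995/GeoSeg). It illustrates
# the layout the abstract describes: a lightweight ResNet18 encoder and a
# decoder whose blocks combine a global (window self-attention) branch with
# a local (convolutional) branch. All names and widths are illustrative.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class GlobalLocalBlock(nn.Module):
    """Hypothetical stand-in for the paper's global-local attention: the
    global branch applies multi-head self-attention inside non-overlapping
    windows (the paper uses a window-based design, though its details
    differ); the local branch is a 3x3 convolution; fused by summation."""

    def __init__(self, dim: int, num_heads: int = 8, window: int = 8):
        super().__init__()
        self.window = window
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        ws = self.window                  # h and w must be divisible by ws
        t = x.reshape(b, c, h // ws, ws, w // ws, ws)
        t = t.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, c)  # window tokens
        t = self.norm(t)
        g, _ = self.attn(t, t, t)         # attention within each window
        g = g.reshape(b, h // ws, w // ws, ws, ws, c)
        g = g.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)
        return x + g + self.local(x)      # fuse global and local branches


class UNetFormerSketch(nn.Module):
    """UNet-like layout: four ResNet18 encoder stages, and decoder steps that
    upsample, add the skip connection, and apply a GlobalLocalBlock."""

    def __init__(self, num_classes: int = 8, dim: int = 64):
        super().__init__()
        r = resnet18(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)  # 1/4 scale
        self.stages = nn.ModuleList([r.layer1, r.layer2, r.layer3, r.layer4])
        self.proj = nn.ModuleList(
            [nn.Conv2d(c, dim, kernel_size=1) for c in (64, 128, 256, 512)]
        )
        self.decoder = nn.ModuleList([GlobalLocalBlock(dim) for _ in range(3)])
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.head = nn.Conv2d(dim, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[2:]
        feats, y = [], self.stem(x)
        for stage, proj in zip(self.stages, self.proj):
            y = stage(y)
            feats.append(proj(y))         # unify channel widths for fusion
        y = feats[-1]                     # deepest feature map (1/32 scale)
        for skip, block in zip(reversed(feats[:-1]), self.decoder):
            y = block(self.up(y) + skip)  # upsample + skip + global-local block
        return nn.functional.interpolate(
            self.head(y), size=(h, w), mode="bilinear", align_corners=False
        )


# Shape check with a 512 x 512 input, the size used for the FPS benchmark in
# the abstract; num_classes=8 matches the UAVid label set.
logits = UNetFormerSketch(num_classes=8)(torch.randn(1, 3, 512, 512))
print(logits.shape)  # torch.Size([1, 8, 512, 512])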

RIS

TY - JOUR

T1 - UNetFormer

T2 - A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery

AU - Wang, Libo

AU - Li, Rui

AU - Zhang, Ce

AU - Fang, Shenghui

AU - Duan, Chenxi

AU - Meng, Xiaoliang

AU - Atkinson, Peter

N1 - This is the author’s version of a work that was accepted for publication in ISPRS Journal of Photogrammetry and Remote Sensing. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in ISPRS Journal of Photogrammetry and Remote Sensing, 190, 2022 DOI: 10.1016/j.isprsjprs.2022.06.008

PY - 2022/8/31

Y1 - 2022/8/31

N2 - Semantic segmentation of remotely sensed urban scene images is required in a wide range of practical applications, such as land cover mapping, urban change detection, environmental protection, and economic assessment. Driven by rapid developments in deep learning technologies, the convolutional neural network (CNN) has dominated semantic segmentation for many years. CNN adopts hierarchical feature representation, demonstrating strong capabilities for information extraction. However, the local property of the convolution layer limits the network from capturing the global context. Recently, as a hot topic in the domain of computer vision, Transformer has demonstrated its great potential in global information modelling, boosting many vision-related tasks such as image classification, object detection, and particularly semantic segmentation. In this paper, we propose a Transformer-based decoder and construct a UNet-like Transformer (UNetFormer) for real-time urban scene segmentation. For efficient segmentation, the UNetFormer selects the lightweight ResNet18 as the encoder and develops an efficient global–local attention mechanism to model both global and local information in the decoder. Extensive experiments reveal that our method not only runs faster but also produces higher accuracy compared with state-of-the-art lightweight models. Specifically, the proposed UNetFormer achieved 67.8% and 52.4% mIoU on the UAVid and LoveDA datasets, respectively, while the inference speed can achieve up to 322.4 FPS with a 512 × 512 input on a single NVIDIA RTX 3090 GPU. In further exploration, the proposed Transformer-based decoder combined with a Swin Transformer encoder also achieves the state-of-the-art result (91.3% F1 and 84.1% mIoU) on the Vaihingen dataset. The source code will be freely available at https://github.com/WangLibo1995/GeoSeg.

AB - Semantic segmentation of remotely sensed urban scene images is required in a wide range of practical applications, such as land cover mapping, urban change detection, environmental protection, and economic assessment. Driven by rapid developments in deep learning technologies, the convolutional neural network (CNN) has dominated semantic segmentation for many years. CNN adopts hierarchical feature representation, demonstrating strong capabilities for information extraction. However, the local property of the convolution layer limits the network from capturing the global context. Recently, as a hot topic in the domain of computer vision, Transformer has demonstrated its great potential in global information modelling, boosting many vision-related tasks such as image classification, object detection, and particularly semantic segmentation. In this paper, we propose a Transformer-based decoder and construct a UNet-like Transformer (UNetFormer) for real-time urban scene segmentation. For efficient segmentation, the UNetFormer selects the lightweight ResNet18 as the encoder and develops an efficient global–local attention mechanism to model both global and local information in the decoder. Extensive experiments reveal that our method not only runs faster but also produces higher accuracy compared with state-of-the-art lightweight models. Specifically, the proposed UNetFormer achieved 67.8% and 52.4% mIoU on the UAVid and LoveDA datasets, respectively, while the inference speed can achieve up to 322.4 FPS with a 512 × 512 input on a single NVIDIA RTX 3090 GPU. In further exploration, the proposed Transformer-based decoder combined with a Swin Transformer encoder also achieves the state-of-the-art result (91.3% F1 and 84.1% mIoU) on the Vaihingen dataset. The source code will be freely available at https://github.com/WangLibo1995/GeoSeg.

KW - Semantic Segmentation

KW - Remote Sensing

KW - Vision Transformer

KW - Fully Transformer Network

KW - Global-local Context

KW - Urban Scene

U2 - 10.1016/j.isprsjprs.2022.06.008

DO - 10.1016/j.isprsjprs.2022.06.008

M3 - Journal article

VL - 190

SP - 196

EP - 214

JO - ISPRS Journal of Photogrammetry and Remote Sensing

JF - ISPRS Journal of Photogrammetry and Remote Sensing

SN - 0924-2716

ER -