Electronic data

  • COSE-D-24-02355

    Accepted author manuscript, 1.61 MB, PDF document

    Available under license: CC BY: Creative Commons Attribution 4.0 International License

Links

Text available via DOI: https://doi.org/10.1016/j.cose.2024.104218


Interpretable adversarial example detection via high-level concept activation vector

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

Interpretable adversarial example detection via high-level concept activation vector. / Li, Jiaxing; Tan, Yu-an; Liu, Xinyu et al.
In: Computers and Security, Vol. 150, 104218, 31.03.2025.

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Harvard

Li, J, Tan, Y, Liu, X, Meng, W & Li, Y 2025, 'Interpretable adversarial example detection via high-level concept activation vector', Computers and Security, vol. 150, 104218. https://doi.org/10.1016/j.cose.2024.104218

APA

Li, J., Tan, Y., Liu, X., Meng, W., & Li, Y. (2025). Interpretable adversarial example detection via high-level concept activation vector. Computers and Security, 150, Article 104218. https://doi.org/10.1016/j.cose.2024.104218

Vancouver

Li J, Tan Y, Liu X, Meng W, Li Y. Interpretable adversarial example detection via high-level concept activation vector. Computers and Security. 2025 Mar 31;150:104218. Epub 2024 Nov 30. doi: 10.1016/j.cose.2024.104218

Author

Li, Jiaxing ; Tan, Yu-an ; Liu, Xinyu et al. / Interpretable adversarial example detection via high-level concept activation vector. In: Computers and Security. 2025 ; Vol. 150.

Bibtex

@article{05d20eb7a4e547bb865e66bbf3fcc445,
title = "Interpretable adversarial example detection via high-level concept activation vector",
abstract = "Deep neural networks have achieved amazing performance in many tasks. However, they are easily fooled by small perturbations added to the input. Such small perturbations to image data are usually imperceptible to humans. The uninterpretable nature of deep learning systems is considered to be one of the reasons why they are vulnerable to adversarial attacks. For enhanced trust and confidence, it is crucial for artificial intelligence systems to ensure transparency, reliability, and human comprehensibility in their decision-making processes as they gain wider acceptance among the general public. In this paper, we propose an approach for defending against adversarial attacks based on conceptually interpretable techniques. Our approach to model interpretation is on high-level concepts rather than low-level pixel features. Our key finding is that adding small perturbations leads to large changes in the model concept vector tests. Based on this, we design a single image concept vector testing method for detecting adversarial examples. Our experiments on the Imagenet dataset show that our method can achieve an average accuracy of over 95%. We provide source code in the supplementary material.",
author = "Jiaxing Li and Yu-an Tan and Xinyu Liu and Weizhi Meng and Yuanzhang Li",
year = "2025",
month = mar,
day = "31",
doi = "10.1016/j.cose.2024.104218",
language = "English",
volume = "150",
journal = "Computers and Security",
issn = "0167-4048",
publisher = "Elsevier Ltd",

}

RIS

TY - JOUR

T1 - Interpretable adversarial example detection via high-level concept activation vector

AU - Li, Jiaxing

AU - Tan, Yu-an

AU - Liu, Xinyu

AU - Meng, Weizhi

AU - Li, Yuanzhang

PY - 2025/3/31

Y1 - 2025/3/31

N2 - Deep neural networks have achieved amazing performance in many tasks. However, they are easily fooled by small perturbations added to the input. Such small perturbations to image data are usually imperceptible to humans. The uninterpretable nature of deep learning systems is considered to be one of the reasons why they are vulnerable to adversarial attacks. For enhanced trust and confidence, it is crucial for artificial intelligence systems to ensure transparency, reliability, and human comprehensibility in their decision-making processes as they gain wider acceptance among the general public. In this paper, we propose an approach for defending against adversarial attacks based on conceptually interpretable techniques. Our approach to model interpretation is on high-level concepts rather than low-level pixel features. Our key finding is that adding small perturbations leads to large changes in the model concept vector tests. Based on this, we design a single image concept vector testing method for detecting adversarial examples. Our experiments on the Imagenet dataset show that our method can achieve an average accuracy of over 95%. We provide source code in the supplementary material.

AB - Deep neural networks have achieved amazing performance in many tasks. However, they are easily fooled by small perturbations added to the input. Such small perturbations to image data are usually imperceptible to humans. The uninterpretable nature of deep learning systems is considered to be one of the reasons why they are vulnerable to adversarial attacks. For enhanced trust and confidence, it is crucial for artificial intelligence systems to ensure transparency, reliability, and human comprehensibility in their decision-making processes as they gain wider acceptance among the general public. In this paper, we propose an approach for defending against adversarial attacks based on conceptually interpretable techniques. Our approach to model interpretation is on high-level concepts rather than low-level pixel features. Our key finding is that adding small perturbations leads to large changes in the model concept vector tests. Based on this, we design a single image concept vector testing method for detecting adversarial examples. Our experiments on the Imagenet dataset show that our method can achieve an average accuracy of over 95%. We provide source code in the supplementary material.

U2 - 10.1016/j.cose.2024.104218

DO - 10.1016/j.cose.2024.104218

M3 - Journal article

VL - 150

JO - Computers and Security

JF - Computers and Security

SN - 0167-4048

M1 - 104218

ER -
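
Illustrative sketch

The abstract describes detection built on concept activation vectors (CAVs). As a rough orientation only, the sketch below illustrates the generic CAV/TCAV idea the paper builds on: learn a CAV as the unit normal of a linear boundary between "concept" and "random" activations at a chosen layer, then score a single image by the directional derivative of a class logit along that CAV. This is not the authors' detector; the model choice, probe layer, class index, and random stand-in image tensors are assumptions for illustration.

# Minimal sketch of the generic concept-activation-vector (CAV / TCAV) idea
# referred to in the abstract -- NOT the paper's detection method.
# Model, probe layer, class index, and stand-in "images" are assumptions.
import numpy as np
import torch
import torchvision.models as models
from sklearn.linear_model import LogisticRegression


def layer_activations(model, layer, images):
    """Flattened activations of `layer` for a batch of images."""
    acts = []
    handle = layer.register_forward_hook(
        lambda _m, _in, out: acts.append(out.detach().flatten(1)))
    with torch.no_grad():
        model(images)
    handle.remove()
    return torch.cat(acts).numpy()


def learn_cav(model, layer, concept_imgs, random_imgs):
    """A CAV is the unit normal of a linear boundary separating concept
    activations from random activations at the chosen layer."""
    X = np.vstack([layer_activations(model, layer, concept_imgs),
                   layer_activations(model, layer, random_imgs)])
    y = np.array([1] * len(concept_imgs) + [0] * len(random_imgs))
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    cav = clf.coef_[0]
    return cav / np.linalg.norm(cav)


def concept_sensitivity(model, layer, cav, image, class_idx):
    """TCAV-style concept test: directional derivative of the class logit
    along the CAV, evaluated at a single image."""
    saved = {}
    handle = layer.register_forward_hook(
        lambda _m, _in, out: saved.update(a=out))
    logits = model(image.unsqueeze(0))
    handle.remove()
    grad = torch.autograd.grad(logits[0, class_idx], saved["a"])[0]
    return float(grad.flatten() @ torch.tensor(cav, dtype=grad.dtype))


if __name__ == "__main__":
    model = models.resnet50(weights="IMAGENET1K_V1").eval()
    layer = model.layer4                          # assumed probe layer
    concept_imgs = torch.randn(16, 3, 224, 224)   # stand-in for concept images
    random_imgs = torch.randn(16, 3, 224, 224)    # stand-in for random images

    cav = learn_cav(model, layer, concept_imgs, random_imgs)
    score = concept_sensitivity(model, layer, cav, concept_imgs[0], class_idx=0)
    print(f"concept sensitivity: {score:.4f}")

Per the abstract, the paper's key observation is that adversarial perturbations which barely change the pixels shift such concept-test scores substantially, and the single-image concept vector test built on this is reported to reach over 95% average detection accuracy on ImageNet.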