Interpretable adversarial example detection via high-level concept activation vector

Computing and Communications

Associated organisational unit

Insight

Electronic data

COSE-D-24-02355
Accepted author manuscript, 1.61 MB, PDF document
Available under license: CC BY: Creative Commons Attribution 4.0 International License

Text available via DOI:

https://doi.org/10.1016/j.cose.2024.104218
Final published version

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Jiaxing Li
Yu-an Tan
Xinyu Liu
Weizhi Meng
Yuanzhang Li

More...

Article number	104218
Journal publication date	31/03/2025
Journal	Computers and Security
Volume	150
Publication Status	Published
Early online date	30/11/2024
Original language	English

Abstract

Deep neural networks have achieved strong performance on many tasks. However, they are easily fooled by small perturbations added to the input, and for image data such perturbations are usually imperceptible to humans. The uninterpretable nature of deep learning systems is considered one of the reasons they are vulnerable to adversarial attacks. As artificial intelligence systems gain wider acceptance among the general public, it is crucial that their decision-making processes be transparent, reliable, and comprehensible to humans. In this paper, we propose an approach for defending against adversarial attacks based on concept-level interpretability techniques. Our approach interprets the model in terms of high-level concepts rather than low-level pixel features. Our key finding is that adding small perturbations leads to large changes in the results of the model's concept vector tests. Based on this, we design a single-image concept vector testing method for detecting adversarial examples. Our experiments on the ImageNet dataset show that our method achieves an average accuracy of over 95%. We provide source code in the supplementary material.
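
The detection idea summarised in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how concept vectors are built or tested, so everything below is an assumption. The concept vector is approximated by a difference of mean activations (concept activation vectors are more commonly fitted with a linear classifier), the per-image gradient of the class logit with respect to the layer activations is assumed to come from the model's autograd, and the function names, the reference profile, and the threshold are hypothetical.

```python
import numpy as np


def concept_activation_vector(concept_acts, random_acts):
    """Simplified concept vector (an assumption, not the paper's method).

    Returns the unit vector pointing from the mean activation of random
    examples to the mean activation of concept examples at one layer.
    """
    v = np.mean(concept_acts, axis=0) - np.mean(random_acts, axis=0)
    return v / np.linalg.norm(v)


def concept_sensitivities(grad, cavs):
    """Directional derivative of the class logit along each concept vector.

    grad: gradient of the logit w.r.t. the layer activations for ONE image,
          shape (d,). Assumed to be supplied by the model's autograd.
    cavs: one concept vector per row, shape (k, d).
    """
    return np.asarray(cavs) @ np.asarray(grad)


def is_suspicious(sens, reference_sens, threshold):
    """Flag an image whose concept sensitivities deviate from a reference
    profile (e.g. the mean over clean images of the predicted class) by
    more than a hypothetical, user-chosen threshold."""
    diff = np.asarray(sens) - np.asarray(reference_sens)
    return bool(np.linalg.norm(diff) > threshold)
```

With toy numbers, an image whose sensitivities are `[0.5, -2.0]` against a clean reference of `[0.5, 0.0]` would be flagged at a threshold of `1.0`, reflecting the abstract's observation that small input perturbations can produce large changes in concept vector tests.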
