Electronic data

  • COSE-D-24-02355

    Accepted author manuscript, 1.61 MB, PDF document

    Available under license: CC BY: Creative Commons Attribution 4.0 International License

Interpretable adversarial example detection via high-level concept activation vector

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published
Article number: 104218
Journal publication date: 31/03/2025
Journal: Computers and Security
Volume: 150
Publication status: Published
Early online date: 30/11/2024
Original language: English

Abstract

Deep neural networks have achieved impressive performance in many tasks. However, they are easily fooled by small perturbations added to the input, and for image data such perturbations are usually imperceptible to humans. The uninterpretable nature of deep learning systems is considered one of the reasons they are vulnerable to adversarial attacks. As artificial intelligence systems gain wider acceptance among the general public, transparent, reliable, and human-comprehensible decision-making is crucial for trust and confidence. In this paper, we propose an approach for defending against adversarial attacks based on concept-based interpretability techniques. Our approach interprets the model in terms of high-level concepts rather than low-level pixel features. Our key finding is that adding small perturbations leads to large changes in the model's concept vector test results. Based on this, we design a single-image concept vector testing method for detecting adversarial examples. Our experiments on the ImageNet dataset show that our method achieves an average accuracy of over 95%. We provide source code in the supplementary material.