Adversarial Attacks and Defenses in XAI: A Survey


 Hubert Baniecki, Przemyslaw Biecek


University of Warsaw, Poland


 IJCAI 2023 Workshop on XAI, Macao, SAR

August 31, 2023

Prologue

I am a PhD student in computer science/XAI interested in adversarial attacks and evaluation protocols.

Disclaimer: Parts of this presentation come from other published work.

Contributions & comments are welcome!

Write to me at h.baniecki@uw.edu.pl

Why a survey paper?

Dombrowski et al. Explanations can be manipulated and geometry is to blame. NeurIPS 2019

Slack et al. Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods. AIES 2020

Adversarial ML vs. Explainable AI

Attack: adversarial example

For a prediction:

\[ \mathbf{x} \rightarrow \mathbf{x}' \Longrightarrow f(\mathbf{x}) \neq f(\mathbf{x}') \] where \(\rightarrow\) may be an “invisible” data perturbation.
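As a concrete illustration, here is a minimal PyTorch sketch of the classic fast gradient sign method (FGSM) for crafting such a perturbation; the classifier `f`, input `x`, label `y`, and budget `eps` are assumptions of the sketch, not objects from the cited works.

```python
import torch
import torch.nn.functional as F

def fgsm_example(f, x, y, eps=0.01):
    """Craft x' = x + eps * sign(grad_x loss) so that f(x') != f(x),
    while the perturbation stays "invisible" (||x' - x||_inf <= eps).

    f   -- a differentiable classifier (e.g., a torch.nn.Module)
    x   -- input tensor of shape (1, ...)
    y   -- true label tensor of shape (1,)
    eps -- perturbation budget (hypothetical default)
    """
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(f(x), y)
    loss.backward()
    # One signed gradient step increases the loss, often flipping the
    # predicted class for small eps.
    return (x + eps * x.grad.sign()).detach()
```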

What about an explanation?

For an explanation, e.g. gradient \(\times\) input,

\[ g(f,\mathbf{x}) := \mathbf{x} \odot \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}, \]

the attack perturbs the input so that the explanation changes while the prediction stays (approximately) the same:

\[ \mathbf{\color{blue} x} \rightarrow \mathbf{\color{red} x'} \Longrightarrow \left\{\begin{array}{@{}l@{}} g(f,\mathbf{\color{blue} x}) \neq g(f,\mathbf{\color{red} x'}) \\ f(\mathbf{\color{blue} x}) \approx f(\mathbf{\color{red} x'}) \end{array}\right. \]

Dombrowski et al. Explanations can be manipulated and geometry is to blame. NeurIPS 2019

Attack: adversarial example (cont.)

How to find \(\mathbf{\color{blue} x} \rightarrow \mathbf{\color{red} x'}\)? An optimization problem.

  • For neural networks*: use gradients (a sketch follows below)
  • For black-box models with model-agnostic explanations:
    use genetic algorithms

*i.e. differentiable models and explanation methods
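A minimal PyTorch sketch of the gradient-based search, assuming a differentiable classifier `f`, the gradient \(\times\) input explanation defined earlier, an attacker-chosen target heatmap `g_target`, and illustrative hyper-parameters (`steps`, `lr`, `gamma`, `eps`). It is a simplified stand-in for, not a reproduction of, the procedure of Dombrowski et al., which additionally smooths the model (e.g., softplus activations).

```python
import torch

def gradient_x_input(f, x, target_class):
    """Saliency map g(f, x) = x * d f_c(x) / dx for the chosen class c."""
    score = f(x)[:, target_class].sum()
    grad, = torch.autograd.grad(score, x, create_graph=True)
    return x * grad

def manipulate_explanation(f, x, g_target, target_class,
                           steps=200, lr=1e-3, gamma=1e2, eps=0.05):
    """Search for x' such that g(f, x') ~ g_target while f(x') ~ f(x)."""
    x_adv = x.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x_adv], lr=lr)
    with torch.no_grad():
        f_x = f(x)  # reference prediction to keep (approximately) fixed
    for _ in range(steps):
        opt.zero_grad()
        g_adv = gradient_x_input(f, x_adv, target_class)
        # Explanation loss pulls the heatmap towards the target;
        # prediction loss keeps the model output unchanged.
        loss = ((g_adv - g_target) ** 2).sum() \
            + gamma * ((f(x_adv) - f_x) ** 2).sum()
        loss.backward()
        opt.step()
        with torch.no_grad():  # keep the perturbation visually small
            x_adv.copy_(torch.min(torch.max(x_adv, x - eps), x + eps))
    return x_adv.detach()
```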

Defense? Prevention

\[ \mathbf{\color{blue} x} \rightarrow \mathbf{\color{red} x'} \Longrightarrow \left\{\begin{array}{@{}l@{}} g(f,\mathbf{\color{blue} x})\;{\color{green} \approx}\; g(f,\mathbf{\color{red} x'}) \\ f(\mathbf{\color{blue} x}) \approx f(\mathbf{\color{red} x'}) \\ \end{array}\right. \]

  1. explanation aggregation
    \(g(f,\mathbf{\color{blue} x}) \neq g(f,\mathbf{\color{red} x'})\;\) but \(\;{\color{green}h}(f,\mathbf{\color{blue} x})\;{\color{green} \approx}\; {\color{green}h}(f,\mathbf{\color{red} x'})\)
  2. model regularization
    \(g(f,\mathbf{\color{blue} x}) \neq g(f,\mathbf{\color{red} x'})\;\) but \(\;g({\color{green}f'},\mathbf{\color{blue} x})\;{\color{green} \approx}\; g({\color{green}f'},\mathbf{\color{red} x'})\)
  3. robustness, stability, uncertainty, …

Defense: explanation aggregation

\(g(f,\mathbf{\color{blue} x}) \neq g(f,\mathbf{\color{red} x'})\;\) but \(\;{\color{green}h}(f,\mathbf{\color{blue} x})\;{\color{green} \approx}\; {\color{green}h}(f,\mathbf{\color{red} x'})\); \(\;{\color{green}k}(f,\mathbf{\color{blue} x})\;{\color{green} \approx}\; {\color{green}k}(f,\mathbf{\color{red} x'})\)

Rieger & Hansen. A simple defense against adversarial attacks on heatmap explanations. ICML WHI 2020
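A minimal PyTorch sketch of the aggregation idea, assuming a differentiable classifier `f`; the heatmaps computed here (gradient, gradient \(\times\) input, SmoothGrad) are simplified stand-ins for the explanation methods aggregated in the cited work, and all hyper-parameters are illustrative.

```python
import torch

def saliency_maps(f, x, target_class, n_noise=25, sigma=0.1):
    """Compute several heatmap explanations of the same prediction."""
    def grad_wrt_input(inp):
        inp = inp.clone().detach().requires_grad_(True)
        score = f(inp)[:, target_class].sum()
        g, = torch.autograd.grad(score, inp)
        return g

    g = grad_wrt_input(x)
    maps = {"gradient": g, "gradient_x_input": x * g}
    # SmoothGrad: average the gradient over Gaussian-perturbed copies of x.
    noisy = [grad_wrt_input(x + sigma * torch.randn_like(x)) for _ in range(n_noise)]
    maps["smoothgrad"] = torch.stack(noisy).mean(dim=0)
    return maps

def aggregate(maps):
    """h(f, x): normalize each heatmap to [0, 1] and average them,
    so manipulating any single explanation method has limited influence."""
    normalized = []
    for m in maps.values():
        m = m.abs()
        m = (m - m.min()) / (m.max() - m.min() + 1e-12)
        normalized.append(m)
    return torch.stack(normalized).mean(dim=0)

# Usage sketch: h = aggregate(saliency_maps(model, x, target_class=predicted_class))
```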

Defense: model regularization

\(g(f,\mathbf{\color{blue} x}) \neq g(f,\mathbf{\color{red} x'})\;\) but \(\;g({\color{green}f'},\mathbf{\color{blue} x})\;{\color{green} \approx}\; g({\color{green}f'},\mathbf{\color{red} x'})\)

A chain of works improves the robustness of explanations by regularizing the model during training.
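One illustrative regularizer, sketched in PyTorch under the assumption of a differentiable classifier `f` and labels `y` (it is not any specific published method; those use, e.g., curvature penalties, softplus activations, or weight decay): penalize how much the saliency map changes under small input perturbations and add this term to the training loss.

```python
import torch

def explanation_smoothness_penalty(f, x, y, sigma=0.05):
    """Penalize how much the saliency map d f_y(x) / dx changes under a
    small random perturbation of x, encouraging g(f', x) ~ g(f', x')
    for nearby inputs. Illustrative regularizer only."""
    x = x.clone().detach().requires_grad_(True)
    x_pert = (x + sigma * torch.randn_like(x)).detach().requires_grad_(True)

    score = f(x).gather(1, y.view(-1, 1)).sum()
    score_pert = f(x_pert).gather(1, y.view(-1, 1)).sum()

    # create_graph=True makes the penalty differentiable w.r.t. the weights.
    g, = torch.autograd.grad(score, x, create_graph=True)
    g_pert, = torch.autograd.grad(score_pert, x_pert, create_graph=True)
    return ((g - g_pert) ** 2).mean()

# Training sketch:
# loss = F.cross_entropy(f(x), y) + lam * explanation_smoothness_penalty(f, x, y)
```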

Survey: systematization, research gaps, future work

Systematization of the attacks

Extended version (to appear on arXiv):

Insecurities in Explainable AI

Future directions in Adversarial XAI

  • Attacks on: more recent explanation algorithms, explainable
    “by design” machine learning models, and bypassing existing defenses (attack\(^2\)).
  • Defenses: prevent current insecurities in XAI, improve the robustness of explanation algorithms, update safety & evaluation protocols.
  • AdvXAI beyond the image and tabular data modalities.
  • AdvXAI beyond classical models towards transformers.
  • Ethics, impact on society, and law concerning AdvXAI.
  • (No) Software, datasets and benchmarks.

GitHub list since 2020

Next: submit the extended version to a journal.

References

  • H. Baniecki, P. Biecek. Adversarial Attacks and Defenses in Explainable Artificial Intelligence: A Survey. arXiv preprint arXiv:2306.06123.
  • Dombrowski et al. Explanations can be manipulated and geometry is to blame. NeurIPS 2019
  • Slack et al. Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods. AIES 2020
  • Rieger & Hansen. A simple defense against adversarial attacks on heatmap explanations. ICML WHI 2020