Hubert Baniecki, Przemyslaw Biecek
University of Warsaw, Poland
IJCAI 2023 Workshop on XAI, Macao, SAR
August 31, 2023
I am a PhD student in computer science working on explainable AI (XAI), with interests in adversarial attacks and evaluation protocols.
Disclaimer: Parts of this presentation come from other published work.
Contributions & comments are welcome!
Write me at h.baniecki@uw.edu.pl
Dombrowski et al. Explanations can be manipulated and geometry is to blame. NeurIPS 2019
Slack et al. Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods. AIES 2020
Adversarial ML vs. Explainable AI
For a prediction:
\[ \mathbf{x} \rightarrow \mathbf{x}' \Longrightarrow f(\mathbf{x}) \neq f(\mathbf{x}') \] where \(\rightarrow\) may be an “invisible” data perturbation.
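A minimal sketch of such a perturbation, assuming a toy logistic model with hand-picked weights (the model, the weights, and the FGSM-style sign step are illustrative assumptions, not the attack from any specific paper):

```python
import numpy as np

# Toy differentiable classifier: f(x) = sigmoid(w . x + b).
# Weights are arbitrary values chosen for illustration.
w = np.array([2.0, -3.0])
b = 0.5

def f(x):
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

# FGSM-style step: move each coordinate by eps in the direction
# of the prediction gradient. For f = sigmoid(w.x + b),
# df/dx = f(x) * (1 - f(x)) * w (analytic, no autodiff needed).
x = np.array([0.2, 0.1])
grad = f(x) * (1.0 - f(x)) * w
eps = 0.3
x_adv = x + eps * np.sign(grad)  # ||x_adv - x||_inf = eps

print(f(x), f(x_adv))  # the prediction shifts under a small perturbation
```

Here the perturbation is bounded per coordinate by `eps`, yet the output changes noticeably; in high-dimensional image inputs the same budget can be visually imperceptible.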
What about an explanation?
\[ g(f,\mathbf{x}) \]
\[ g(f,\mathbf{x}) := \mathbf{x} \odot \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} \]
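The input-times-gradient explanation above can be sketched directly; the toy sigmoid model, its weights, and the helper names below are assumptions of this sketch (a real pipeline would use a framework's autodiff instead of the analytic gradient):

```python
import numpy as np

# Toy model f(x) = sigmoid(w . x + b) with an analytic gradient,
# so the sketch needs no autodiff framework.
w = np.array([2.0, -3.0])
b = 0.5

def f(x):
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def grad_f(x):
    s = f(x)
    return s * (1.0 - s) * w  # d sigmoid(w.x + b) / dx

def input_times_gradient(x):
    # g(f, x) := x ⊙ ∂f(x)/∂x  (elementwise product)
    return x * grad_f(x)

x = np.array([0.2, 0.1])
print(input_times_gradient(x))  # one attribution score per feature
```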
\[ g(f,\mathbf{x}) := \mathbf{x} \odot \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} \\ \mathbf{\color{blue} x} \rightarrow \mathbf{\color{red} x'} \Longrightarrow g(f,\mathbf{\color{blue} x}) \neq g(f,\mathbf{\color{red} x'}) \]
\[ g(f,\mathbf{x}) := \mathbf{x} \odot \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} \\ \mathbf{\color{blue} x} \rightarrow \mathbf{\color{red} x'} \Longrightarrow \left\{\begin{array}{@{}l@{}} g(f,\mathbf{\color{blue} x}) \neq g(f,\mathbf{\color{red} x'}) \\ f(\mathbf{\color{blue} x}) \approx f(\mathbf{\color{red} x'}) \\ \end{array}\right. \]
Dombrowski et al. Explanations can be manipulated and geometry is to blame. NeurIPS 2019
How to find \(\mathbf{\color{blue} x} \rightarrow \mathbf{\color{red} x'}\)? An optimization problem.
*differentiable models and explanation methods
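Concretely, this can be cast as minimizing a two-term loss, sketched here in the slide's notation (\(g^{t}\) is my label for a chosen target explanation, and \(\gamma > 0\) balances the two goals):

\[ \min_{\mathbf{\color{red} x'}} \; \left\| g(f,\mathbf{\color{red} x'}) - g^{t} \right\|^2 + \gamma \left\| f(\mathbf{\color{red} x'}) - f(\mathbf{\color{blue} x}) \right\|^2 \]

The first term pushes the explanation toward the target; the second keeps the prediction (approximately) unchanged, matching the two conditions on the previous slide.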
\[ \mathbf{\color{blue} x} \rightarrow \mathbf{\color{red} x'} \Longrightarrow \left\{\begin{array}{@{}l@{}} g(f,\mathbf{\color{blue} x})\;{\color{green} \approx}\; g(f,\mathbf{\color{red} x'}) \\ f(\mathbf{\color{blue} x}) \approx f(\mathbf{\color{red} x'}) \\ \end{array}\right. \]
\(g(f,\mathbf{\color{blue} x}) \neq g(f,\mathbf{\color{red} x'})\;\) but \(\;{\color{green}h}(f,\mathbf{\color{blue} x})\;{\color{green} \approx}\; {\color{green}h}(f,\mathbf{\color{red} x'})\); \(\;{\color{green}k}(f,\mathbf{\color{blue} x})\;{\color{green} \approx}\; {\color{green}k}(f,\mathbf{\color{red} x'})\)
Rieger & Hansen. A simple defense against adversarial attacks on heatmap explanations. ICML WHI 2020
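A defense in this spirit can be sketched as averaging (normalized) attribution maps from several explanation methods; the normalization choice, the function names, and the example values below are assumptions of this sketch, not the exact scheme from the cited paper:

```python
import numpy as np

def normalize(e):
    # Scale a map so attributions sum to 1 in absolute value.
    return e / (np.abs(e).sum() + 1e-12)

def aggregate(explanations):
    # Average the normalized maps; a single manipulated method
    # is diluted by the methods the attack did not target.
    return np.mean([normalize(e) for e in explanations], axis=0)

g = np.array([0.9, -0.1, 0.2])   # hypothetical manipulated map
h = np.array([0.1, 0.5, 0.3])    # hypothetical maps from other methods
k = np.array([0.2, 0.4, 0.35])

print(aggregate([g, h, k]))
```

The intuition: an attack optimized against \(g\) alone leaves \(h\) and \(k\) roughly intact, so the average stays closer to the honest explanation.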
\(g(f,\mathbf{\color{blue} x}) \neq g(f,\mathbf{\color{red} x'})\;\) but \(\;g({\color{green}f'},\mathbf{\color{blue} x})\;{\color{green} \approx}\; g({\color{green}f'},\mathbf{\color{red} x'})\)
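One such model-side change proposed by Dombrowski et al. is \(\beta\)-smoothing: replace ReLU with a softplus when computing explanations. A numpy sketch of the activation swap (the plotting-free setup and parameter values are assumptions):

```python
import numpy as np

# softplus_beta(z) = log(1 + exp(beta * z)) / beta
# Large beta approaches ReLU; small beta smooths the gradient,
# flattening the sharp geometry that manipulation attacks exploit.

def softplus(z, beta=1.0):
    return np.log1p(np.exp(beta * z)) / beta

def softplus_grad(z, beta=1.0):
    return 1.0 / (1.0 + np.exp(-beta * z))  # sigmoid(beta * z)

z = np.linspace(-2.0, 2.0, 5)
print(softplus(z, beta=10.0))    # already close to relu(z)
print(softplus_grad(z, beta=1.0))  # smooth, unlike ReLU's step gradient
```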
A chain of works improving the robustness of explanations.
Extended version (to appear on arXiv).
Next: submit the extended version to a journal.
Hubert Baniecki – AdvXAI – slides: hbaniecki.com/ijcai2023 – lab: mi2.ai