Hubert Baniecki, Przemyslaw Biecek
University of Warsaw, Poland
IJCAI 2023 Workshop on XAI, Macao, SAR
August 31, 2023
This work was financially supported by the Polish National Science Centre grant number 2021/43/O/ST6/00347.
I am a PhD student in computer science / XAI, interested in adversarial attacks and evaluation protocols.
Disclaimer: Parts of this presentation come from other published work.
Contributions and comments are welcome!
Write to me at h.baniecki@uw.edu.pl
Dombrowski et al. Explanations can be manipulated and geometry is to blame. NeurIPS 2019
Slack et al. Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods. AIES 2020
Adversarial ML vs. Explainable AI
For a prediction:
\[ \mathbf{x} \rightarrow \mathbf{x}' \Longrightarrow f(\mathbf{x}) \neq f(\mathbf{x}') \] where \(\rightarrow\) may be an “invisible” data perturbation.
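For intuition, a minimal FGSM-style sketch of such a perturbation in PyTorch (the names `model`, `x`, `y`, and `eps` are illustrative, not from the cited works):

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, eps=0.01):
    """One-step adversarial perturbation: nudge x so that f(x) changes."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)  # classification loss at the true label
    loss.backward()
    # A small step in the sign of the gradient increases the loss,
    # while for small eps the perturbation stays visually "invisible".
    return (x + eps * x.grad.sign()).detach()
```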
What about an explanation?
Consider a gradient-based explanation, e.g. input \(\odot\) gradient:
\[ g(f,\mathbf{x}) := \mathbf{x} \odot \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} \]
An adversarial attack on the explanation finds a perturbation \(\mathbf{\color{blue} x} \rightarrow \mathbf{\color{red} x'}\) such that
\[ \mathbf{\color{blue} x} \rightarrow \mathbf{\color{red} x'} \Longrightarrow \left\{\begin{array}{@{}l@{}} g(f,\mathbf{\color{blue} x}) \neq g(f,\mathbf{\color{red} x'}) \\ f(\mathbf{\color{blue} x}) \approx f(\mathbf{\color{red} x'}) \end{array}\right. \]
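A sketch of this input \(\odot\) gradient explanation in PyTorch, assuming `model(x)` returns class scores and `target` selects the explained class (illustrative names):

```python
import torch

def input_times_gradient(model, x, target):
    """g(f, x) := x ⊙ ∂f(x)/∂x, for the score of the target class."""
    x = x.clone().detach().requires_grad_(True)
    score = model(x)[..., target].sum()   # scalar f(x) for the target class
    grad, = torch.autograd.grad(score, x)
    return x.detach() * grad              # element-wise product
```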
Dombrowski et al. Explanations can be manipulated and geometry is to blame. NeurIPS 2019
How to find \(\mathbf{\color{blue} x} \rightarrow \mathbf{\color{red} x'}\)? By solving an optimization problem.*
*assuming differentiable models and explanation methods
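Under these assumptions, the attack can be sketched as minimizing a two-term loss: push the explanation toward a chosen target heatmap `g_target` while keeping the prediction close to the original. A minimal PyTorch sketch (the step count, learning rate, and trade-off weight `gamma` are illustrative, not the settings from the paper):

```python
import torch

def explain_diff(model, x, target):
    """Differentiable input × gradient; create_graph keeps second-order grads."""
    score = model(x)[..., target].sum()
    grad, = torch.autograd.grad(score, x, create_graph=True)
    return x * grad

def manipulate(model, x, target, g_target, steps=500, lr=1e-3, gamma=1.0):
    """Find x' with g(f, x') ≈ g_target while f(x') ≈ f(x)."""
    f_x = model(x).detach()
    x_adv = x.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x_adv], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((explain_diff(model, x_adv, target) - g_target) ** 2).sum() \
               + gamma * ((model(x_adv) - f_x) ** 2).sum()
        loss.backward()
        opt.step()
    return x_adv.detach()
```

In practice the perturbation is additionally constrained to the valid input range.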
How to defend? We want explanations robust to such perturbations:
\[ \mathbf{\color{blue} x} \rightarrow \mathbf{\color{red} x'} \Longrightarrow \left\{\begin{array}{@{}l@{}} g(f,\mathbf{\color{blue} x})\;{\color{green} \approx}\; g(f,\mathbf{\color{red} x'}) \\ f(\mathbf{\color{blue} x}) \approx f(\mathbf{\color{red} x'}) \end{array}\right. \]
One option: aggregate multiple explanation methods — even when \(g\) is manipulated, \({\color{green}h}\) and \({\color{green}k}\) remain stable:
\(g(f,\mathbf{\color{blue} x}) \neq g(f,\mathbf{\color{red} x'})\;\) but \(\;{\color{green}h}(f,\mathbf{\color{blue} x})\;{\color{green} \approx}\; {\color{green}h}(f,\mathbf{\color{red} x'})\); \(\;{\color{green}k}(f,\mathbf{\color{blue} x})\;{\color{green} \approx}\; {\color{green}k}(f,\mathbf{\color{red} x'})\)
Rieger & Hansen. A simple defense against adversarial attacks on heatmap explanations. ICML WHI 2020
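A minimal sketch of that idea: average heatmaps from several explanation methods, so a perturbation crafted against one method is dampened by the others (`methods` is an illustrative list of functions mapping `(model, x)` to a heatmap, e.g. `[g, h, k]`):

```python
import torch

def aggregate_explanations(methods, model, x):
    """Average heatmaps over explanation methods g, h, k, ..."""
    heatmaps = [m(model, x) for m in methods]
    return torch.stack(heatmaps).mean(dim=0)
```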
Another option: smooth the model itself, replacing \(f\) with \({\color{green}f'}\):
\(g(f,\mathbf{\color{blue} x}) \neq g(f,\mathbf{\color{red} x'})\;\) but \(\;g({\color{green}f'},\mathbf{\color{blue} x})\;{\color{green} \approx}\; g({\color{green}f'},\mathbf{\color{red} x'})\)
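One way to obtain such an \({\color{green}f'}\), following the β-smoothing idea of Dombrowski et al., is to replace ReLU activations with softplus, which approaches ReLU as β grows while smoothing the geometry the attack exploits. A sketch for a PyTorch model (`beta` is a hyperparameter):

```python
import torch

def smooth_model(model, beta=10.0):
    """Return f': the same network with ReLU swapped for Softplus(beta)."""
    for name, module in list(model.named_children()):
        if isinstance(module, torch.nn.ReLU):
            setattr(model, name, torch.nn.Softplus(beta=beta))
        else:
            smooth_model(module, beta)  # recurse into submodules
    return model
```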
A chain of works improving the robustness of explanations.
Extended version (to appear on arXiv):
Next: submit the extended version to a journal.
Hubert Baniecki – AdvXAI – slides: hbaniecki.com/ijcai2023 – lab: mi2.ai