
AdvBox

AdvBox offers a number of AI model security toolkits. AdversarialBox allows zero-coding generation of adversarial examples for a wide range of neural network frameworks. An overview of the supported attacks and defenses can be found here and the corresponding code here. It requires some effort to find all attacks mentioned on the homepage in the code base. Generally speaking, the documentation of AdvBox is incomplete and not very user-friendly. ODD (Object Detector Deception) showcases a specific attack on object detection networks such as YOLO, but is not mentioned in the README. Read more...

Alibi

Alibi is an open-source Python library that supports various interpretability techniques and a broad array of explanation types. The README already provides an overview of the supported methods and when they are applicable. The following table with supported methods is copied from the README (slightly abbreviated):

| Method | Models | Explanations | Classification | Regression | Tabular | Text | Images | Categorical features |
|---|---|---|---|---|---|---|---|---|
| ALE | BB | global | ✔ | ✔ | ✔ | | | |
| Anchors | BB | local | ✔ | | ✔ | ✔ | ✔ | ✔ |
| CEM | BB* TF/Keras | local | ✔ | | ✔ | | ✔ | |
| Counterfactuals | BB* TF/Keras | local | ✔ | | ✔ | | ✔ | |
| Prototype Counterfactuals | BB* TF/Keras | local | ✔ | | ✔ | | ✔ | ✔ |
| Integrated Gradients | TF/Keras | local | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
| Kernel SHAP | BB | local, global | ✔ | ✔ | ✔ | | | ✔ |
| Tree SHAP | WB | local, global | ✔ | ✔ | ✔ | | | ✔ |

The README also explains the keys: Read more...
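
To give a feel for the API, here is a minimal sketch (not taken from the README) of running the Anchors explainer on a scikit-learn classifier, assuming a recent alibi version and using the iris data purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from alibi.explainers import AnchorTabular

data = load_iris()
X, y, feature_names = data.data, data.target, data.feature_names
clf = RandomForestClassifier(random_state=0).fit(X, y)

# AnchorTabular only needs black-box access to a prediction function.
explainer = AnchorTabular(clf.predict, feature_names=feature_names)
explainer.fit(X)

# Explain one instance: which feature conditions "anchor" this prediction?
explanation = explainer.explain(X[0], threshold=0.95)
print("Anchor:", " AND ".join(explanation.anchor))
print("Precision:", explanation.precision)
```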

ART: Adversarial Robustness 360 Toolbox

The Adversarial Robustness Toolbox (ART) is the first comprehensive toolbox that unifies many defensive techniques for four categories of adversarial attacks on machine learning models. These categories are model evasion, model poisoning, model extraction and inference (e.g. inferring sensitive attributes of the training data, or determining whether an example was part of the training data). ART supports all popular machine learning frameworks, all data types and all machine learning tasks. Read more...
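
As an impression of the unified API, the sketch below wraps a scikit-learn classifier in an ART estimator and runs an evasion attack against it (a minimal, hedged example; the choice of model, attack and eps is mine, not from the ART documentation):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from art.estimators.classification import SklearnClassifier
from art.attacks.evasion import FastGradientMethod

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Wrap the fitted model in an ART estimator, then run an evasion attack on it.
classifier = SklearnClassifier(model=model, clip_values=(X.min(), X.max()))
attack = FastGradientMethod(estimator=classifier, eps=0.5)
X_adv = attack.generate(x=X.astype(np.float32))

print("Accuracy on adversarial examples:", model.score(X_adv, y))
```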

Captum

Captum is a model interpretability library specifically for PyTorch. It is actively maintained at the moment of writing and supports an extensive array of interpretability methods. The Captum website also offers a large range of hands-on tutorials for various use cases. Supported interpretability methods: Captum supports a very extensive list of interpretability algorithms. All paper references for each of the supported methods are listed in the README, so they will not be repeated here. Read more...
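
A minimal sketch of the typical workflow (the toy model and inputs are placeholders): construct an attribution method around a PyTorch module and call attribute on an input batch.

```python
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

# Toy model standing in for any trained PyTorch module.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
model.eval()

inputs = torch.rand(2, 4, requires_grad=True)

# Attribute the score of target class 0 to each input feature.
ig = IntegratedGradients(model)
attributions, delta = ig.attribute(inputs, target=0, return_convergence_delta=True)
print(attributions)
print("Convergence delta:", delta)
```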

CleverHans

CleverHans is a Python library with the main purpose of providing good reference implementations of attacks for benchmarking machine learning models against adversarial examples. The main maintainers of this library are Ian Goodfellow and Nicolas Papernot. Attacks (i.e. methods for generating adversarial examples) are listed under /cleverhans, and each of the supported frameworks has its own folder with attack implementations. CleverHans also aims to implement a set of defenses, but this is still work in progress (at the time of writing there is only a defense implementation for PyTorch). Read more...
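
As an illustration of how such a reference attack is invoked, here is a sketch assuming the CleverHans 4.x PyTorch API (the untrained toy model is a placeholder for a real classifier):

```python
import numpy as np
import torch
import torch.nn as nn
from cleverhans.torch.attacks.fast_gradient_method import fast_gradient_method

# Toy classifier standing in for a trained PyTorch model.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
x = torch.rand(8, 1, 28, 28)

# Generate adversarial examples with an L-infinity bounded FGSM perturbation.
x_adv = fast_gradient_method(model, x, eps=0.3, norm=np.inf)
print("Max perturbation:", (x_adv - x).abs().max().item())  # bounded by eps
```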

Debiaswe: try to make word embeddings less sexist

Word embeddings are a widely used representation for text data. A well-known example in natural language processing (NLP) is Word2vec, which uses a neural network to learn latent vector representations of words. It turns out that relations in this latent vector space capture semantic relations quite well. For example, by finding similar vectors you typically end up with highly related or synonymous words. Another typical example is that when you take the vector of “king”, subtract the vector of “man” and add the vector of “woman”, you end up close to the vector corresponding to “queen”, so even a form of arithmetic on concepts is possible. Read more...
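
The analogy, and the kind of gender bias that Debiaswe tries to remove, is easy to reproduce with pretrained vectors; a small sketch using gensim's downloader (the GloVe model name is an assumption, any pretrained word vectors will do):

```python
import gensim.downloader as api

# Small pretrained GloVe vectors, fetched via gensim's downloader.
vectors = api.load("glove-wiki-gigaword-50")

# The classic analogy: king - man + woman ~= queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# The kind of bias Debiaswe targets: analogies along the he/she direction.
print(vectors.most_similar(positive=["doctor", "she"], negative=["he"], topn=3))
```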

DeepExplain

The DeepExplain Python package for TensorFlow models, and Keras models with a TensorFlow backend, offers two types of interpretability methods for deep convolutional neural networks: gradient-based methods and perturbation-based methods. The package does not seem to be very actively maintained anymore and support for TensorFlow V2 is limited. Attributions: the README gives the following clear and succinct explanation of what an “attribution” is. All methods included in this approach allow visualization of how each input feature contributes to the final prediction, in terms of what a particular targeted neuron “sees”: Read more...
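
For reference, usage follows the context-manager pattern from the README; a hedged sketch for a Keras model on the TF1 backend, where model, xs (input samples) and ys (one-hot targets) are placeholders:

```python
from keras import backend as K
from keras.models import Model
from deepexplain.tensorflow import DeepExplain

# Assumption: `model` is a trained Keras (TF1 backend) classifier and
# `xs`, `ys` are input samples and one-hot target vectors.
with DeepExplain(session=K.get_session()) as de:
    input_tensor = model.layers[0].input
    fModel = Model(inputs=input_tensor, outputs=model.layers[-1].output)
    target_tensor = fModel(input_tensor)

    # 'grad*input' is one of the gradient-based methods; perturbation-based
    # methods such as 'occlusion' are selected with the same call.
    attributions = de.explain('grad*input', target_tensor * ys, input_tensor, xs)
```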

DeepLIFT

A brief explanation of the gradient-based interpretability method called DeepLIFT is given by Shrikumar et al. in the abstract of the linked paper: DeepLIFT (Deep Learning Important FeaTures), a method for decomposing the output prediction of a neural network on a specific input by backpropagating the contributions of all neurons in the network to every feature of the input. DeepLIFT compares the activation of each neuron to its ‘reference activation’ and assigns contribution scores according to the difference. Read more...
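
The bookkeeping property behind these contribution scores (called “summation-to-delta” in the paper) can be stated compactly, with t the activation of the target neuron, t^0 its reference activation, x_i the inputs and C_{Δx_i Δt} the contribution score assigned to x_i:

```latex
% Summation-to-delta: the contribution scores of all inputs add up exactly
% to the difference of the output from its reference activation.
\sum_{i=1}^{n} C_{\Delta x_i \Delta t} = \Delta t,
\qquad \Delta t = t - t^{0}, \quad \Delta x_i = x_i - x_i^{0}
```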

DiCE: Diverse Counterfactual Explanations

From the README: DiCE implements counterfactual (CF) explanations that provide this information by showing feature-perturbed versions of the same person who would have received the loan, e.g., you would have received the loan if your income was higher by $10,000. In other words, it provides “what-if” explanations for model output and can be a useful complement to other explanation methods, both for end-users and model developers. A main innovation of DiCE is that it implements a method to make producing counterfactual examples more model-agnostic: Read more...
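
A hedged sketch of the dice_ml workflow, using an invented toy loan table (column names and values are placeholders) and the model-agnostic "random" method:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import dice_ml

# Invented toy loan data, purely for illustration.
df = pd.DataFrame({
    "income": [30, 45, 60, 80, 25, 90, 55, 70],
    "age":    [22, 35, 45, 52, 23, 60, 40, 48],
    "loan":   [0, 0, 1, 1, 0, 1, 1, 1],
})
model = RandomForestClassifier(random_state=0).fit(df[["income", "age"]], df["loan"])

# Wrap data and model, then ask for counterfactuals that flip the prediction.
d = dice_ml.Data(dataframe=df, continuous_features=["income", "age"], outcome_name="loan")
m = dice_ml.Model(model=model, backend="sklearn")
explainer = dice_ml.Dice(d, m, method="random")

cf = explainer.generate_counterfactuals(
    df[["income", "age"]].iloc[[0]], total_CFs=3, desired_class="opposite")
cf.visualize_as_dataframe(show_only_changes=True)
```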

ELI5

ELI5 (“Explain Like I’m 5”) provides model-specific support for scikit-learn models, the lightning library, and decision tree ensembles from the xgboost, LightGBM and CatBoost libraries. ELI5 mainly provides convenient wrappers that couple the feature importance coefficients these libraries already provide with feature names, as well as convenient ways to visualize importances, e.g. by highlighting words in a text. For Keras image classifiers an implementation of the gradient-based Grad-CAM visualization is offered, but the TensorFlow V2 backend is not supported. Read more...
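
A small sketch of those wrappers for a scikit-learn text classifier (the dataset choice is mine, purely for illustration):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
import eli5

train = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])
vec = CountVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(train.data), train.target)

# Global view: model coefficients coupled to the vectorizer's feature (word) names.
print(eli5.format_as_text(eli5.explain_weights(clf, vec=vec, top=10)))

# Local view: per-document explanation, with words weighted by their contribution.
print(eli5.format_as_text(eli5.explain_prediction(clf, train.data[0], vec=vec)))
```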

Fairness in Classification

The not-so-originally named “fairness in classification” repository provides a Python implementation of three fairness constraints for logistic regression: (1) disparate impact: similar acceptance rates for different demographic groups, see Zafar et al., 2017a; (2) disparate mistreatment: similar misclassification rates for different demographic groups, see Zafar et al., 2017b; (3) preference-based fairness (as opposed to parity-based fairness): a more game-theoretic approach where decision boundaries are chosen such that it can be shown that each group, if rational, prefers its own decision boundary. Read more...
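
To make the first two constraints concrete (notation is mine, not the repository's): with sensitive attribute z ∈ {0, 1}, predicted label ŷ and true label y, they roughly require

```latex
% Disparate impact: (approximately) equal acceptance rates across groups
P(\hat{y} = 1 \mid z = 0) \approx P(\hat{y} = 1 \mid z = 1)

% Disparate mistreatment: (approximately) equal misclassification rates across groups
P(\hat{y} \neq y \mid z = 0) \approx P(\hat{y} \neq y \mid z = 1)
```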

Foolbox

Foolbox is a comprehensive adversarial library for attacking machine learning models, with a focus on neural networks in computer vision. At the moment of writing, Foolbox contains 41 gradient-based and decision-based adversarial attacks, making it the second-largest adversarial attack library after ART. A notable difference from ART is that Foolbox only contains attacks, and no defenses or evaluation metrics. The library is very user-friendly, with a clear API and documentation. Read more...
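
A short sketch of the Foolbox 3.x API with a pretrained torchvision model (the choice of model, attack and epsilon is mine):

```python
import torchvision.models as models
import foolbox as fb

# Pretrained ImageNet classifier wrapped for Foolbox.
model = models.resnet18(pretrained=True).eval()
preprocessing = dict(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225], axis=-3)
fmodel = fb.PyTorchModel(model, bounds=(0, 1), preprocessing=preprocessing)

# A few sample images and labels shipped with Foolbox for quick experiments.
images, labels = fb.utils.samples(fmodel, dataset="imagenet", batchsize=4)

# Run a gradient-based attack under an L-infinity budget.
attack = fb.attacks.LinfPGD()
raw, clipped, success = attack(fmodel, images, labels, epsilons=0.03)
print("Attack success per sample:", success)
```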

Interpret-Text

Interpret-Text is an extension of InterpretML, specifically for several text models. Three modules are provided: ClassicalTextExplainer, UnifiedInformationExplainer and IntrospectiveRationaleExplainer. Classical Text Explainer: the ClassicalTextExplainer supports linear models from sklearn that expose a coef_ attribute and tree-based models for which feature_importances_ is defined. ClassicalTextExplainer includes an NLP pipeline from preprocessing to hyperparameter tuning, so it accepts raw text data as input. The default pipeline uses a unigram bag-of-words model. Elements of the pipeline can be replaced if desired. Read more...

InterpretML

The InterpretML toolkit, developed at Microsoft, can be decomposed into two major components: (1) a set of interpretable “glassbox” models and (2) techniques for explaining black-box systems. Regarding (1), InterpretML notably contains a new interpretable “glassbox” model, called the Explainable Boosting Machine, which combines Generalized Additive Models (GAMs) with machine learning techniques such as gradient-boosted trees. Other than this new interpretable model, the main utility of InterpretML is to unify existing explainability techniques under a single API. Read more...
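
A minimal sketch of training an Explainable Boosting Machine and inspecting it (the dataset is chosen just for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The EBM learns one shape function per feature (plus optional pairwise
# interactions), so the final model stays additive and hence inspectable.
ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)

# Global explanation: per-feature shape functions and importances.
show(ebm.explain_global())

# Local explanation: per-feature contributions for individual predictions.
show(ebm.explain_local(X_test[:5], y_test[:5]))
```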

OpenMined (PySyft)

The OpenMined community is a collaboration between several organizations and communities, including those behind TensorFlow, PyTorch and Keras, to create an open-source ecosystem of privacy tools that extend libraries such as PyTorch with cryptographic techniques and differential privacy. The aim is to contribute to the adoption of privacy-preserving AI. To this end, OpenMined offers several privacy-preserving tools on its GitHub. A main tool is PySyft, which allows “computing on data you do not own and cannot see”. Read more...

SHAP: SHapley Additive exPlanations

The SHAP package is built on the concept of Shapley values and can generate explanations model-agnostically, so it only requires input and output values, not model internals: SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions. (README) Additionally, this package also contains several model-specific implementations of Shapley values that are optimized for a particular machine learning model and sometimes even for a particular library. Read more...
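
A short sketch showing both flavours, the model-specific TreeExplainer and the model-agnostic KernelExplainer (model and dataset choices are mine):

```python
import shap
import xgboost
from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
model = xgboost.XGBRegressor(n_estimators=100).fit(X, y)

# Model-specific, fast path for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Model-agnostic alternative: KernelExplainer only needs a predict function
# plus a background sample to integrate features out.
kernel_explainer = shap.KernelExplainer(model.predict, shap.sample(X, 100))

# Summary plot of per-feature contributions across the dataset.
shap.summary_plot(shap_values, X)
```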

TensorFlow Privacy

TensorFlow Privacy is a library that lets you replace standard TensorFlow optimizers with differentially private counterparts, i.e. implementations of stochastic gradient descent (SGD) with differential privacy. Because large neural networks and other differentiable models have a very large learning capacity, a model can achieve high performance on uncommon training inputs by simply “memorizing” them. If the training data is sensitive, for example information about a specific user, this is undesired behavior that may leak private information. Read more...
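
A hedged sketch of swapping in a differentially private Keras optimizer (the hyperparameter values are illustrative placeholders, not recommendations):

```python
import tensorflow as tf
from tensorflow_privacy.privacy.optimizers.dp_optimizer_keras import DPKerasSGDOptimizer

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10),
])

# Drop-in replacement for tf.keras.optimizers.SGD: per-example gradients are
# clipped and Gaussian noise is added before each update.
optimizer = DPKerasSGDOptimizer(
    l2_norm_clip=1.0,        # per-example gradient clipping norm
    noise_multiplier=1.1,    # scale of the added Gaussian noise
    num_microbatches=250,    # must evenly divide the batch size
    learning_rate=0.15,
)

# The loss must be computed per example (no reduction) for microbatching.
loss = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction=tf.losses.Reduction.NONE)
model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])
```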

TreeInterpreter

This library provides a separate predict() function for scikit-learn tree-based models (including ensembles) that decomposes each prediction into interpretable elements of the form prediction = bias + feature_1_contribution + ... + feature_n_contribution. That is, it turns these tree-based models into a white box, where we can inspect how much each feature contributes to the predicted value (in the case of regression) or to the estimated probability of a class (in the case of classification). Read more...
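
A short sketch verifying that decomposition on a random forest regressor (the dataset choice is mine):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from treeinterpreter import treeinterpreter as ti

X, y = load_diabetes(return_X_y=True)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Decompose predictions into bias + per-feature contributions.
prediction, bias, contributions = ti.predict(rf, X[:1])

# The decomposition is exact: bias + sum(contributions) == prediction.
print("Prediction:", prediction[0])
print("Bias + contributions:", bias[0] + contributions[0].sum())
```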