In this tutorial, we apply FixOut to the German Credit dataset. First, let us import the necessary packages.
import os, sys

# Path to the local FixOut repository (adjust to your machine)
base = 'C:\\alves\\git\\expout\\'
module_path = os.path.abspath(os.path.join(base))
if module_path not in sys.path:
    sys.path.append(module_path)
from fixout.core_tabular import FixOutTabular, EnsembleOutTabular
from fixout.lime_tabular_global import TabularExplainer
from fixout.core import load_data
import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
In order to use FixOut, we must set the values of a few parameters. Concerning the dataset, we need to provide the indexes of the sensitive features (the "sensitive" parameter) and of the categorical features (the "all_categorical_features" parameter). Concerning the classifier, we also need to choose which algorithm to use as the original model; in this tutorial we use scikit-learn's BaggingClassifier.
FixOut works with standard scikit-learn classifiers; a few possibilities are shown below. All algorithms are run with the default hyperparameters defined in the scikit-learn library.
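For instance (an illustrative selection of standard scikit-learn estimators, not an official FixOut list):

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# All instantiated with scikit-learn's default hyperparameters.
candidate_classifiers = [
    BaggingClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    LogisticRegression(),
    DecisionTreeClassifier(),
]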
An example of how to set up the parameter values is presented below.
general_param = {
    "train_size" : 0.7,
    "max_features" : 10,
    "sample_size" : 100
}
data_param = {
    "name" : "german",
    "source_name" : base + "examples/datasets/german.data",
    "sep" : " ",
    "all_categorical_features" : [0,2,3,5,6,8,9,11,13,14,16,18,19],
    "sensitive" : [8,18,19],  # statussex, telephone, foreignworker
    "priv_group" : {
        "statussex" : 2,      # A93 : male, single
        "telephone" : 1,      # A192 : yes, registered under the customer's name
        "foreignworker" : 0   # A202 : no
    },
    "pos_label" : 0
}
According to this configuration, FixOut will inspect the top-10 features with the highest contributions (the "max_features" parameter). The other parameter values in this example are "train_size" : 0.7, the fraction of the data used to train the original model (the remaining 30% is held out for testing), and "sample_size" : 100, the number of instances sampled when computing the global explanation.
We can now load the data, train the original model, and evaluate it on the test set.
# load and split the data
X, y, class_names, current_feature_names, categorical_names = load_data(data_param["source_name"], data_param["all_categorical_features"])
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=general_param["train_size"], random_state=42)
# train and evaluate the original model
model = BaggingClassifier()
model.fit(X_train, y_train)
print("Original score:", model.score(X_test, y_test))
# explain the original model
explainer_original = TabularExplainer(model.predict_proba, X_train, categorical_features=data_param["all_categorical_features"])
explainer_original.global_explanation(n_samples=general_param["sample_size"])
explanations = [[current_feature_names[i], contrib] for i, contrib in explainer_original.get_top_k(k=general_param["max_features"])]
df = pd.DataFrame(explanations, columns = ["Feature", "Contribution"])
df
As the sensitive feature 'foreignworker' appears among the top-10 most important features, the model is deemed unfair, so FixOut builds an ensemble model using feature dropout.
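Conceptually, the feature-dropout idea can be illustrated with the following minimal sketch (an illustration only: the drop_columns helper and the probability-averaging aggregation are assumptions here, and the actual behaviour of EnsembleOutTabular may differ).

import numpy as np
from sklearn.base import clone

def drop_columns(X, cols):
    # Hypothetical helper: remove the given feature columns from X.
    return np.delete(np.asarray(X), cols, axis=1)

# Train one copy of the original classifier per sensitive feature (with that
# feature removed), plus one copy with all sensitive features removed.
drop_sets = [[i] for i in data_param["sensitive"]] + [data_param["sensitive"]]
pool = [(cols, clone(model).fit(drop_columns(X_train, cols), y_train))
        for cols in drop_sets]

def dropout_predict_proba(X):
    # Average the class probabilities predicted by the pool members.
    return np.mean([clf.predict_proba(drop_columns(X, cols)) for cols, clf in pool], axis=0)

In practice, EnsembleOutTabular takes care of this for us: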
# make an ensemble
ensemble = EnsembleOutTabular(model, sensitive_features=data_param["sensitive"])
ensemble.fit(X_train, y_train)
print("Ensemble score:", ensemble.score(X_test, y_test))
# explain the ensemble
explainer_ensemble = TabularExplainer(ensemble.predict_proba, X_train, categorical_features=data_param["all_categorical_features"])
explainer_ensemble.global_explanation(n_samples=general_param["sample_size"])
explanations = [[current_feature_names[i], contrib] for i, contrib in explainer_ensemble.get_top_k(k=general_param["max_features"])]
df = pd.DataFrame(explanations, columns = ["Feature", "Contribution"])
df
The sensitive features no longer appear in the list of top-10 most important features.
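As a final sanity check, we can verify programmatically that none of the configured sensitive features remain in the ensemble's top-10 list (a short illustrative snippet reusing the variables defined above):

sensitive_names = {current_feature_names[i] for i in data_param["sensitive"]}
remaining = sensitive_names & set(df["Feature"])
print("Sensitive features still in the top-10:", sorted(remaining) if remaining else "none")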