Tutorial 1: FixOut on Tabular Data with LIME explanations

In this tutorial, we apply FixOut to the German Credit dataset. First, let us import the necessary packages.

In [1]:
# add the local FixOut repository to the Python path
base = 'C:\\alves\\git\\expout\\'
import os, sys
module_path = os.path.abspath(os.path.join(base))

if module_path not in sys.path:
    sys.path.append(module_path)

from fixout.core_tabular import FixOutTabular, EnsembleOutTabular
from fixout.lime_tabular_global import TabularExplainer
from fixout.core import load_data

import numpy as np
import pandas as pd

from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
WARNING:root:No module named 'tensorflow': AdversarialDebiasing will be unavailable. To install, run:
pip install 'aif360[AdversarialDebiasing]'

Parameters

In order to use FixOut, we must set the values of a few parameters. Concerning the dataset, we need to provide the indexes of the sensitive features (the "sensitive" parameter) as well as the indexes of the categorical features (the "all_categorical_features" parameter). Concerning the classifier, we also need to specify which algorithm we want to use as the original model.

A list of the classifiers that can be used with FixOut is given below. All algorithms are run with the default parameters defined in the scikit-learn library (a sketch of one possible way to map these names to scikit-learn estimators is shown after the list).

  • Neural Network (MLP)
  • Logistic Regression
  • Random Forest (RF)
  • AdaBoost
  • Bagging
  • Gaussian Mixture
  • Gradient Boosting
  • Support Vector Machine (SVM)
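
For illustration only, one plausible way such an algorithm name could be mapped to a scikit-learn estimator is sketched below. The dictionary keys and the make_model helper are assumptions made for this tutorial, not FixOut's actual internal dispatch.

# Hypothetical mapping from an algorithm name to a scikit-learn estimator
# (illustration only, not FixOut's actual internals).
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              BaggingClassifier, GradientBoostingClassifier)
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

ALGOS = {
    "mlp": MLPClassifier,
    "logistic_regression": LogisticRegression,
    "rf": RandomForestClassifier,
    "adaboost": AdaBoostClassifier,
    "bagging": BaggingClassifier,
    "gaussian_mixture": GaussianMixture,
    "gradient_boosting": GradientBoostingClassifier,
    "svm": SVC,  # note: SVC needs probability=True to expose predict_proba
}

def make_model(algo_name):
    # every estimator is instantiated with its scikit-learn defaults
    return ALGOS[algo_name]()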

An example of how to set the parameter values is presented below.

In [2]:
general_param = {
    "train_size" : 0.7,
    "max_features" : 10,
    "sample_size" : 100
    }
data_param = {
    "name" : "german",
    "source_name" : base+"examples/datasets/german.data",
    "sep" : " ",
    "all_categorical_features" : [0,2,3,5,6,8,9,11,13,14,16,18,19],
    "sensitive" : [8,18,19], # statussex, telephone, foreignworker
    "priv_group" : {
        "statussex" : 2, # A93 male single
        "telephone" : 1, # A192 : yes, registered under the customers name
        "foreignworker" : 0 # A202 : no
        },
    "pos_label" : 0
    }

According to this configuration, FixOut will look at the top-10 features with the highest contributions ("max_features" parameter). The other parameter values given in this example are:

  • the explanation sample size is 100 ("sample_size" parameter), which means that ExpGlobal will select 100 instances to explain the model (a sketch of how such a global explanation can be aggregated from local LIME explanations is given after this list),
  • the dataset will be split into 70% for training and 30% for testing ("train_size" parameter).
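
The sketch below illustrates what a global explanation of this kind amounts to: local LIME contributions are computed for a sample of instances and averaged per feature. It assumes the standalone lime package and an invented helper name (global_contributions); it is only an illustration of the idea, not the code behind TabularExplainer.global_explanation.

# Minimal sketch: aggregate local LIME explanations into global feature
# contributions (illustration only, not FixOut's TabularExplainer).
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

def global_contributions(predict_proba, X_train, X_sample, feature_names,
                         categorical_features, num_features=10):
    explainer = LimeTabularExplainer(np.asarray(X_train),
                                     feature_names=feature_names,
                                     categorical_features=categorical_features,
                                     discretize_continuous=True)
    totals = np.zeros(np.asarray(X_train).shape[1])
    for row in np.asarray(X_sample):
        exp = explainer.explain_instance(row, predict_proba,
                                         num_features=num_features)
        for feat_idx, weight in exp.as_map()[1]:  # contributions for class 1
            totals[feat_idx] += weight
    return totals / len(X_sample)  # average contribution per feature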

We can now load the data and split it into training and test sets.

In [3]:
# load and split the data
X, y, class_names, current_feature_names, categorical_names = load_data(data_param["source_name"], data_param["all_categorical_features"])
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=general_param["train_size"], random_state=42)

Reliance on sensitive features: Pre-trained model

In [4]:
model = BaggingClassifier()

model.fit(X_train, y_train)
print("Original score:", model.score(X_test, y_test))
Original score: 0.7333333333333333
In [5]:
# explain the original model
explainer_original = TabularExplainer(model.predict_proba, X_train, categorical_features=data_param["all_categorical_features"])
explainer_original.global_explanation(n_samples=general_param["sample_size"])

explanations = [[current_feature_names[i], contrib] for i, contrib in explainer_original.get_top_k(k=general_param["max_features"])]
df = pd.DataFrame(explanations, columns = ["Feature", "Contribution"])

df
Out[5]:
   Feature               Contribution
0  otherdebtors             -0.047273
1  otherinstallmentplans    -0.042423
2  existingchecking         -0.021288
3  savings                   0.017307
4  housing                  -0.010034
5  peopleliable              0.008032
6  foreignworker             0.007339
7  job                      -0.005559
8  property                  0.005104
9  age                      -0.004119

As the sensitive feature 'foreignworker' appears in the top-10 most important features, the model is deemed unfair and FixOut builds an ensemble model using feature dropout.
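
To make the idea concrete, the following is a minimal sketch of a feature-dropout ensemble: one copy of the classifier is trained for each sensitive feature with that feature removed, plus one copy with all sensitive features removed, and the predicted probabilities of the copies are averaged. This is a simplified illustration only, and the class name SimpleDropoutEnsemble is invented here; it is not the actual EnsembleOutTabular implementation used in the next cell.

# Simplified feature-dropout ensemble (illustration only, not the actual
# EnsembleOutTabular implementation).
import numpy as np
from sklearn.base import clone

class SimpleDropoutEnsemble:
    def __init__(self, base_model, sensitive_features):
        self.base_model = base_model
        self.sensitive_features = list(sensitive_features)

    def fit(self, X, y):
        # one sub-model per dropped sensitive feature, plus one with all of them dropped
        self.drop_sets_ = [[f] for f in self.sensitive_features]
        self.drop_sets_.append(self.sensitive_features)
        self.models_ = [clone(self.base_model).fit(np.delete(np.asarray(X), drop, axis=1), y)
                        for drop in self.drop_sets_]
        return self

    def predict_proba(self, X):
        # average the class probabilities of all sub-models
        probs = [m.predict_proba(np.delete(np.asarray(X), drop, axis=1))
                 for m, drop in zip(self.models_, self.drop_sets_)]
        return np.mean(probs, axis=0)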

Reliance on sensitive features: Ensemble Model

In [6]:
# make an ensemble
ensemble = EnsembleOutTabular(model, sensitive_features=data_param["sensitive"])
ensemble.fit(X_train, y_train)
print("Ensemble score:", ensemble.score(X_test, y_test))
Ensemble score: 0.7366666666666667
In [7]:
# explain the ensemble
explainer_ensemble = TabularExplainer(ensemble.predict_proba, X_train, categorical_features=data_param["all_categorical_features"])
explainer_ensemble.global_explanation(n_samples=general_param["sample_size"])

explanations = [[current_feature_names[i], contrib] for i, contrib in explainer_ensemble.get_top_k(k=general_param["max_features"])]
df = pd.DataFrame(explanations, columns = ["Feature", "Contribution"])

df
Out[7]:
   Feature               Contribution
0  otherinstallmentplans    -0.042915
1  savings                   0.018351
2  housing                  -0.012803
3  otherdebtors             -0.011958
4  existingchecking         -0.011763
5  credithistory             0.006263
6  job                      -0.004934
7  peopleliable             -0.004572
8  installmentrate           0.003194
9  employmentsince          -0.002490

The sensitive features no longer appear in the list of top-10 most important features.
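
As a quick illustrative check (not part of FixOut itself; it only reuses objects already defined above), we can verify programmatically that no sensitive feature name remains among the ensemble's top-k features:

# illustrative check: no sensitive feature name should appear in the ensemble's top-k
sensitive_names = {current_feature_names[i] for i in data_param["sensitive"]}
top_features = set(df["Feature"])
assert not sensitive_names & top_features, "a sensitive feature is still in the top-k"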
