In this tutorial, we apply FixOut to the German Credit dataset. First, let us import the necessary packages.
import os, sys

# Path to the local FixOut repository (adjust to your machine)
base = 'C:\\alves\\git\\expout\\'
module_path = os.path.abspath(os.path.join(base))
if module_path not in sys.path:
    sys.path.append(module_path)
from fixout.core_tabular import FixOutTabular, EnsembleOutTabular
from fixout.lime_tabular_global import TabularExplainer
from fixout.core import load_data
import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
In order to use FixOut, we must set the values of a few parameters. Concerning the dataset, we need to provide the indexes of the sensitive features (the "sensitive" parameter) and of the categorical features (the "all_categorical_features" parameter). Concerning the classifier, we also need to choose which algorithm to use as the original model; in this tutorial we use scikit-learn's BaggingClassifier.
FixOut works with standard scikit-learn classifiers; a few possibilities are shown below. All algorithms are run with the default hyperparameters defined in the scikit-learn library.
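For instance (an illustrative selection of standard scikit-learn estimators, not an official FixOut list):

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# All instantiated with scikit-learn's default hyperparameters.
candidate_classifiers = [
    BaggingClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    LogisticRegression(),
    DecisionTreeClassifier(),
]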
An example of how to set up the parameter values is presented below.
general_param = {
    "train_size" : 0.7,
    "max_features" : 10,
    "sample_size" : 100
}
data_param = {
    "name" : "german",
    "source_name" : base + "examples/datasets/german.data",
    "sep" : " ",
    "all_categorical_features" : [0,2,3,5,6,8,9,11,13,14,16,18,19],
    "sensitive" : [8,18,19],  # statussex, telephone, foreignworker
    "priv_group" : {
        "statussex" : 2,      # A93 : male, single
        "telephone" : 1,      # A192 : yes, registered under the customer's name
        "foreignworker" : 0   # A202 : no
    },
    "pos_label" : 0
}
According to this configuration, FixOut will inspect the top-10 features with the highest contributions (the "max_features" parameter). The other parameter values in this example are "train_size" : 0.7, the fraction of the data used to train the original model (the remaining 30% is held out for testing), and "sample_size" : 100, the number of instances sampled when computing the global explanation.
We can now load the data, train the original model, and evaluate it on the test set.
# load and split the data
X, y, class_names, current_feature_names, categorical_names = load_data(data_param["source_name"], data_param["all_categorical_features"])
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=general_param["train_size"], random_state=42)
# train and evaluate the original model
model = BaggingClassifier()
model.fit(X_train, y_train)
print("Original score:", model.score(X_test, y_test))
# explain the original model
explainer_original = TabularExplainer(model.predict_proba, X_train, categorical_features=data_param["all_categorical_features"])
explainer_original.global_explanation(n_samples=general_param["sample_size"])
explanations = [[current_feature_names[i], contrib] for i, contrib in explainer_original.get_top_k(k=general_param["max_features"])]
df = pd.DataFrame(explanations, columns = ["Feature", "Contribution"])
df
As the sensitive feature 'foreignworker' appears among the top-10 most important features, the model is deemed unfair, so FixOut builds an ensemble model using feature dropout.
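Conceptually, the feature-dropout idea can be illustrated with the following minimal sketch (an illustration only: the drop_columns helper and the probability-averaging aggregation are assumptions here, and the actual behaviour of EnsembleOutTabular may differ).

import numpy as np
from sklearn.base import clone

def drop_columns(X, cols):
    # Hypothetical helper: remove the given feature columns from X.
    return np.delete(np.asarray(X), cols, axis=1)

# Train one copy of the original classifier per sensitive feature (with that
# feature removed), plus one copy with all sensitive features removed.
drop_sets = [[i] for i in data_param["sensitive"]] + [data_param["sensitive"]]
pool = [(cols, clone(model).fit(drop_columns(X_train, cols), y_train))
        for cols in drop_sets]

def dropout_predict_proba(X):
    # Average the class probabilities predicted by the pool members.
    return np.mean([clf.predict_proba(drop_columns(X, cols)) for cols, clf in pool], axis=0)

In practice, EnsembleOutTabular takes care of this for us: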
# make an ensemble
ensemble = EnsembleOutTabular(model, sensitive_features=data_param["sensitive"])
ensemble.fit(X_train, y_train)
print("Ensemble score:", ensemble.score(X_test, y_test))
# explain the ensemble
explainer_ensemble = TabularExplainer(ensemble.predict_proba, X_train, categorical_features=data_param["all_categorical_features"])
explainer_ensemble.global_explanation(n_samples=general_param["sample_size"])
explanations = [[current_feature_names[i], contrib] for i, contrib in explainer_ensemble.get_top_k(k=general_param["max_features"])]
df = pd.DataFrame(explanations, columns = ["Feature", "Contribution"])
df
The sensitive features no longer appear in the list of top-10 most important features.
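As a final sanity check, we can verify programmatically that none of the configured sensitive features remain in the ensemble's top-10 list (a short illustrative snippet reusing the variables defined above):

sensitive_names = {current_feature_names[i] for i in data_param["sensitive"]}
remaining = sensitive_names & set(df["Feature"])
print("Sensitive features still in the top-10:", sorted(remaining) if remaining else "none")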