
Iterative Knockout – A Method for Finding Feature Importance and High-Order Interactions in (Recurrent) Neural Networks [Python] [Keras]
Description
This repo contains Python scripts for determining feature importance and high-order interactions in a recurrent neural network in Keras. The idea is to train a model, then selectively nullify features and recalculate accuracy. If the accuracy changes significantly, the nullified feature (or features) is considered important. The scripts contain functions to calculate accuracy by nullifying features one at a time, by a determined gram size (that is, n-sized chunks of features next to each other in sequence), or by high-order feature nullification for all combinations of features. These functions work for data formatted for standard timestep processing, i.e. (sample size, # timesteps, # feature categories). Later, I may add this method for convnets to find feature importance in images, but I think that will be too computationally demanding.
Pipeline
As stated earlier, the data should be in the format (n, # timesteps, # feature categories) e.g. if you were modelling text it would be (# passages, length of passages, one-hot of the word(s) that appears at that timestep). The Data script contains some functions for generating random data in this format. The model script currently builds a simple bi-directional LSTM with binary output, so if you use this model your labels will need to be binary and in the format (n,).
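To make the expected layout concrete, here is a minimal sketch that one-hot encodes a few toy sequences into the (n, # timesteps, # feature categories) shape. The vocabulary, the sequences, and the helper name one_hot_encode are made up for illustration and are not part of the Data script.

```python
import numpy as np

# Hypothetical toy vocabulary; the real feature categories depend on your data.
vocab = ["A", "G", "C", "U"]
sequences = ["AGCU", "GGCA", "UUAC"]  # 3 samples, 4 timesteps each


def one_hot_encode(seqs, vocab):
    """Return an array of shape (n samples, # timesteps, # feature categories)."""
    lookup = {symbol: i for i, symbol in enumerate(vocab)}
    encoded = np.zeros((len(seqs), len(seqs[0]), len(vocab)), dtype=np.float32)
    for n, seq in enumerate(seqs):
        for t, symbol in enumerate(seq):
            encoded[n, t, lookup[symbol]] = 1.0
    return encoded


features = one_hot_encode(sequences, vocab)
labels = np.array([1, 0, 0])  # binary labels with shape (n,)
print(features.shape, labels.shape)  # (3, 4, 4) (3,)
```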
There are two ways to knock out data. One is to knock out data, retrain the model after each knockout, and compare model accuracy. Because this requires retraining for every knockout, it is not feasible for high-order interactions: a sample with 10 timesteps requires over 1000 iterations. Instead, I propose that you split your data into two numpy arrays (for example with train_test_split from sklearn.model_selection, which replaced the old sklearn.cross_validation module) and use one of the arrays to train the model.
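The "over 1000 iterations" figure and the split can be checked with a short sketch. It counts every non-empty combination of 10 timesteps and then holds out part of a random array for the knockout step. The 50/50 ratio, the random data, and the variable names are assumptions made for illustration.

```python
from math import comb

import numpy as np

# Number of non-empty timestep subsets for a 10-timestep sample:
# the sum over k of C(10, k), which equals 2**10 - 1.
timesteps = 10
n_knockouts = sum(comb(timesteps, k) for k in range(1, timesteps + 1))
print(n_knockouts)  # 1023

# Placeholder data in the (n, # timesteps, # feature categories) layout.
rng = np.random.default_rng(0)
features = rng.random((100, timesteps, 4))
labels = rng.integers(0, 2, size=100)

# Shuffle the sample indices, then train on one half and knock out the other.
order = rng.permutation(len(features))
train_idx, knockout_idx = order[:50], order[50:]
features_train, labels_train = features[train_idx], labels[train_idx]
features_knockout, labels_knockout = features[knockout_idx], labels[knockout_idx]
print(features_train.shape, features_knockout.shape)  # (50, 10, 4) (50, 10, 4)
```

Retraining 1023 times is what makes the first method impractical; with the held-out array, each knockout only costs one prediction pass.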
After the model has been trained, use the second array for the knockout process. First, use the trained model to generate predictions. Use Mean_Log_Loss from the Metrics script on the predictions to calculate a baseline accuracy for the model. Then use your desired knockout function (also in the Metrics script): Single_Iterative_Knockout, N_Gram_Iterative_Knockout, or High_Order_Iterative_Knockout. This performs the knockout process accordingly and generates predictions for each iteration. It then calculates the change in accuracy using the same cost function (mean log loss) as before and returns a list of accuracy changes and a list of indices; a higher number means that the iteration had a more significant effect on the model's accuracy.
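The knockout loop described above might look roughly like the sketch below. It is not the code from the Metrics script: StandInModel is a hypothetical replacement for a trained Keras model, mean_log_loss is a plain binary log loss, and "nullify" is assumed to mean setting a timestep to zero.

```python
import numpy as np


class StandInModel:
    """Hypothetical stand-in for a trained Keras model: only timestep 2 matters."""

    def predict(self, x):
        # Predicted probability is 0.9 when category 0 is set at timestep 2, else 0.1.
        return 0.1 + 0.8 * x[:, 2, 0]


def mean_log_loss(predictions, labels, eps=1e-10):
    """Plain binary log loss, clipped so the log never sees 0."""
    p = np.clip(np.asarray(predictions, dtype=float), eps, 1 - eps)
    y = np.asarray(labels, dtype=float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def single_knockout(features, model, labels, baseline):
    """Zero one timestep at a time and report the loss change versus the baseline."""
    changes = []
    for t in range(features.shape[1]):
        knocked = features.copy()
        knocked[:, t, :] = 0.0  # nullify every category at timestep t
        loss = mean_log_loss(model.predict(knocked), labels)
        changes.append(loss - baseline)
    return changes


rng = np.random.default_rng(0)
features = np.eye(4)[rng.integers(0, 4, size=(200, 5))]  # one-hot, shape (200, 5, 4)
labels = features[:, 2, 0].astype(int)  # label depends on timestep 2 only
model = StandInModel()
baseline = mean_log_loss(model.predict(features), labels)
changes = single_knockout(features, model, labels, baseline)
print(int(np.argmax(changes)))  # 2
```

Only the timestep the stand-in model depends on produces a non-zero change, which is the signal the knockout functions look for.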
Usage
Metrics
Mean_Log_Loss(predictions, labels, limit=10)
- Description: calculates accuracy of predictions using mean log loss for cost.
- predictions: A list of predictions as outputted from the model.
- labels: An array of labels that correspond, in order, to the predictions.
- limit: An integer; 10^-limit is added to the difference between predictions and labels to avoid taking the log of 0, which happens when Keras predicts a value too close to the label.
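As a rough illustration of how a 10^-limit guard keeps the log finite, here is a sketch based on the standard binary log loss. The exact formula in the Metrics script is not shown in this README, so treat mean_log_loss_sketch and its internals as an assumption and not as the actual implementation.

```python
import numpy as np


def mean_log_loss_sketch(predictions, labels, limit=10):
    """Binary log loss averaged over samples, with a 10**-limit guard.

    Assumption: this is the standard log loss, not necessarily the exact
    formula used in the Metrics script.
    """
    eps = 10.0 ** -limit
    p = np.asarray(predictions, dtype=float)
    y = np.asarray(labels, dtype=float)
    # Probability the model assigned to the true label, nudged away from 0.
    p_true = np.where(y == 1, p, 1.0 - p) + eps
    return float(-np.mean(np.log(p_true)))


print(mean_log_loss_sketch([0.9, 0.2, 1.0], [1, 0, 1]))
print(mean_log_loss_sketch([0.0], [1]))  # finite thanks to the guard
```

With this formulation a lower value means better predictions, so a knockout that matters shows up as an increase over the baseline.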
Single_Iterative_Knockout(features_knockout, model, labels, baseline)
- Description: Calculates feature importance, one feature at a time.
- features_knockout: An array of features that have been set aside to perform the knockout on.
- model: A trained Keras model for predicting features_knockout.
- labels: An array of labels to correspond with features_knockout.
- baseline: A float outputted from Mean_Log_Loss.
N_Gram_Iterative_Knockout(features_knockout, model, labels, baseline, gram_size=2)
- Description: Calculates feature importance in sequential chunks of features of a determined length.
- features_knockout: An array of features that have been set aside to perform the knockout on.
- model: A trained Keras model for predicting features_knockout.
- labels: An array of labels to correspond with features_knockout.
- baseline: A float outputted from Mean_Log_Loss.
- gram_size: The number of adjacent features to group together.
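To show which chunks a given gram size produces, the sketch below enumerates the windows of adjacent timesteps. The helper n_gram_windows and the inclusive 'start:stop' labels are assumptions, chosen to match the labels quoted in the example section.

```python
def n_gram_windows(sequence_length, gram_size=2):
    """Return inclusive (start, stop) index pairs for each run of adjacent timesteps."""
    return [(start, start + gram_size - 1)
            for start in range(sequence_length - gram_size + 1)]


windows = n_gram_windows(sequence_length=10, gram_size=3)
window_labels = [f"{start}:{stop}" for start, stop in windows]
print(window_labels)
# ['0:2', '1:3', '2:4', '3:5', '4:6', '5:7', '6:8', '7:9']

# Windows that contain a target timestep at index 5.
target = 5
covering = [label for (start, stop), label in zip(windows, window_labels)
            if start <= target <= stop]
print(covering)  # ['3:5', '4:6', '5:7']
```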
High_Order_Iterative_Knockout(features_knockout, model, labels, baseline)
- Description: Calculates feature importance for all possible combinations of features.
- features_knockout: An array of features that have been set aside to perform the knockout on.
- model: A trained Keras model for predicting features_knockout.
- labels: An array of labels to correspond with features_knockout.
- baseline: A float outputted from Mean_Log_Loss.
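"All possible combinations" grows quickly with sequence length. The sketch below enumerates them with itertools; the helper knockout_combinations is hypothetical, and whether the Metrics script includes single features or the full set among its combinations is an assumption here.

```python
from itertools import combinations


def knockout_combinations(sequence_length):
    """Yield every non-empty combination of timestep indices, smallest first."""
    for order in range(1, sequence_length + 1):
        yield from combinations(range(sequence_length), order)


combos = list(knockout_combinations(4))
print(len(combos))  # 15, i.e. 2**4 - 1
print(combos[:6])   # [(0,), (1,), (2,), (3,), (0, 1), (0, 2)]
```

For 10 timesteps the same generator yields 1023 combinations, so each extra timestep roughly doubles the number of prediction passes.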
Data
Create_Features(sample_size, sequence_length, feature_length)
- Description: Creates random features based on parameters.
- sample_size: Desired sample size.
- sequence_length: Desired sequence length.
- feature_length: Desired feature length.
Create_Labels(sample_size)
- Description: Creates random, binary labels.
- sample_size: Desired sample size.
Sample_Data(sample_size, sequence_length, target)
- Description: A function to generate data that emulates DNA and targets a specified site for association with a positive label.
- sample_size: Desired sample size.
- sequence_length: Desired sequence length.
- target: An integer of a timestep to be highly associated with a positive label.
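For readers who want to see what such data could look like, here is a minimal sketch of a generator in the same spirit. It is not the code from the Data script: the function name sample_data_sketch, the seed parameter, the column order of the nucleotides, and the fully deterministic label rule are all assumptions.

```python
import numpy as np

# Same alphabet as the example section; the column order is an assumption.
NUCLEOTIDES = ["A", "G", "C", "U"]


def sample_data_sketch(sample_size, sequence_length, target, seed=None):
    """Random one-hot sequences, labelled 1 when a G sits at the target timestep."""
    rng = np.random.default_rng(seed)
    codes = rng.integers(0, len(NUCLEOTIDES), size=(sample_size, sequence_length))
    features = np.eye(len(NUCLEOTIDES), dtype=np.float32)[codes]
    labels = (codes[:, target] == NUCLEOTIDES.index("G")).astype(int)
    return features, labels


features, labels = sample_data_sketch(1000, 10, target=5, seed=0)
print(features.shape, labels.shape)  # (1000, 10, 4) (1000,)
```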
Models
Create_RNN(input_shape)
- Description: Creates a simple bidirectional LSTM.
- input_shape: A tuple specifying the input shape. Note that the example below passes features.shape[1:], i.e. (# timesteps, # feature categories).
Example
Here is an example of how to perform the iterative knockout using some sample data that I made to emulate DNA. The features have shape (n samples, sequence length, 4 possible nucleotides), with the nucleotides being A, G, C, and U. The labels are binary; think of it as having a disease (0) or not having a disease (1). In this example I target a particular site: if a “G” appears at that site, the person is automatically labelled 1. Conversely, if the sequence does not contain a “G” at the target site, it is labelled 0. Thus the target site should be highly important to the accuracy of the model.
from Data import Sample_Data
from Models import Create_RNN
from keras.callbacks import EarlyStopping
from Metrics import Mean_Log_Loss, Single_Iterative_Knockout, N_Gram_Iterative_Knockout, High_Order_Iterative_Knockout
import numpy as np
# Set some parameters - index 5 (the 6th feature) will be highly associated with a positive label.
sample_size = 1000
sequence_length = 10
target = 5
# Generate data that will be used to train the model.
features, phenotype = Sample_Data(sample_size = sample_size, sequence_length = sequence_length, target = target)
print("features shape: ", features.shape)
print("labels shape: ", phenotype.shape)
# Generate some separate data that will be used in predictions.
features_test, phenotype_test = Sample_Data(sample_size = sample_size, sequence_length = sequence_length, target = target)
# Train the model
model = Create_RNN(features.shape[1:])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics =['accuracy'])
model.summary()
model.fit(features, phenotype,
          batch_size=32,
          validation_split=0.2,
          epochs=50,
          callbacks=[EarlyStopping(patience=4, monitor='val_loss')],
          verbose=0)
# Get a baseline accuracy for the model
predictions = model.predict(features_test).reshape(sample_size,).tolist()
accuracy = Mean_Log_Loss(predictions = predictions, labels = phenotype_test)
print("baseline accuracy: ",accuracy,"\n")
# Generate single knockout predictions (index 5, the target, should be significantly higher than the others).
single_iterative_knockout = Single_Iterative_Knockout(features_knockout = features_test, model = model, baseline = accuracy, labels = phenotype_test)
print("single iterative knockout accuracy change: ")
print(single_iterative_knockout)
Notice that the change for the target feature (the 6th one, index 5) is about 3 orders of magnitude higher than for the others.
# Generate knockout predictions for a specified gram size - I'll use 3.
n_gram_knockout, index = N_Gram_Iterative_Knockout(features_knockout = features_test, model = model, baseline = accuracy, labels = phenotype_test, gram_size = 3)
print("n-gram knockout accuracy change: ")
print( n_gram_knockout,"\n")
print("index: ")
print(index)
In the n-gram knockout, iterations that knock out the target feature ('3:5', '4:6', and '5:7') are about 1 order of magnitude higher than the others. In this case, though, increasing the gram size will decrease the significance of containing the target feature, as there is no association between the other features and a positive phenotype in my sample data.
# Generate high-order iterative knockouts
high_order_knockout, index = High_Order_Iterative_Knockout(features_knockout = features_test, model = model, baseline = accuracy, labels = phenotype_test)
print("high-order knockout accuracy change: ")
print( high_order_knockout,"\n")
print("index: ")
print(index)
In the high-order knockout, iterations that knock out the target feature should show a higher change in accuracy. Since there are no high-order interactions in my sample data, this association should dwindle as the order increases, e.g. a 3-5 couplet will have a higher value than a 3-5-7-8-9 quintuplet.
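With many combinations, it can help to rank the iterations by their change in accuracy before reading them. The sketch below does this with np.argsort; the values in changes and index are made-up placeholders for illustration, not output from the scripts.

```python
import numpy as np

# Made-up values for illustration only; real output comes from the knockout functions.
changes = [0.002, 0.001, 0.950, 0.003]
index = ["0:2", "1:3", "3:5", "6:8"]

order = np.argsort(changes)[::-1]  # largest change first
ranked = [(index[i], changes[i]) for i in order]
for label, change in ranked:
    print(label, change)
```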
Installation
I might build a wheel later, but for now just install the dependencies (NumPy, Keras, and TensorFlow) and import the scripts.