Modify slightly the Bert SQUAD interpretation tutorial #1519

Open · wants to merge 2 commits into base: master
60 changes: 30 additions & 30 deletions tutorials/Bert_SQUAD_Interpret.ipynb
@@ -11,11 +11,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook we demonstrate how to interpret Bert models using `Captum` library. In this particular case study we focus on a fine-tuned Question Answering model on SQUAD dataset using transformers library from Hugging Face: https://huggingface.co/transformers/\n",
"In this notebook we demonstrate how to interpret BERT models using the `Captum` library. In this particular case study we focus on a fine-tuned Question Answering model on the SQUAD dataset using a transformers library from Hugging Face: https://huggingface.co/transformers/.\n",
"\n",
"We show how to use interpretation hooks to examine and better understand embeddings, sub-embeddings, bert, and attention layers. \n",
"We show how to use interpretation hooks to examine and better understand embeddings, sub-embeddings, BERT, and attention layers. \n",
"\n",
"Note: Before running this tutorial, please install `seaborn`, `pandas` and `matplotlib`, `transformers`(from hugging face, tested on transformer version `4.3.0.dev0`) python packages."
"Note: Before running this tutorial, please install `seaborn`, `pandas`, `matplotlib`, and `transformers`(from hugging face, tested on transformer version `4.3.0.dev0`) python packages."
]
},
{
@@ -51,7 +51,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The first step is to fine-tune BERT model on SQUAD dataset. This can be easiy accomplished by following the steps described in hugging face's official web site: https://github.com/huggingface/transformers#run_squadpy-fine-tuning-on-squad-for-question-answering \n",
"The first step is to fine-tune the BERT model on the SQUAD dataset. This can be easiy accomplished by following the steps described in hugging face's official web site: https://github.com/huggingface/transformers#run_squadpy-fine-tuning-on-squad-for-question-answering.\n",
"\n",
"Note that the fine-tuning is done on a `bert-base-uncased` pre-trained model."
]
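For orientation, here is a minimal sketch of loading such a fine-tuned checkpoint with the Hugging Face API; the checkpoint path is a placeholder, not a path used by the notebook.

```python
import torch
from transformers import BertForQuestionAnswering, BertTokenizer

# Placeholder path: point this at the output directory produced by the SQUAD fine-tuning run.
model_path = 'path/to/bert-squad-checkpoint'

model = BertForQuestionAnswering.from_pretrained(model_path)
tokenizer = BertTokenizer.from_pretrained(model_path)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
model.eval()
model.zero_grad()  # clear any stale gradients before computing attributions
```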
@@ -105,7 +105,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Defining a custom forward function that will allow us to access the start and end postitions of our prediction using `position` input argument."
"Defining a custom forward function that will allow us to access the start and end postitions of our prediction using the `position` input argument."
]
},
{
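A rough sketch of what such a forward function can look like; function and argument names here are illustrative and the notebook's own definitions may differ slightly.

```python
def predict(inputs, token_type_ids=None, position_ids=None, attention_mask=None):
    # Run the QA model and return the raw start/end logits.
    output = model(inputs, token_type_ids=token_type_ids,
                   position_ids=position_ids, attention_mask=attention_mask)
    return output.start_logits, output.end_logits

def squad_pos_forward_func(inputs, token_type_ids=None, position_ids=None,
                           attention_mask=None, position=0):
    # `position` selects which head to attribute: 0 -> start logits, 1 -> end logits.
    pred = predict(inputs, token_type_ids=token_type_ids,
                   position_ids=position_ids, attention_mask=attention_mask)[position]
    # Reduce to one scalar per example so it can serve as an attribution target.
    return pred.max(1).values
```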
@@ -129,9 +129,9 @@
"source": [
"Let's compute attributions with respect to the `BertEmbeddings` layer.\n",
"\n",
"To do so, we need to define baselines / references, numericalize both the baselines and the inputs. We will define helper functions to achieve that.\n",
"To do so, we need to define baselines / references, and numericalize both the baselines and the inputs. We will define helper functions to achieve that.\n",
"\n",
"The cell below defines numericalized special tokens that will be later used for constructing inputs and corresponding baselines/references."
"The cell below defines numericalized special tokens that will be later used for constructing inputs and corresponding baselines / references."
]
},
{
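For instance, the special tokens can be numericalized directly from the tokenizer (variable names are only suggestive):

```python
ref_token_id = tokenizer.pad_token_id  # token used to fill reference (baseline) sequences
sep_token_id = tokenizer.sep_token_id  # separator between question and text, and final token
cls_token_id = tokenizer.cls_token_id  # token prepended to the question-text pair
```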
@@ -149,7 +149,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Below we define a set of helper function for constructing references / baselines for word tokens, token types and position ids. We also provide separate helper functions that allow to construct attention masks and bert embeddings both for input and reference."
"Below we define a set of helper function for constructing baselines / references for word tokens, token types and position IDs. We also provide separate helper functions that allow to construct attention masks and BERT embeddings both for input and reference."
]
},
{
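A condensed sketch of such helpers, assuming the special-token ids defined above; the notebook additionally defines an analogous helper for whole BERT embeddings.

```python
def construct_input_ref_pair(question, text, ref_token_id, sep_token_id, cls_token_id):
    question_ids = tokenizer.encode(question, add_special_tokens=False)
    text_ids = tokenizer.encode(text, add_special_tokens=False)

    # input:     [CLS] question [SEP] text [SEP]
    # reference: same layout, with every content token replaced by the reference token
    input_ids = [cls_token_id] + question_ids + [sep_token_id] + text_ids + [sep_token_id]
    ref_input_ids = [cls_token_id] + [ref_token_id] * len(question_ids) + [sep_token_id] + \
                    [ref_token_id] * len(text_ids) + [sep_token_id]

    sep_ind = len(question_ids) + 1  # index of the first [SEP]
    return (torch.tensor([input_ids], device=device),
            torch.tensor([ref_input_ids], device=device),
            sep_ind)

def construct_input_ref_token_type_pair(input_ids, sep_ind):
    seq_len = input_ids.size(1)
    # segment 0 for the question (up to and including the first [SEP]), segment 1 for the text
    token_type_ids = torch.tensor([[0 if i <= sep_ind else 1 for i in range(seq_len)]], device=device)
    return token_type_ids, torch.zeros_like(token_type_ids)

def construct_input_ref_pos_id_pair(input_ids):
    seq_len = input_ids.size(1)
    position_ids = torch.arange(seq_len, dtype=torch.long, device=device).unsqueeze(0)
    return position_ids, torch.zeros_like(position_ids)

def construct_attention_mask(input_ids):
    # single example without padding: attend to every position
    return torch.ones_like(input_ids)
```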
@@ -203,7 +203,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's define the `question - text` pair that we'd like to use as an input for our Bert model and interpret what the model was forcusing on when predicting an answer to the question from given input text "
"Let's define the `question - text` pair that we'd like to use as an input for our BERT model and interpret what the model was focusing on when predicting an answer to the question from a given input text."
]
},
{
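For example, a question-text pair consistent with the tokens discussed later in this tutorial (the exact wording in the notebook may differ), run through the helpers sketched above:

```python
question = "What is important to us?"
text = "It is important to us to include, empower and support humans of all kinds."

input_ids, ref_input_ids, sep_ind = construct_input_ref_pair(
    question, text, ref_token_id, sep_token_id, cls_token_id)
token_type_ids, ref_token_type_ids = construct_input_ref_token_type_pair(input_ids, sep_ind)
position_ids, ref_position_ids = construct_input_ref_pos_id_pair(input_ids)
attention_mask = construct_attention_mask(input_ids)

all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
```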
@@ -261,7 +261,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's make predictions using input, token type, position id and a default attention mask."
"Now let's make predictions using input, token type, position ID and a default attention mask."
]
},
{
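Continuing the sketch above, the prediction step amounts to running the model and decoding the arg-max start/end span:

```python
start_scores, end_scores = predict(input_ids,
                                   token_type_ids=token_type_ids,
                                   position_ids=position_ids,
                                   attention_mask=attention_mask)

# The predicted answer spans from the arg-max of the start scores to the arg-max of the end scores.
answer_tokens = all_tokens[torch.argmax(start_scores): torch.argmax(end_scores) + 1]
print('Predicted answer:', ' '.join(answer_tokens))
```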
@@ -293,7 +293,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"There are two different ways of computing the attributions for emebdding layers. One option is to use `LayerIntegratedGradients` and compute the attributions with respect to `BertEmbedding`. The second option is to use `LayerIntegratedGradients` for each `word_embeddings`, `token_type_embeddings` and `position_embeddings` and compute the attributions w.r.t each embedding vector.\n"
"There are two different ways of computing the attributions for emebedding layers. One option is to use `LayerIntegratedGradients` and compute the attributions with respect to `BertEmbedding`. The second option is to use `LayerIntegratedGradients` for each `word_embeddings`, `token_type_embeddings` and `position_embeddings` and compute the attributions with respect to each embedding vector.\n"
]
},
{
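A sketch of the first option, attributing with respect to the whole `BertEmbeddings` module with `LayerIntegratedGradients`, reusing the forward function and inputs sketched earlier:

```python
from captum.attr import LayerIntegratedGradients

lig = LayerIntegratedGradients(squad_pos_forward_func, model.bert.embeddings)

# Attributions for the start position (position=0) and the end position (position=1).
attributions_start, delta_start = lig.attribute(
    inputs=input_ids, baselines=ref_input_ids,
    additional_forward_args=(token_type_ids, position_ids, attention_mask, 0),
    return_convergence_delta=True)
attributions_end, delta_end = lig.attribute(
    inputs=input_ids, baselines=ref_input_ids,
    additional_forward_args=(token_type_ids, position_ids, attention_mask, 1),
    return_convergence_delta=True)

def summarize_attributions(attributions):
    # Sum over the embedding dimension and normalize to get one score per token.
    attributions = attributions.sum(dim=-1).squeeze(0)
    return attributions / torch.norm(attributions)

attributions_start_sum = summarize_attributions(attributions_start)
attributions_end_sum = summarize_attributions(attributions_end)
```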
@@ -472,9 +472,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's look into the sub-embeddings of `BerEmbeddings` and try to understand the contributions and roles of each of them for both start and end predicted positions.\n",
"Now let's look into the sub-embeddings of `BertEmbeddings` and try to understand the contributions and roles of each of them for both start and end predicted positions.\n",
"\n",
"To do so, we will use `LayerIntegratedGradients` for all three layer: `word_embeddings`, `token_type_embeddings` and `position_embeddings`."
"To do so, we will use `LayerIntegratedGradients` for all three layers: `word_embeddings`, `token_type_embeddings` and `position_embeddings`."
]
},
{
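A sketch of the second option; recent Captum releases accept a list of layers, so all three sub-embeddings can be attributed in one call (alternatively, three separate `LayerIntegratedGradients` instances can be used). The `summarize_attributions` helper from the earlier sketch is reused.

```python
lig2 = LayerIntegratedGradients(squad_pos_forward_func,
                                [model.bert.embeddings.word_embeddings,
                                 model.bert.embeddings.token_type_embeddings,
                                 model.bert.embeddings.position_embeddings])

# Returns one attribution tensor per sub-embedding layer.
attributions_start_sub = lig2.attribute(
    inputs=(input_ids, token_type_ids, position_ids),
    baselines=(ref_input_ids, ref_token_type_ids, ref_position_ids),
    additional_forward_args=(attention_mask, 0))

attr_start_word, attr_start_token_type, attr_start_pos = \
    [summarize_attributions(a) for a in attributions_start_sub]
```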
@@ -516,7 +516,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"An auxilary function that will help us to compute topk attributions and corresponding indices"
"An auxiliary function that will help us to compute top-K attributions and corresponding indices"
]
},
{
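One possible shape for such a helper, built on `torch.topk`; it assumes the `all_tokens` list from the earlier sketch and per-token attribution scores that have already been summed over the embedding dimension.

```python
def get_topk_attributed_tokens(attrs, k=5):
    # attrs: one attribution score per token
    values, indices = torch.topk(attrs, k)
    top_tokens = [all_tokens[idx] for idx in indices]
    return top_tokens, values, indices
```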
@@ -542,7 +542,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Computing topk attributions for all sub-embeddings and placing them in pandas dataframes for better visualization."
"Computing top-K attributions for all sub-embeddings and placing them in pandas dataframes for better visualization."
]
},
{
@@ -613,7 +613,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Below we can see top 5 attribution results from all three embedding types in predicting start positions."
"Below we can see the top 5 attribution results from all three embedding types in predicting start positions."
]
},
{
@@ -720,11 +720,11 @@
"source": [
"Word embeddings help to focus more on the surrounding tokens of the predicted answer's start position to such as em, ##power and ,. It also has high attribution for the tokens in the question such as what and ?.\n",
"\n",
"In contrast to to word embedding, token embedding type focuses more on the tokens in the text part such as important,em and start token to.\n",
"In contrast to word embedding, token embedding type focuses more on the tokens in the text part such as important,em and start token to.\n",
"\n",
"Position embedding also has high attribution score for the tokens surrounding to such as us and important. In addition to that, similar to word embedding we observe important tokens from the question.\n",
"\n",
"We can perform similar analysis, and visualize top 5 attributed tokens for all three embedding types, also for the end position prediction.\n"
"We can perform a similar analysis, and visualize top 5 attributed tokens for all three embedding types, also for the end position prediction.\n"
]
},
{
@@ -829,7 +829,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"It is interesting to observe high concentration of highly attributed tokens such as `of`, `kinds`, `support` and `##power` for end position prediction.\n",
"It is interesting to observe a high concentration of highly attributed tokens such as `of`, `kinds`, `support` and `##power` for end position prediction.\n",
"\n",
"The token `kinds`, which is the correct predicted token appears to have high attribution score both according word and position embeddings.\n"
]
@@ -838,22 +838,22 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Interpreting Bert Layers"
"# Interpreting BERT Layers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's look into the layers of our network. More specifically we would like to look into the distribution of attribution scores for each token across all layers in Bert model and dive deeper into specific tokens. \n",
"Now let's look into the layers of our network. More specifically we would like to look into the distribution of attribution scores for each token across all layers of the BERT model and dive deeper into specific tokens.\n",
"We do that using one of layer attribution algorithms, namely, layer conductance. However, we encourage you to try out and compare the results with other algorithms as well."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's define another version of squad forward function that takes emebddings as input argument. This is necessary for `LayerConductance` algorithm."
"Let's define another version of squad forward function that takes emebddings as an input argument. This is necessary for `LayerConductance` algorithm."
]
},
{
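A sketch of such a forward function: the pre-computed word embeddings are fed to the model through `inputs_embeds`, which bypasses the word-embedding lookup while position and token-type embeddings are still added internally.

```python
def squad_pos_forward_func2(input_emb, attention_mask=None, position=0):
    output = model(inputs_embeds=input_emb, attention_mask=attention_mask)
    pred = output.start_logits if position == 0 else output.end_logits
    return pred.max(1).values
```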
@@ -872,10 +872,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's iterate over all layers and compute the attributions for all tokens. In addition to that let's also choose a specific token that we would like to examine in detail, specified by an id `token_to_explain` and store related information in a separate array.\n",
"Let's iterate over all layers and compute the attributions for all tokens. In addition to that let's also choose a specific token that we would like to examine in detail, specified by an ID `token_to_explain` and store related information in a separate array.\n",
"\n",
"\n",
"Note: Since below code is iterating over all layers it can take over 5 seconds. Please be patient!"
"Note: Since the below code is iterating over all layers it can take over 5 seconds. Please be patient!"
]
},
{
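A condensed sketch of that loop using `LayerConductance` over the encoder layers; the word-level input and reference embeddings are derived from the ids defined earlier, and the value of `token_to_explain` is illustrative.

```python
from captum.attr import LayerConductance

input_embeddings = model.bert.embeddings.word_embeddings(input_ids)
ref_input_embeddings = model.bert.embeddings.word_embeddings(ref_input_ids)

layer_attrs_start = []        # per-layer, per-token attribution scores
layer_attrs_start_dist = []   # per-layer, unsummed scores of the chosen token
token_to_explain = 23         # illustrative index of the token to inspect in detail

for i in range(model.config.num_hidden_layers):
    lc = LayerConductance(squad_pos_forward_func2, model.bert.encoder.layer[i])
    layer_attributions = lc.attribute(inputs=input_embeddings,
                                      baselines=ref_input_embeddings,
                                      additional_forward_args=(attention_mask, 0))
    # BertLayer returns a tuple, so Captum may return a matching tuple; keep the hidden-state part.
    if isinstance(layer_attributions, tuple):
        layer_attributions = layer_attributions[0]
    layer_attrs_start.append(
        layer_attributions.sum(dim=-1).squeeze(0).detach().cpu().numpy())
    layer_attrs_start_dist.append(
        layer_attributions[0, token_to_explain, :].detach().cpu().numpy())
```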
@@ -914,9 +914,9 @@
"source": [
"The plot below represents a heat map of attributions across all layers and tokens for the start position prediction. \n",
"It is interesting to observe that the question word `what` gains increasingly high attribution from layer one to nine. In the last three layers that importance is slowly diminishing. \n",
"In contrary to `what` token, many other tokens have negative or close to zero attribution in the first 6 layers. \n",
"In contrast to `what` token, many other tokens have negative or close to zero attribution in the first 6 layers. \n",
"\n",
"We start seeing slightly higher attribution in tokens `important`, `us` and `to`. Interestingly token `em` is also assigned high attribution score which is remarkably high the last three layers.\n",
"We start seeing slightly higher attribution in tokens `important`, `us` and `to`. Interestingly token `em` is also assigned high attribution score which is remarkably high in the last three layers.\n",
"And lastly, our correctly predicted token `to` for the start position gains increasingly positive attribution has relatively high attribution especially in the last two layers.\n"
]
},
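For completeness, a heat map along these lines can be drawn with `seaborn` from the per-layer scores collected above; this is a sketch, not the exact plotting code of the notebook.

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(15, 5))
sns.heatmap(np.array(layer_attrs_start),
            xticklabels=all_tokens,
            yticklabels=list(range(1, len(layer_attrs_start) + 1)),
            linewidth=0.2, ax=ax)
plt.xlabel('Tokens')
plt.ylabel('Layers')
plt.show()
```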
@@ -1115,7 +1115,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The plot below visualizes the probability mass function (pmf) of attributions for each layer for the end position token `kinds`. From the plot we can observe that the distributions are taking bell-curved shapes with different means and variances.\n",
"The plot below visualizes the probability mass function (pmf) of attributions for each layer for the end position token `kinds`. From the plot we can observe that the distributions are approximately bell-curved shapes with different means and variances.\n",
"We can now use attribution pdfs to compute entropies in the next cell."
]
},
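A sketch of how such a pmf and the corresponding entropies can be computed, assuming `layer_attrs_end_dist` holds, for each layer, the unsummed attribution scores of the token `kinds` (collected like `layer_attrs_start_dist` above, but with `position=1`):

```python
import numpy as np
from scipy.stats import entropy

def attribution_pmf(attr_scores, bins=100):
    # Histogram one layer's attribution scores and normalize to a probability mass function.
    hist, _ = np.histogram(attr_scores, bins=bins)
    return hist / hist.sum()

layer_pmfs = [attribution_pmf(a) for a in layer_attrs_end_dist]
layer_entropies = [entropy(pmf) for pmf in layer_pmfs]
```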
@@ -1150,9 +1150,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Below we calculate and visualize attribution entropies based on Shannon entropy measure where the x-axis corresponds to the number of layers and the y-axis corresponds to the total attribution in that layer. The size of the circles for each (layer, total_attribution) pair correspond to the normalized entropy value at that point.\n",
"Below we calculate and visualize attribution entropies based on the Shannon entropy measure where the x-axis corresponds to the number of layers and the y-axis corresponds to the total attribution in that layer. The size of the circles for each (layer, total_attribution) pair correspond to the normalized entropy value at that point.\n",
"\n",
"In this particular example, we observe that the entropy doesn't change much from layer to layer, however in a general case entropy can provide us an intuition about the distributional characteristics of attributions in each layer and can be useful especially when comparing it across multiple tokens.\n"
"In this particular example, we observe that the entropy doesn't change much from layer to layer, however in a general case entropy can provide an intuition about the distributional characteristics of attributions in each layer and can be useful especially when comparing it across multiple tokens.\n"
]
},
{
@@ -1193,7 +1193,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In the Part 2 of this tutorial we will to go deeper into attention layers, heads and compare the attributions with the attention weight matrices, study and discuss related statistics."
"In the Part 2 of this tutorial we will go deeper into attention layers, heads and compare the attributions with the attention weight matrices, study and discuss related statistics."
]
}
],