Filter gradients of IndexedSlices in tf grad sim backend, fixes #828 #829
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           master     #829      +/-   ##
==========================================
+ Coverage   84.65%   85.28%   +0.62%
==========================================
  Files          74       74
  Lines        8722     8759      +37
==========================================
+ Hits         7384     7470      +86
+ Misses       1338     1289      -49
```
Before merging this we should answer a few questions. As background, this error was raised when trying to compute gradient similarity for a model containing an embedding layer. On the other hand, one could make the claim that the original error is expected behaviour, as the user should be aware that the model they're passing in has non-differentiable layers. However, fixing the model so that gradients are only taken from the embedding layer onwards proved to be very difficult in the specific case at hand. Therefore, the proposed solution here seems like a good alternative. To discuss and investigate:
I can't think of any obvious ones. In terms of raising warnings, I think we'd only really be dealing with the case where the user expects the similarity to be taken with respect to some layer but doesn't realise that it's not differentiable, and so it isn't used. I'm not sure it's our responsibility to correct this sort of misunderstanding, but it certainly wouldn't hurt, so we might as well. Perhaps we'd prevent some number of open issues by doing the above.
Yes, thinking about it after the fact, I'll test and see how a non-differentiable layer behaves in the backend.
Having looked into this further, it's not actually the case that the embedding layer is non-differentiable (see this explanation). It's just that the gradients with respect to the embedding weights are returned as `IndexedSlices` rather than dense tensors. Moving forward it makes sense to either filter these gradients out or convert them to dense tensors (see the sketch below).
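A minimal sketch of the behaviour described above (the architecture and shapes here are illustrative, not from the PR): gradients taken with respect to an embedding layer's weights come back as `tf.IndexedSlices` rather than dense tensors.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=100, output_dim=8),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1),
])
x = tf.constant([[1, 2, 3]])

with tf.GradientTape() as tape:
    loss = tf.reduce_sum(model(x))

grads = tape.gradient(loss, model.trainable_weights)
print(type(grads[0]))  # tf.IndexedSlices: only the 3 looked-up rows are populated
dense = tf.convert_to_tensor(grads[0])  # densify to the full (100, 8) shape
```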
Nice find. Would it be more correct to convert all sparse layers to dense and take their gradients into account, rather than filtering them out? On the other hand, are we risking memory blowing up?
Having spoken to @RobertSamoilescu we've agreed on the following plan of action (see the updated PR description below):
w.r.t. memory usage, agreed this is potentially an issue if the vocabulary or the embedding dimension is very big. I think the correct solution to this is to maintain the sparse representation of the layer grads while computing the similarity (see the sketch below).
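A hypothetical sketch of that idea (the helper name is mine, not from the PR): compute the dot-product contribution of an `IndexedSlices` gradient against a dense gradient using only the rows the sparse grad actually touches, so the full `(vocab_size, embedding_dim)` tensor is never materialized.

```python
import tensorflow as tf

def sparse_dense_dot(slices: tf.IndexedSlices, dense: tf.Tensor) -> tf.Tensor:
    """Dot product <densify(slices), dense> without densifying `slices`."""
    rows = tf.gather(dense, slices.indices)     # (k, embedding_dim) rows touched by the grad
    return tf.reduce_sum(slices.values * rows)  # duplicate indices sum correctly, as densify would
```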
tensorflow has two representations of sparsity: 1. `IndexedSlices`, which arises as gradients of embedding layers as we've seen, and 2. `SparseTensor`. They have different methods of converting to dense representations; for instance:

```python
if isinstance(grad, tf.SparseTensor):
    grad = tf.sparse.to_dense(grad)
if isinstance(grad, tf.IndexedSlices):
    grad = tf.convert_to_tensor(grad)
```

The issue I have is that I can't find any indication that tensorflow layers would ever have sparse gradients other than the `IndexedSlices` that results from `tf.gather` in the embedding layer.

Can either of you think of any situation in which this would arise? For now, I'm going to remove that logic and issue a warning if the layer can't be converted to numpy as planned.
```python
@staticmethod
def get_non_trainable(model: nn.Module) -> List[Union[int, str]]:
    """Checks that all layers in a model are trainable."""
```
Please update docstrings with something descriptive of what the function is actually doing.
```python
@staticmethod
def get_non_trainable(model: keras.Model) -> List[Union[int, str]]:
    """Checks if all layers in a model are trainable.
```
Update docstring as for the pytorch backend.
```python
    Note: batch normalization layers are ignored as they are not trainable by default.
    """
    return [getattr(layer, 'name', i) for i, layer in enumerate(model.layers) if not layer.trainable]
```
This approach does not work, the same as for the pytorch backend. For example, consider the model:

```python
model = keras.Sequential([
    keras.layers.Dense(10),
    keras.layers.Dense(20),
    keras.Sequential([
        keras.layers.Dense(30),
        keras.layers.Dense(40),
    ])
])
model.build((None, 10))
```

Calling `model.layers` will return:

```
[<keras.layers.core.dense.Dense at 0x7f54d4332430>,
 <keras.layers.core.dense.Dense at 0x7f54d43323d0>,
 <keras.engine.sequential.Sequential at 0x7f54d44dea30>]
```

Setting one layer inside the nested sequential layer to non-trainable:

```python
model.layers[-1].layers[0].trainable = False
```

If we call your function, it returns an empty list, which is incorrect:

```python
get_non_trainable(model)
```

If we call it with the nested sequential layer, then it returns the correct answer:

```python
get_non_trainable(model.layers[-1])
```

This shows that the function does not work for nested layers, which are very common. Not sure if I am missing something and if the problem is addressed somewhere else?
Yeah, good catch! 🤦♂️ It's better if I just iterate through `model.non_trainable_weights` here!
It's a similar issue for tensorflow models which have nesting, e.g. any transformer model (see the model for our text app as an example). I think we should try a recursive approach.
I think @RobertSamoilescu misspoke, no? The above is a keras model?
Fixed by iterating through `model.non_trainable_weights` instead of `layers`, and added tests.
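A minimal sketch of the fix described above (not necessarily the exact merged code): `model.non_trainable_weights` is collected recursively by keras, so weights inside nested sub-models are picked up, unlike `model.layers`.

```python
from typing import List, Union
from tensorflow import keras

def get_non_trainable(model: keras.Model) -> List[Union[int, str]]:
    """Return the names (or positional indices) of all non-trainable weights."""
    return [getattr(w, 'name', i) for i, w in enumerate(model.non_trainable_weights)]

# Reusing the nested model from the review comment above:
model = keras.Sequential([
    keras.layers.Dense(10),
    keras.layers.Dense(20),
    keras.Sequential([keras.layers.Dense(30), keras.layers.Dense(40)]),
])
model.build((None, 10))
model.layers[-1].layers[0].trainable = False
print(get_non_trainable(model))  # now reports the nested layer's kernel and bias names
```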
alibi/explainers/similarity/grad.py (Outdated)
```python
non_trainable_layers = self.backend.get_non_trainable(self.predictor)
if non_trainable_layers:
    layers_msg = 'The following layers are not trainable: '
    layers = ", ".join([f"'{layer}'" for layer in non_trainable_layers if layer is not None])
```
`if layer is not None` -> are there instances where the returned list contains `None` elements?
I think eager tensors don't have names... although I've forgotten to handle them properly here, as I'm enumerating and returning the layer index instead of the name. I'll need to check how that behaves though!
I haven't been able to figure out this behaviour. `EagerTensor`s don't have name attributes, so those layer names returned would be `None`. I'm leaving this in defensively so that, if this behaviour does happen, an error won't occur and the user still receives the message.

I've also added an error message for the case where no parameter in the model is trainable.
```python
@staticmethod
def _grad_to_numpy(grad: torch.Tensor, name: Optional[str] = None) -> torch.Tensor:
```
`grad` type should probably be a union here since it can be sparse?
Ignore this, was thinking of the tensorflow backend.
No, you're correct, torch also has `SparseTensors`.
Ignore me, torch handles sparse tensors by setting a layout attribute on normal tensors!
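A small illustration of that point (values are illustrative): sparsity in torch is signalled by the tensor's `layout` attribute, so a single `torch.Tensor` annotation covers both cases.

```python
import torch

dense = torch.randn(3, 3)
sparse = dense.to_sparse()

print(dense.layout)                      # torch.strided
print(sparse.layout)                     # torch.sparse_coo
print(isinstance(sparse, torch.Tensor))  # True: still an ordinary torch.Tensor

# So a grad-to-numpy helper can densify defensively before calling .numpy():
if sparse.layout != torch.strided:
    sparse = sparse.to_dense()
```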
```python
raise TypeError((f'Could not convert gradient to numpy array{name}. To ignore these '
                 'gradients in the similarity computation use `requires_grad=False`.'))
```
Is it possible for the user to easily do this? I'm thinking of the situation where a layer has multiple gradient tensors, e.g. weights and bias. Here we would be reporting the tensors, not the layers?

A related point might be that this could be annoying if many such layers are present, as the error would be thrown for the first one, then once it's fixed for the next one, etc. Should we consider checking the `numpy` attribute of all tensors first (after converting to dense) and then raise one error if there are any that don't have that attribute (see the sketch below)? Perhaps this is a minor point if, in realistic scenarios, we don't expect there to be more than a handful of such layers.
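A hypothetical sketch of the collect-then-raise idea discussed above (the helper name and message wording are mine, not from the PR): gather every offending parameter first, then raise a single error listing all of them.

```python
import torch

def check_grads_convertible(model: torch.nn.Module) -> None:
    """Raise one error naming every parameter whose gradient can't become a numpy array."""
    bad = []
    for name, param in model.named_parameters():
        grad = param.grad
        if grad is None:
            continue
        if grad.layout != torch.strided:  # densify sparse grads first
            grad = grad.to_dense()
        if not hasattr(grad, 'numpy'):    # fail-safe: shouldn't happen in practice
            bad.append(name)
    if bad:
        raise TypeError(
            f'Could not convert gradients to numpy arrays for: {", ".join(bad)}. '
            'To ignore these gradients in the similarity computation use '
            '`requires_grad=False`.'
        )
```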
To be honest, I'm hoping there are no such layers, but it's hard to know without exhaustively checking all of them. If there is a model that does have lots of such layers, then hopefully the user can figure out the pattern that identifies those layers and programmatically disable training on them. I agree it would be good to return a list instead; this way just seemed simple for an issue that it's not clear exists... 😕 Happy to do whatever though.
So it sounds like this check is more of a fail-safe and shouldn't really happen unless something is quite wrong? In which case I think it's fine as we at least have a readable error rather than a non-readable one.
```python
tf.keras.layers.Dense(output_shape),
tf.keras.layers.Dense(output_shape),
```
Assume the duplication was unintentional?
Ah, I see the same for pytorch, was an extra layer needed for extra tests?
Yeah, the issue was there was only one layer that could be set to non-trainable, and I felt we probably wanted some weights that were still trainable to make sure the setting was doing what we want it to. I've replaced it in what I'm currently doing, though, because the model for checking for non-trainable layers is sufficiently complex now that it doesn't make sense to use the above fixture, i.e. I've added nested sequential layers etc.
```python
return np.concatenate([_PytorchBackend._grad_to_numpy(grad=param.grad, name=name)
                       for name, param in model.named_parameters()
                       if param.grad is not None])
```
Note that torch tensors have a `grad` attribute set to `None` by default, so we don't have to check it exists first!
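A quick illustration of that default (values illustrative):

```python
import torch

p = torch.nn.Parameter(torch.randn(3))
print(p.grad)        # None: no backward pass has run yet, but the attribute exists
loss = (p ** 2).sum()
loss.backward()
print(p.grad.shape)  # torch.Size([3])
```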
Looks good! Left a few minor comments, plus the one we discussed offline here:

- Testing this on a language model with all the transformer layers set to non-trainable results in a very large warning that runs off the screen, as we list all non-trainable parameters by name. Although the names could potentially be useful, we're likely better off changing the warning to output the number of non-trainable parameters detected and adding a few short sentences about the two possibilities (the user having set them manually, or the model having some by default, e.g. batch norm). A sketch of this is below.
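A hypothetical sketch of that compact warning (the function name and wording are mine, not the merged code):

```python
import warnings

def warn_non_trainable(n_non_trainable: int) -> None:
    """Emit one short warning instead of listing every parameter by name."""
    if n_non_trainable:
        warnings.warn(
            f'Found {n_non_trainable} non-trainable parameters in the model. These may '
            'have been set non-trainable manually (e.g. `trainable=False` or '
            '`requires_grad=False`) or be non-trainable by default (e.g. batch '
            'normalization statistics). They will not be used in the gradient '
            'similarity computation.'
        )
```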
What is this:
Bugfix for #828.

Implementation:
Having spoken to @RobertSamoilescu we've agreed on the following plan of action:

1. Convert sparse gradient representations (e.g. `IndexedSlices`) to dense before converting to `np.ndarray`. This will also need to be done for sparse representations in pytorch.
2. If any layers are set to `trainable=False` then we raise a warning to make it clear that those layers won't be used in the computation. Note this should only happen in the model init call!

More Todo:
Check that gradients can be converted to numpy (i.e. that they have a `'numpy'` attribute) and raise an informative error if not.

Issues:
If we fail to convert a sparse tensor to dense, or the gradient tensor has no numpy attribute, the original plan was to raise a warning informing the user that some tensors' grads are not used.
tensorflow has two representations of sparsity: 1. `IndexedSlices`, which arises as gradients of embedding layers as we've seen, and 2. `SparseTensor`. They have different methods of converting to dense representations; for instance, `tf.sparse.to_dense` is needed for `SparseTensor` and `tf.convert_to_tensor` for `IndexedSlices` (see the snippet in the review comment above).

The issue I have is that I can't find any indication that tensorflow layers would ever have sparse gradients other than the `IndexedSlices` that results from `tf.gather` in the embedding layer. So I'm not sure the above case is needed. You certainly can't take gradients w.r.t. `SparseTensors`; doing so yields an error.

@jklaise and @RobertSamoilescu, can either of you think of any situation in which this would arise? For now, I'm going to remove that logic. If an error due to something like the above arises, we can suggest the user set trainable to false on the layer in question while we figure out a fix.
Checking whether or not models contain non-trainable layers is a little non-trivial. For instance, the `batch_normalization` layer has non-trainable weights, so if a user includes it in their model, naive solutions would trigger this warning. I've set it to filter out the batch norm layer as, according to the tensorflow docs, this is the only layer that is non-trainable by default. A user might create a custom layer which would throw warnings, but I think that's ok as they should be alerted in this case...

UPDATE: having spoken to @RobertSamoilescu we've agreed to remove the checks for batch normalization, as this kind of approach is inherently non-exhaustive, and instead replace them with a more informative warning that lists all layers not included in the grad sim computation!