
Filter gradients of IndexedSlices in tf grad sim backend, fixes #828 #829

Merged
35 commits merged into SeldonIO:master on Jan 11, 2023

Conversation

@mauicv (Contributor) commented Nov 30, 2022

What is this:

Bugfix for #828.

Implementation:

Having spoken to @RobertSamoilescu we've agreed on the following plan of action:

  • If the tensor is in a sparse format we try to convert it to dense and then to np.ndarray (see the sketch after this list). This will also need to be done for sparse representations in pytorch.
  • Finally, we filter out layers with trainable=False.
  • If a user passes a model that has some layers set to trainable=False then we raise a warning to make it clear that those layers won't be used in the computation. Note this should only happen in the model init call!
  • Check the non-trainable warning behaviour for batch norm layers...
  • Raise an informative error if a specific layer's gradient doesn't convert to numpy.
  • Instead of filtering for batch norm layers in the trainable-layers check, just list the non-trainable layers in the warning.
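For reference, a minimal sketch of the tensorflow side of this plan; the helper name _grad_to_numpy matches the method discussed later in this PR, but the body and error wording here are illustrative rather than the merged implementation:

    import numpy as np
    import tensorflow as tf

    def _grad_to_numpy(grad, name=None) -> np.ndarray:
        if isinstance(grad, tf.IndexedSlices):
            # embedding layers produce IndexedSlices gradients; densify them first
            grad = tf.convert_to_tensor(grad)
        if not hasattr(grad, 'numpy'):
            suffix = f' for the named tensor: {name}' if name else ''
            raise TypeError(f'Could not convert gradient to numpy array{suffix}. To ignore '
                            'these gradients in the similarity computation set '
                            '`trainable=False` on the corresponding layer.')
        return grad.numpy()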

More Todo:

  • Ensure correct formatting of module and variable names in docstrings: numpy -> 'numpy'
  • Ensure public methods all have correct docstrings with return types etc...
  • Add note to FAQ explaining warnings

Issues:

  1. If we fail to convert a sparse tensor to dense or the gradient tensor has no numpy attribute, the original plan was to raise a warning informing the user that some tensors' grads are not used.

    • Having thought further about this, I'm not sure. It seems like we'd end up catching a broad range of errors. In principle any model passed should have gradients computable for all layers, so it's not clear that we should silently discard errors that might arise here!?
    • If an error due to something like the above arises we can suggest the user set trainable to False on the layer in question while we figure out a fix? I'm open to doing this differently though!
    • UPDATE: having spoken to @RobertSamoilescu, we've agreed it makes sense to raise an error here if we can't convert the gradient to a numpy array. We won't attempt to catch other errors. The error will also suggest to the user that they mark the corresponding layer as non-trainable so they can still get results.
  2. tensorflow has two representations of sparsity: 1. IndexedSlices, which arises as the gradient of embedding layers as we've seen, and 2. SparseTensor. They have different methods of converting to dense representations, for instance the below needs to be used:

    if isinstance(grad, tf.SparseTensor):
        # SparseTensor gradients (if they ever occur) densify via tf.sparse
        grad = tf.sparse.to_dense(grad)
    if isinstance(grad, tf.IndexedSlices):
        # IndexedSlices (e.g. embedding layer gradients) densify via convert_to_tensor
        grad = tf.convert_to_tensor(grad)

    The issue I have is that I can't find any indication that tensorflow layers would ever have sparse gradients other than the IndexedSlices that result from tf.gather in the embedding layer, so I'm not sure the SparseTensor case is needed. You certainly can't take gradients w.r.t. SparseTensors. Doing so yields:

    ValueError: Type SparseTensor is not supported as a gradient source or gradient target.
    

    @jklaise and @RobertSamoilescu, can either of you think of any situation in which this would arise? For now, I'm going to remove that logic. If an error due to something like the above arises we can suggest the user set trainable to False on the layer in question while we figure out a fix.

  3. Checking whether or not models contain non-trainable layers is a little non-trivial. For instance, the batch_normalization layer has non-trainable weights, so if a user includes it in their model, naive solutions would trigger this warning. I've set it to filter out the batch norm layer since, according to the tensorflow docs, it is the only built-in layer with non-trainable weights by default. A user might create a custom layer which would throw warnings, but I think that's ok as they should be alerted in this case...
    UPDATE: having spoken to @RobertSamoilescu we've agreed to remove the checks for batch normalization, as this kind of approach is inherently non-exhaustive, and instead replace it with a more informative warning that lists all layers not included in the grad sim computation! (A short illustration of the batch norm gotcha follows below.)
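For concreteness, this is why a naive check trips on batch norm: the layer's moving statistics are non-trainable weights even when the layer itself is trainable, e.g.:

    from tensorflow import keras

    bn = keras.layers.BatchNormalization()
    bn.build((None, 4))
    print(bn.trainable)                                 # True
    print([w.name for w in bn.non_trainable_weights])   # moving_mean, moving_variance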

@codecov (bot) commented Nov 30, 2022

Codecov Report

Merging #829 (4bd1f20) into master (91a2a88) will increase coverage by 0.62%.
The diff coverage is 100.00%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #829      +/-   ##
==========================================
+ Coverage   84.65%   85.28%   +0.62%     
==========================================
  Files          74       74              
  Lines        8722     8759      +37     
==========================================
+ Hits         7384     7470      +86     
+ Misses       1338     1289      -49     
Flag Coverage Δ
macos-latest-3.10 ?
ubuntu-latest-3.10 85.23% <100.00%> (+0.62%) ⬆️
ubuntu-latest-3.7 85.03% <100.00%> (+0.51%) ⬆️
ubuntu-latest-3.8 85.17% <100.00%> (+0.62%) ⬆️
ubuntu-latest-3.9 85.17% <100.00%> (+0.62%) ⬆️
windows-latest-3.9 82.81% <100.00%> (+0.54%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
alibi/explainers/tests/test_simiarlity/conftest.py 100.00% <ø> (ø)
...ibi/explainers/similarity/backends/pytorch/base.py 95.74% <100.00%> (+1.80%) ⬆️
.../explainers/similarity/backends/tensorflow/base.py 100.00% <100.00%> (ø)
alibi/explainers/similarity/grad.py 95.65% <100.00%> (+0.33%) ⬆️
alibi/explainers/permutation_importance.py 97.09% <0.00%> (-0.02%) ⬇️
alibi/prototypes/protoselect.py 93.36% <0.00%> (+25.99%) ⬆️

@jklaise (Contributor) commented Dec 9, 2022

Before merging this we should answer a few questions.

As background, this error was raised when trying to calculate GradientSimilarity for a transformer model on text instances with respect to token ids (as opposed to embeddings). The error is to be expected since the layer mapping token ids to embeddings is not differentiable; however, this solution proved to be an effective workaround - just get rid of the results which correspond to non-differentiable layers.

On the other hand, one could make the claim that the original (error) is expected behaviour, as the user should be aware that the model they're passing in has non-differentiable layers. However, fixing the model so that gradients are only taken from the embedding layer onwards proved to be very difficult in the specific case of transformers models, requiring workarounds that split the main model pipeline into several steps.

Therefore, the proposed solution here seems like a good alternative. To discuss and investigate:

  1. Can there be any potential issues when simply discarding results of non-differentiable layers like in this PR?
  2. Should we raise a warning if this happens?
  3. Is there a more principled way of filtering the results other than looking at IndexedSlices (e.g. looking at the existence of gradients)?
  4. The pytorch backend should have the same behaviour if we are to implement this PR.
    @RobertSamoilescu @mauicv

@mauicv (Contributor, Author) commented Dec 13, 2022

Can there be any potential issues when simply discarding results of non-differentiable layers like in this PR? Should we raise a warning if this happens?

I can't think of any obvious ones. In terms of raising warnings, I think we'd only really be dealing with the case where the user expects the similarity to be taken with respect to some layer but doesn't realise that it isn't differentiable and so isn't included. I'm not sure it's our responsibility to correct this sort of misunderstanding, but it certainly wouldn't hurt, so we might as well. Perhaps we'd prevent some number of open issues by doing the above.

Is there a more principled way of filtering the results other than looking at IndexSlices (e.g. looking at the existence of gradients)?

Yes, thinking about it after the fact, IndexedSlices was not the correct approach. We should filter on hasattr(layer, 'grad') and getattr(layer, 'grad') is not None, or whatever the tensorflow equivalent is!
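A minimal sketch of what that filtering could look like on the torch side (illustrative only; the code actually merged is shown further down in the review):

    import numpy as np
    import torch.nn as nn

    def get_grads(model: nn.Module) -> np.ndarray:
        # parameters whose grad is None (e.g. untouched by the backward pass)
        # are simply skipped
        return np.concatenate([param.grad.detach().cpu().numpy().reshape(-1)
                               for param in model.parameters()
                               if param.grad is not None])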

The PyTorch backend should have the same behaviour if we are to implement this PR.
@RobertSamoilescu @mauicv

I'll test and see how a non-differentiable layer behaves in the PyTorch case, but agree that the two backends should match.

@mauicv (Contributor, Author) commented Jan 5, 2023

Having looked into this further, it's not actually the case that the embedding layer is non-differentiable (see this explanation). It's just that tf.keras.layers.Embedding represents its gradients as sparse tensors in the form of tf.IndexedSlices, and this is where the error arises, as these don't have a .numpy() method.

Moving forward it makes sense to either:

  1. filter for IndexedSlices or on something like hasattr(grad, 'numpy'), or
  2. find a way to convert them to the dense numpy gradient tensor...

Note: pytorch allows you to set the embedding layer gradients as sparse or dense using a kwarg on the Embedding layer; this defaults to False. But in the case that a user sets sparse=True we'll get the error:

TypeError: can't convert Sparse layout tensor to numpy. Use Tensor.dense() first.
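For context, a rough way to reproduce this on the torch side (illustrative snippet, assuming the gradients are then converted with .numpy() as in the backend):

    import torch

    emb = torch.nn.Embedding(num_embeddings=10, embedding_dim=3, sparse=True)
    emb(torch.tensor([1, 2, 3])).sum().backward()

    grad = emb.weight.grad           # sparse (COO layout) gradient
    # grad.numpy()                   # raises the TypeError above
    dense = grad.to_dense().numpy()  # densify first, then convert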

@jklaise (Contributor) commented Jan 5, 2023

Nice find. Would it be more correct to convert all sparse layers to dense and take gradients into account rather than filtering them out? On the other hand, are we risking memory blowing up?

@mauicv (Contributor, Author) commented Jan 5, 2023

Having spoken to @RobertSamoilescu we've agreed on the following plan of action:

  • If the tensor is in a sparse format we try to convert it to dense and then to np.ndarray. This will also need to be done for sparse representations in pytorch.
  • If the above fails or the tensor has no numpy attribute, we raise a warning informing the user that some tensors' grads are not used.
    • Having thought further about this, I'm not sure. It seems like we'd end up catching a broad range of errors. In principle any model passed should have gradients computable for all layers, so it's not clear that we should silently discard errors that might arise here!?
    • If an error due to something like the above arises we can suggest the user set trainable to False on the layer in question while we figure out a fix? I'm open to doing this differently if that's the preference though!
  • Finally, we filter out layers with trainable=False.
  • If a user passes a model that has some layers set to trainable=False then we raise a warning to make it clear that those layers won't be used in the computation. Note this should only happen in the model init call!
  • Check the non-trainable warning behaviour for batch norm layers...
    • Checking whether or not models contain non-trainable layers is a little non-trivial. For instance, the batch_normalization layer has non-trainable weights, so if a user includes it in their model, naive solutions would trigger this warning. I've set it to filter out the batch norm layer since, according to the tensorflow docs, it is the only built-in layer with non-trainable weights by default. A user might create a custom layer which would throw warnings, but I think that's ok as they should be alerted in this case...

@mauicv (Contributor, Author) commented Jan 5, 2023

Nice find. Would it be more correct to convert all sparse layers to dense and take gradients into account rather than filtering them out? On the other hand, are we risking memory blowing up?

w.r.t. memory usage, agreed this is potentially an issue if the vocabulary or the embedding dimension is very big. I think the correct solution is to maintain the sparse representation of the layer grads while computing the similarity, so the get_grads method would return sparse_grads and dense_grads (a rough sketch of the idea is below). This is a bit trickier though, so we decided to consider it for a later PR. In the meantime, if a user wants to reduce the memory footprint they can set trainable to False on layers...
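To illustrate the kind of follow-up change meant here (not part of this PR): the dot product between an IndexedSlices gradient and a dense gradient can be computed without ever materialising the full dense matrix, roughly:

    import tensorflow as tf

    def indexed_slices_dot(slices: tf.IndexedSlices, dense: tf.Tensor) -> tf.Tensor:
        # only gather the rows the sparse gradient actually touches
        rows = tf.gather(dense, slices.indices)
        return tf.reduce_sum(slices.values * rows)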

@mauicv (Contributor, Author) commented Jan 5, 2023

tensorflow has two representations of sparsity: 1. IndexedSlices, which arises as the gradient of embedding layers as we've seen, and 2. SparseTensor. They have different methods of converting to dense representations, for instance the below needs to be used:

if isinstance(grad, tf.SparseTensor):
  grad = tf.sparse.to_dense(grad)
if isinstance(grad, tf.IndexedSlices):
  grad = tf.convert_to_tensor(grad)

The issue I have is that I can't find any indication that tensorflow layers would ever have sparse gradients other than the IndexedSlices that result from tf.gather in the embedding layer, so I'm not sure the SparseTensor case is needed. You certainly can't take gradients w.r.t. SparseTensors. Doing so yields:

ValueError: Type SparseTensor is not supported as a gradient source or gradient target.

Can either of you think of any situation in which this would arise? For now, I'm going to remove that logic and issue a warning if the layer can't be converted to numpy as planned.


@staticmethod
def get_non_trainable(model: nn.Module) -> List[Union[int, str]]:
"""Checks that all layers in a model are trainable."""
Collaborator

Please update docstrings with something descriptive of what the function is actually doing.


@staticmethod
def get_non_trainable(model: keras.Model) -> List[Union[int, str]]:
"""Checks if all layers in a model are trainable.
Collaborator

Update docstring as for pytorch backend.


Note: batch normalization layers are ignored as they are not trainable by default.
"""
return [getattr(layer, 'name', i) for i, layer in enumerate(model.layers) if not layer.trainable]
Collaborator

This approach does not work as for the pytorch backend.
For example, consider the model:

model = keras.Sequential([
    keras.layers.Dense(10),
    keras.layers.Dense(20),
    keras.Sequential([
        keras.layers.Dense(30),
        keras.layers.Dense(40),
    ])    
])

model.build((None, 10))

Calling model.layers will return

[<keras.layers.core.dense.Dense at 0x7f54d4332430>,
 <keras.layers.core.dense.Dense at 0x7f54d43323d0>,
 <keras.engine.sequential.Sequential at 0x7f54d44dea30>]

Setting one layer inside the sequential layer to non-trainable:

model.layers[-1].layers[0].trainable = False

If we call your function, it returns an empty list, which is incorrect.

get_non_trainable(model)

If we call it with the sequential layer, then it returns the correct answer.

get_non_trainable(model.layers[-1])

This shows that the function does not work for nested layers, which are very common. Not sure if I am missing something and if the problem is addressed somewhere else?

Contributor Author

Yeah, good catch! 🤦‍♂️ It's better if I just iterate through the model.non_trainable_weights here!

Contributor

It's a similar issue for tensorflow models which have nesting, e.g. any transformer model (see the model for our text app as an example). I think we should try a recursive approach.

Contributor Author (@mauicv, Jan 6, 2023)

I think @RobertSamoilescu misspoke, no? The above is a Keras model?

Contributor Author

Fixed by iterating through model.non_trainable_weights instead of layers and added tests.
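For the record, a quick sketch of the nested-safe check (the helper name is illustrative; see the added tests for the exact behaviour):

    from tensorflow import keras

    def get_non_trainable_names(model: keras.Model):
        # iterate weights rather than layers: Keras recurses into nested
        # sub-models (e.g. an inner Sequential) when collecting these
        return [w.name for w in model.non_trainable_weights]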

@jklaise jklaise self-requested a review January 6, 2023 16:16
non_trainable_layers = self.backend.get_non_trainable(self.predictor)
if non_trainable_layers:
    layers_msg = 'The following layers are not trainable: '
    layers = ", ".join([f"'{layer}'" for layer in non_trainable_layers if layer is not None])
Contributor

if layer is not None -> are there instances where the returned list contains None elements?

Contributor Author

I think eager tensors don't have names... Although I've forgotten to handle them properly here as I'm enumerating and returning the layer index instead of the name. I'll need to check how that behaves though!

Contributor Author

I haven't been able to figure out this behaviour. EagerTensors don't have name attributes, so those layer names returned would be None. I'm leaving this in defensively so that if this behaviour does happen then an error won't occur but the user still receives an error message.

I've also added an error message for the case where no parameter in the model is trainable.

if param.grad is not None])

@staticmethod
def _grad_to_numpy(grad: torch.Tensor, name: Optional[str] = None) -> torch.Tensor:
Contributor

grad type should probably be a union here since it can be sparse?

Contributor

Ignore this, was thinking of the tensorflow backend.

Contributor Author

No, you're correct, torch also has SparseTensors.

Contributor Author

Ignore me, torch handles sparse tensors by setting a layout attribute on normal tensors!

Comment on lines 70 to 71
raise TypeError((f'Could not convert gradient to numpy array{name}. To ignore these '
'gradients in the similarity computation use `requires_grad=False`.'))
Contributor

Is it possible for the user to easily do this? I'm thinking of the situation where a layer has multiple gradient tensors, e.g. weights and bias. Here we would be reporting the tensors not the layers?

A related point might be that this could be annoying if many such layers are present, as the error would be thrown for the first one, and then once it's fixed the next one etc. Should we consider checking the numpy attribute of all tensors first (after converting to dense) and then raise one error if there are any that don't have that attribute? Perhaps this is a minor point if in realistic scenarios we don't expect there to be more than a handful of such layers?

Contributor Author

To be honest, I'm hoping there are no such layers, but it's hard to know without exhaustively checking all of them. If there is a model that does have lots of such layers, then hopefully the user can figure out the pattern that identifies those layers and programmatically disable training on them. I agree it would be good to return a list instead; this way just seemed simple for an issue that it's not clear exists... 😕 Happy to do whatever though.

Contributor

So it sounds like this check is more of a fail-safe and shouldn't really happen unless something is quite wrong? In which case I think it's fine as we at least have a readable error rather than a non-readable one.

Comment on lines 158 to 159
tf.keras.layers.Dense(output_shape),
tf.keras.layers.Dense(output_shape),
Contributor

Assume the duplication was unintentional?

Contributor

Ah I see the same for pytorch, was an extra layer needed for extra tests?

Contributor Author

Yeah, the issue was there was only one layer that could be set to non-trainable, and I felt we probably wanted some weights that were still trainable to make sure the setting was doing what we want it to. I've replaced it in what I'm currently doing though, because the model for checking for non-trainable layers is sufficiently complex now that it doesn't make sense to use the above fixture, i.e. I've added nested sequential layers etc...


return np.concatenate([_PytorchBackend._grad_to_numpy(grad=param.grad, name=name)
for name, param in model.named_parameters()
if param.grad is not None])
Contributor Author

Note that torch tensors have a grad attribute set to None by default, so we don't have to check it exists first!

@jklaise (Contributor) left a comment

Looks good! Left a few minor comments, plus the one we discussed offline here:

  • Testing this on a language model with all the transformer layers set to non-trainable results in a very large warning that runs off the screen, as we list all non-trainable parameters by name. Although the names could potentially be useful, we're likely better off changing the warning to output the number of non-trainable parameters detected and adding a few short sentences about the two possibilities (the user having set them manually, or the model having some by default, e.g. batch norm). A possible shape for that warning is sketched below.
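A possible shape for that revised warning (the wording and helper name are only a suggestion, not the final text):

    import warnings
    from tensorflow import keras

    def warn_non_trainable(model: keras.Model) -> None:
        n = len(model.non_trainable_weights)
        if n:
            warnings.warn(f'Found {n} non-trainable parameters in the model. These are '
                          'excluded from the gradient similarity computation. They may have '
                          'been set non-trainable by the user, or be non-trainable by '
                          'default (e.g. batch normalization statistics).')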

@jklaise jklaise merged commit f29bc2f into SeldonIO:master Jan 11, 2023