PyTorch 101 Part 5 - Understanding Hooks 

{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"Debugging and Visualisation in PyTorch with hooks and Tensorboard","version":"0.3.2","provenance":[],"collapsed_sections":[]},"kernelspec":{"name":"python3","display_name":"Python 3"}},"cells":[{"cell_type":"markdown","metadata":{"id":"oX5pdVjhFB7Q","colab_type":"text"},"source":["# Debugging and Visualisation in PyTorch with hooks and Tensorboard\n","\n","Hello readers. Welcome to our tutorial on debugging and visualisation in PyTorch. This is, at least for now, the last part of our PyTorch series, which started from a basic understanding of graphs and runs all the way to this tutorial. In this tutorial we will cover PyTorch hooks and how to use them to debug the backward pass and visualise activations. \n","\n","So let's get started. \n","\n","## Understanding PyTorch Hooks\n","\n","Hooks in PyTorch are severely underdocumented for the functionality they bring to the table. Think of them as the Doctor Fate of superheroes. Haven't heard of him? Exactly. That's the point. \n","\n","\n","One of the reasons I like hooks so much is that they let you do things during backpropagation. A hook is like one of those devices that many heroes leave behind in the villain's den to get all the information. You can *register* a hook on a `Tensor` or a `nn.Module`. A hook is basically a function that is executed when either `forward` or `backward` is called.\n","\n","When I say `forward`, I don't mean the `forward` of a `nn.Module` function where we describe how we will compute the output. The `forward` function here means the `forward` function of the `torch.autograd.Function` object that is the `grad_fn` of a Tensor. Does that last line seem like gibberish to you? I recommend you check out our article on computation graphs in PyTorch. If you are just being lazy, then understand that every tensor produced by an operation has a `grad_fn`, which is the `torch.autograd.Function` object that created the tensor. For example, if a tensor is created by `tens = tens1 + tens2`, its `grad_fn` is `AddBackward`. Still doesn't make sense? 
You should definitely go back and read this article.\n","\n","Notice that a `nn.Module` like a `nn.Linear` has multiple `forward` invocations. Its output is created by two operations (Y = W * X + B), addition and multiplication, and thus there will be two `forward` calls. This can mess things up, and can lead to multiple outputs. We will touch on this in more detail later in this article. \n","\n","PyTorch provides two types of hooks.\n","\n","1. The Forward Hook \n","2. The Backward Hook\n","\n","A forward hook is executed during the forward pass, while the backward hook is, well, you guessed it, executed when the `backward` function is called. Time to remind you again, these are the `forward` and `backward` functions of an `autograd.Function` object. \n","\n","### Hooks for Tensors\n","\n","\n","A hook is basically a function with a very specific signature. When we say a hook is executed, in reality, we are talking about this function being executed. \n","\n","For tensors, the signature of the backward hook is  \n","\n","```\n","hook(grad) -> Tensor or None\n","\n","```\n","There is no `forward` hook for a tensor. \n","\n","`grad` is basically the value contained in the `grad` attribute of the tensor **after** `backward` is called. The function is not supposed to modify its argument. It must either return `None` or a Tensor, which will be used in place of `grad` for further gradient computation. We provide an example below.  
"]},{"cell_type":"code","metadata":{"id":"PAvIZINMAZoS","colab_type":"code","outputId":"62bc06c0-4e77-4da8-ecc4-821045cfecea","executionInfo":{"status":"ok","timestamp":1556995317007,"user_tz":-330,"elapsed":1413,"user":{"displayName":"Ayoosh Kathuria","photoUrl":"https://lh5.googleusercontent.com/-hC2hkjwNr9s/AAAAAAAAAAI/AAAAAAAACpo/DPqp1uUqR4E/s64/photo.jpg","userId":"11533138969683019189"}},"colab":{"base_uri":"https://localhost:8080/","height":68}},"source":["import torch \n","a = torch.ones(5)\n","a.requires_grad = True\n","\n","b = 2*a\n","\n","b.retain_grad()   # Since b is non-leaf, its grad will be destroyed otherwise.\n","\n","c = b.mean()\n","\n","c.backward()\n","\n","print(a.grad, b.grad)\n","\n","# Redo the experiment, this time with a hook that prints b's grad during backward. \n","a = torch.ones(5)\n","\n","a.requires_grad = True\n","\n","b = 2*a\n","\n","b.retain_grad()\n","\n","b.register_hook(lambda x: print(x))  \n","\n","b.mean().backward() \n","\n","\n","print(a.grad, b.grad)"],"execution_count":1,"outputs":[{"output_type":"stream","text":["tensor([0.4000, 0.4000, 0.4000, 0.4000, 0.4000]) tensor([0.2000, 0.2000, 0.2000, 0.2000, 0.2000])\n","tensor([0.2000, 0.2000, 0.2000, 0.2000, 0.2000])\n","tensor([0.4000, 0.4000, 0.4000, 0.4000, 0.4000]) tensor([0.2000, 0.2000, 0.2000, 0.2000, 0.2000])\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"9efEHVv0CTDI","colab_type":"text"},"source":["There are several uses for functionality like the above. \n","\n","1. You can print the value of a gradient for debugging. You can also log it. This is especially useful with non-leaf variables, whose gradients are freed up unless you call `retain_grad` upon them. Doing the latter can lead to increased memory retention. Hooks provide a much cleaner way to aggregate these values. \n","\n","2. You can modify gradients **during** the backward pass. This is very important. 
While you can still access the `grad` variable of a tensor in a network, you can only access it after the **entire** backward pass has been done. For example, suppose the hook above returned `2 * grad` instead of just printing it. The subsequent gradient calculations, like those of `a` (or any tensor that depends upon `b` for its gradient), would then use 2 * grad(b) instead of grad(b). In contrast, had we individually updated the gradients **after** the `backward`, we'd have to multiply `b.grad` as well as `a.grad` (or in fact, the gradients of all tensors that depend on `b`) by 2.   "]},{"cell_type":"code","metadata":{"id":"4iapn8WlEFVt","colab_type":"code","outputId":"efc9f2ab-6e8d-4183-bc34-f5e8442d6eef","executionInfo":{"status":"ok","timestamp":1556995323651,"user_tz":-330,"elapsed":1076,"user":{"displayName":"Ayoosh Kathuria","photoUrl":"https://lh5.googleusercontent.com/-hC2hkjwNr9s/AAAAAAAAAAI/AAAAAAAACpo/DPqp1uUqR4E/s64/photo.jpg","userId":"11533138969683019189"}},"colab":{"base_uri":"https://localhost:8080/","height":51}},"source":["a = torch.ones(5)\n","\n","a.requires_grad = True\n","\n","b = 2*a\n","\n","b.retain_grad()\n","\n","\n","b.mean().backward() \n","\n","\n","print(a.grad, b.grad)\n","\n","b.grad *= 2\n","\n","print(a.grad, b.grad)       # a's gradient needs to be updated manually\n"],"execution_count":2,"outputs":[{"output_type":"stream","text":["tensor([0.4000, 0.4000, 0.4000, 0.4000, 0.4000]) tensor([0.2000, 0.2000, 0.2000, 0.2000, 0.2000])\n","tensor([0.4000, 0.4000, 0.4000, 0.4000, 0.4000]) tensor([0.4000, 0.4000, 0.4000, 0.4000, 0.4000])\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"VjSNDwhUATRl","colab_type":"text"},"source":["### Hooks for nn.Module objects\n","\n","For an `nn.Module` object, the signature of the hook function is \n","\n","```\n","hook(module, grad_input, grad_output) -> Tensor or None\n","```\n","\n","for the backward hook, and \n","\n","```\n","hook(module, input, output) -> None\n","\n","```\n","\n","for 
the forward hook. \n","\n","\n","Before we begin, let me make it clear that I'm not a fan of using hooks on `nn.Module` objects. First, because they force us to break abstraction. A `nn.Module` is supposed to be a modularised object representing a layer. However, a hook is tied to a `forward` and a `backward`, of which there can be an arbitrary number in a `nn.Module` object. \n","\n","For example, a `nn.Linear` involves two operations during its execution: multiplication and addition (y = w * x + b). The `input` to the forward hook is a tuple holding the positional arguments passed to the module, and `output` is the output of the forward call. \n","\n","`grad_input` is the gradient of the loss w.r.t. the inputs of the `nn.Module` object (dL/dx, dL/dw, dL/db). `grad_output` is the gradient of the loss w.r.t. the output of the `nn.Module` object. These can be pretty ambiguous because of the multiple calls inside a `nn.Module` object."]},{"cell_type":"code","metadata":{"id":"OWPIKnEFgNsk","colab_type":"code","outputId":"ad08089a-0ada-4c1b-c17d-36812a1ec236","executionInfo":{"status":"ok","timestamp":1556995328265,"user_tz":-330,"elapsed":1050,"user":{"displayName":"Ayoosh Kathuria","photoUrl":"https://lh5.googleusercontent.com/-hC2hkjwNr9s/AAAAAAAAAAI/AAAAAAAACpo/DPqp1uUqR4E/s64/photo.jpg","userId":"11533138969683019189"}},"colab":{"base_uri":"https://localhost:8080/","height":306}},"source":["import torch \n","import torch.nn as nn\n","\n","class myNet(nn.Module):\n","  def __init__(self):\n","    super().__init__()\n","    self.conv = nn.Conv2d(3,10,2, stride = 2)\n","    self.relu = nn.ReLU()\n","    self.flatten = lambda x: x.view(-1)\n","    self.fc1 = nn.Linear(160,5)\n","   \n","  \n","  def forward(self, x):\n","    x = self.relu(self.conv(x))\n","    return self.fc1(self.flatten(x))\n","  \n","\n","net = myNet()\n","\n","# Backward hook: m is the module, i is grad_input, o is grad_output\n","def hook_fn(m, i, o):\n","  print(m)\n","  print(\"------------Input 
Grad------------\")\n","\n","  for grad in i:\n","    try:\n","      print(grad.shape)\n","    except AttributeError: \n","      print(\"None found for Gradient\")\n","\n","  print(\"------------Output Grad------------\")\n","  for grad in o:  \n","    try:\n","      print(grad.shape)\n","    except AttributeError: \n","      print(\"None found for Gradient\")\n","  print(\"\\n\")\n","# Note: newer PyTorch releases deprecate register_backward_hook in favour of register_full_backward_hook\n","net.conv.register_backward_hook(hook_fn)\n","net.fc1.register_backward_hook(hook_fn)\n","inp = torch.randn(1,3,8,8)\n","out = net(inp)\n","\n","(1 - out.mean()).backward()"],"execution_count":3,"outputs":[{"output_type":"stream","text":["Linear(in_features=160, out_features=5, bias=True)\n","------------Input Grad------------\n","torch.Size([5])\n","torch.Size([5])\n","------------Output Grad------------\n","torch.Size([5])\n","\n","\n","Conv2d(3, 10, kernel_size=(2, 2), stride=(2, 2))\n","------------Input Grad------------\n","None found for Gradient\n","torch.Size([10, 3, 2, 2])\n","torch.Size([10])\n","------------Output Grad------------\n","torch.Size([1, 10, 4, 4])\n","\n","\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"pH4KMzsMKWHA","colab_type":"text"},"source":["In the code above, I use a hook to print the shapes of `grad_input` and `grad_output`. Now my knowledge about this may be limited, and please do comment if you have an alternative, but for the love of Pink Floyd, I cannot figure out what `grad_input` is supposed to represent. \n","\n","In `conv2d` you can guess by shape. The `grad_input` of size `[10, 3, 2, 2]` is the grad of the weights. That of `[10]` is maybe the `bias`. But what about the grad of the input feature maps? `None`? (Possibly because `inp` itself does not require a gradient.) Add to that the fact that `Conv2d` uses `im2col` or its cousin to flatten an image, such that convolution over the whole image can be done through matrix computation and not looping. Were there any `backward` calls there? 
So in order to get the gradient of x, I'll have to look at the `grad_output` of the layer just behind it?\n","\n","The `linear` is baffling. Both the `grad_inputs` are of size `[5]`, but shouldn't the weight matrix of the linear layer be `160 x 5`? \n","\n","Because of such confusion, I'm not a fan of using hooks with `nn.Modules`. You could do it for simple things like ReLU, but for complicated things? Not my cup of tea.\n","\n","### Proper Way of Using Hooks : An Opinion\n","\n","So, I'm all for using hooks on Tensors. Using the `named_parameters` function, I've successfully been able to accomplish all my gradient modifying / clipping needs in PyTorch. `named_parameters` allows us much more control over which gradients to tinker with. Let's just say I want to do two things. \n","\n","1. Turn the gradients of linear biases into zero while backpropagating. \n","2. Make sure that no gradient going to the conv layer is less than 0."]},{"cell_type":"markdown","metadata":{"id":"mZBKBy_UaNf_","colab_type":"text"},"source":[""]},{"cell_type":"code","metadata":{"id":"JgyOA0VM8oB4","colab_type":"code","outputId":"213948c4-aba3-4b4c-b13c-b5af2918a669","executionInfo":{"status":"ok","timestamp":1556995331948,"user_tz":-330,"elapsed":1023,"user":{"displayName":"Ayoosh Kathuria","photoUrl":"https://lh5.googleusercontent.com/-hC2hkjwNr9s/AAAAAAAAAAI/AAAAAAAACpo/DPqp1uUqR4E/s64/photo.jpg","userId":"11533138969683019189"}},"colab":{"base_uri":"https://localhost:8080/","height":51}},"source":["\n","import torch \n","import torch.nn as nn\n","\n","class myNet(nn.Module):\n","  def __init__(self):\n","    super().__init__()\n","    self.conv = nn.Conv2d(3,10,2, stride = 2)\n","    self.relu = nn.ReLU()\n","    self.flatten = lambda x: x.view(-1)\n","    self.fc1 = nn.Linear(160,5)\n","   \n","  \n","  def forward(self, x):\n","    x = self.relu(self.conv(x))\n","    x.register_hook(lambda grad : torch.clamp(grad, min = 0))     # No gradient shall be backpropagated \n","                          
                                        # to the conv layer if it is less than 0\n","      \n","    # print whether there is any negative grad\n","    x.register_hook(lambda grad: print(\"Gradients less than zero:\", bool((grad < 0).any())))  \n","    return self.fc1(self.flatten(x))\n","  \n","\n","net = myNet()\n","\n","for name, param in net.named_parameters():\n","  # if the param is from a linear and is a bias\n","  if \"fc\" in name and \"bias\" in name:\n","    # zeros_like keeps the dtype and device of the incoming gradient\n","    param.register_hook(lambda grad: torch.zeros_like(grad))\n","\n","\n","out = net(torch.randn(1,3,8,8)) \n","\n","(1 - out).mean().backward()\n","\n","print(\"The biases are\", net.fc1.bias.grad)             # bias grads are zero\n","\n"],"execution_count":4,"outputs":[{"output_type":"stream","text":["Gradients less than zero: False\n","The biases are tensor([0., 0., 0., 0., 0.])\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"l-tdfF4baPg4","colab_type":"text"},"source":["## The Forward Hook for Visualising Activations\n","\n","If you noticed, the `Tensor` doesn't have a forward hook, while `nn.Module` has one, which is executed when a `forward` is called. Notwithstanding the issues I already highlighted with attaching hooks to `nn.Module` objects, I've seen many people use forward hooks to save intermediate feature maps by storing them in a Python variable external to the hook function. Something like this. 
"]},{"cell_type":"code","metadata":{"id":"iQq-02Xja5c5","colab_type":"code","outputId":"45f2fb2b-fe73-4cfd-8f17-37490ce61663","executionInfo":{"status":"ok","timestamp":1556833990591,"user_tz":-330,"elapsed":1045,"user":{"displayName":"Ayoosh Kathuria","photoUrl":"https://lh5.googleusercontent.com/-hC2hkjwNr9s/AAAAAAAAAAI/AAAAAAAACpo/DPqp1uUqR4E/s64/photo.jpg","userId":"11533138969683019189"}},"colab":{"base_uri":"https://localhost:8080/","height":85}},"source":["visualisation = {}\n","\n","inp = torch.randn(1,3,8,8)\n","\n","# Forward hook: store each module's output, keyed by the module itself\n","def hook_fn(m, i, o):\n","  visualisation[m] = o \n","  \n","net = myNet()\n","\n","# Register the hook on every direct child of the network\n","for name, layer in net.named_children():\n","  layer.register_forward_hook(hook_fn)\n","  \n","out = net(inp) \n","\n"," \n"],"execution_count":0,"outputs":[{"output_type":"stream","text":["conv\n","relu\n","fc1\n","seq\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"zmT8W9gmcehG","colab_type":"text"},"source":["Generally, the `output` for a `nn.Module` is the output of the last `forward`. However, the above functionality can be safely replicated without the use of hooks. Simply append the intermediate outputs to a list in the `forward` function of the `nn.Module` object. However, it might be a bit problematic to print the intermediate activations of modules inside a `nn.Sequential`. 
To get past this, we need to register a hook on the children modules of the Sequential, but not on the Sequential itself.\n","\n"]},{"cell_type":"code","metadata":{"id":"LLHpJTP7fSqw","colab_type":"code","outputId":"700e2523-fb2a-4f61-c5bc-08ab50a3b574","executionInfo":{"status":"ok","timestamp":1556835086114,"user_tz":-330,"elapsed":1063,"user":{"displayName":"Ayoosh Kathuria","photoUrl":"https://lh5.googleusercontent.com/-hC2hkjwNr9s/AAAAAAAAAAI/AAAAAAAACpo/DPqp1uUqR4E/s64/photo.jpg","userId":"11533138969683019189"}},"colab":{"base_uri":"https://localhost:8080/","height":54}},"source":["import torch \n","import torch.nn as nn\n","\n","class myNet(nn.Module):\n","  def __init__(self):\n","    super().__init__()\n","    self.conv = nn.Conv2d(3,10,2, stride = 2)\n","    self.relu = nn.ReLU()\n","    self.flatten = lambda x: x.view(-1)\n","    self.fc1 = nn.Linear(160,5)\n","    self.seq = nn.Sequential(nn.Linear(5,3), nn.Linear(3,2))\n","    \n","   \n","  \n","  def forward(self, x):\n","    x = self.relu(self.conv(x))\n","    x = self.fc1(self.flatten(x))\n","    x = self.seq(x)\n","    return x\n","  \n","\n","net = myNet()\n","visualisation = {}\n","\n","def hook_fn(m, i, o):\n","  visualisation[m] = o \n","\n","def get_all_layers(net):\n","  for name, layer in net.named_children():\n","    # If it is a sequential, don't register a hook on it\n","    # but recursively register hook on all its module children\n","    if isinstance(layer, nn.Sequential):\n","      get_all_layers(layer)\n","    else:\n","      # it's not a Sequential. 
Register a hook\n","      layer.register_forward_hook(hook_fn)\n","\n","get_all_layers(net)\n","\n","  \n","out = net(torch.randn(1,3,8,8))\n","\n","# Just to check whether we got all layers\n","visualisation.keys()"],"execution_count":0,"outputs":[{"output_type":"execute_result","data":{"text/plain":["dict_keys([Conv2d(3, 10, kernel_size=(2, 2), stride=(2, 2)), ReLU(), Linear(in_features=160, out_features=5, bias=True), Linear(in_features=5, out_features=3, bias=True), Linear(in_features=3, out_features=2, bias=True)])"]},"metadata":{"tags":[]},"execution_count":105}]},{"cell_type":"markdown","metadata":{"id":"8s-C_QVze-4p","colab_type":"text"},"source":["Finally, you can turn these tensors into numpy arrays and plot the activations."]}]}
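
The last step, turning the captured activations into numpy arrays, can be sketched roughly as below. This is a minimal sketch and not the notebook's own code: the two-layer `nn.Sequential` network, the `activations` dictionary and the `make_hook` helper are hypothetical names introduced for illustration, and the hook converts each output as soon as it is captured rather than afterwards. Plotting is left out; any plotting library that accepts numpy arrays should work on the result.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the tutorial's network; any nn.Module would do.
net = nn.Sequential(nn.Conv2d(3, 10, 2, stride=2), nn.ReLU())

activations = {}  # child name -> numpy array holding that child's output


def make_hook(name):
    # Forward hooks are called as hook(module, input, output).
    def hook_fn(module, inputs, output):
        # detach() drops the autograd graph; cpu() keeps numpy() valid
        # even if the model was run on a GPU.
        activations[name] = output.detach().cpu().numpy()
    return hook_fn


# register_forward_hook returns a handle that can remove the hook later.
handles = [layer.register_forward_hook(make_hook(name))
           for name, layer in net.named_children()]

net(torch.randn(1, 3, 8, 8))

# Remove the hooks once the activations have been captured.
for handle in handles:
    handle.remove()

for name, arr in activations.items():
    print(name, type(arr).__name__, arr.shape)
```

Keying the dictionary by the child's name rather than by the module object makes the result easier to label in a plot, and removing the hooks afterwards avoids holding on to activations from later forward passes.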