Bonus. Behind PyTorch Autograd¶
When we implement neural networks in PyTorch, we only need to write the forward pass; the backward pass is handled automatically by autograd (PyTorch's automatic differentiation engine).
As mentioned during lecture, the autograd engine records a graph, called the autograd graph, based on the operations performed during forward propagation (but in reverse order). This graph is then traversed during the call to backward().
Have you ever wondered what's inside the autograd graph?
In fact, each torch.Tensor has an attribute called grad_fn, which is a torch.autograd.graph.Node object (i.e. a node in the autograd graph). In this bonus lab, we will explore what's inside such a node and play around with it.
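For example, here is a minimal sketch (not part of the lab code) of what grad_fn looks like on a tensor produced by a differentiable operation; the exact node name printed (e.g. SumBackward0) may vary slightly across PyTorch versions:
import torch  # also imported in the cell below; repeated here so the snippet is self-contained
w = torch.tensor([1.0, 2.0], requires_grad=True)
z = (3 * w).sum()
print(z.grad_fn)   # a torch.autograd.graph.Node, e.g. <SumBackward0 object at 0x...>
z.backward()
print(w.grad)      # tensor([3., 3.])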
Note: You'll need the graphviz package for this task. If you are using a local installation of Python, see this Stack Overflow post for instructions (typically this amounts to pip install graphviz plus the system-level Graphviz binaries).
Your task: The following 4 tasks guide you through exploring the autograd graph. Complete the tasks in sequence.
Submission: Submit your writeup for Tasks 1 - 2 and your implementation for Tasks 3 - 4 before/during the tutorial for extra EXP.
If no one solves all 4 tasks, I'll still give out bonus EXP to those who solve at least 3.
from graphviz import Digraph
import torch
from torch import nn
Task 1: A Simple Hook for the Backward Pass¶
Notice that the torch.autograd.graph.Node class contains a method called register_hook, which allows us to hook into the backward propagation process!
Let's create a very simple linear model and call backward() on it. We also register a hook on the autograd node attached to the layer's output, so that our hook function gets called during backward propagation.
Question: What do the input and output parameters of hook_fn represent in general? See if you can derive expressions that predict the arguments input and output in terms of $\boldsymbol{x}$, $\boldsymbol{W}$, $\boldsymbol{y}$ and $\boldsymbol{\hat{y}}$.
There is no coding involved in this task. However, you can add print statements to verify your claims.
def hook_fn(input, output):
    print(f"Hooked! Input: {input}, Output: {output}")
class VerySimpleNet(nn.Module):
    def __init__(self):
        super(VerySimpleNet, self).__init__()
        self.linear = nn.Linear(2, 2, bias=False)
        self.linear.weight = torch.nn.Parameter(torch.Tensor([[0.15, 0.20], [0.25, 0.30]]))

    def forward(self, x):
        x = self.linear(x)
        # We register a hook on the autograd node attached to x.
        x.grad_fn.register_hook(hook_fn)
        return x
model = VerySimpleNet()
loss = nn.MSELoss()
# Forward propagation
x = torch.Tensor([[1, 2]])
y = torch.Tensor([[1, 2]])
y_pred = model(x)
output = loss(y, y_pred)
# Backward propagation
output.backward()
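Aside (optional): to sanity-check your derivation, you can also register a hook directly on the prediction tensor with torch.Tensor.register_hook, which receives the gradient of the loss with respect to that tensor. A minimal sketch, reusing model, loss, x and y from the cell above:
# Optional sanity check (not required for the task): Tensor.register_hook receives
# dLoss/dy_hat, which you can compare against your hand-derived expression.
y_pred = model(x)  # a fresh forward pass builds a new autograd graph (and re-registers hook_fn)
y_pred.register_hook(lambda grad: print(f"dLoss/dy_hat: {grad}"))
loss(y, y_pred).backward()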
Task 2: Tracing Backward Propagation¶
Now, let's verify that backward() indeed calculates the gradients in reverse order, from the last layer back to the first!
Consider the following network TwoLayerNet. By adding suitable hook functions, verify that backward() calculates the gradients in reverse order. In addition, verify that the calculated gradients are passed sequentially along the way by printing the sums of the input/output tensors (use torch.sum). One possible way to label your hooks is sketched below.
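The following helper is only a sketch (an assumption about how you might structure your hooks, not the required solution); it builds hooks that carry a label, so the order of the printed lines reveals the traversal order during backward():
# A possible labelled-hook factory (optional). Each hook prints its label together
# with the sums of the gradient tensors it receives.
def make_hook(name):
    def hook(inputs, outputs):
        in_sums = [torch.sum(t).item() for t in inputs if t is not None]
        out_sums = [torch.sum(t).item() for t in outputs if t is not None]
        print(f"{name}: sum(inputs)={in_sums}, sum(outputs)={out_sums}")
    return hook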
# TODO: Define suitable hook functions.

class TwoLayerNet(nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        self.linear1 = nn.Linear(D_in, H, bias=False)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(H, D_out, bias=False)

    def forward(self, x):
        # TODO: Register the hook functions if necessary.
        x = self.linear1(x)
        x = self.relu(x)
        x = self.linear2(x)
        return x
model = TwoLayerNet(1000, 100, 10)
loss = nn.MSELoss()
# Forward propagation
x = torch.randn(64, 1000) # Random input data
y = torch.randn(64, 10) # Random target data
y_pred = model(x)
output = loss(y, y_pred)
# Backward propagation
output.backward()
Task 3: Creating the Autograd Graph¶
Now we understand that PyTorch autograd traverses the autograd graph node by node and calculates the gradients. But wait... where is the graph?
As we all know, a graph consists of nodes and edges. The autograd graph must therefore store edges between nodes, which tell the engine which gradients to calculate next.
The edges are hidden somewhere within the torch.autograd.graph.Node class. Read the documentation and find out where the edges are stored.
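If you prefer to explore interactively in addition to reading the documentation, you can list a node's public attributes; a small sketch, assuming the output variable from the Task 2 cell above is still defined:
# Interactive exploration (optional): one of the listed attributes stores the
# outgoing edges you are looking for.
some_node = output.grad_fn
print([name for name in dir(some_node) if not name.startswith('_')])
help(torch.autograd.graph.Node)  # full documentation of the Node interface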
To demonstrate your understanding, we have already written boilerplate code that generates a visualization of the autograd graph, but the critical logic (finding the edges of the autograd graph) has not been implemented yet. Complete the function add_nodes so that it enumerates all neighbours of the current node in the autograd graph and traverses those neighbours recursively.
def visualize_autograd_graph(loss):
    """ Produces a visualization of the PyTorch autograd graph. """
    node_attr = dict(style='filled',
                     shape='box',
                     align='left',
                     fontsize='12',
                     ranksep='0.1',
                     height='0.2')
    dot = Digraph(node_attr=node_attr, graph_attr=dict(size="12,12"))
    visited = set()

    # Helper functions
    def size_to_str(size):
        return '(' + (', ').join(['%d' % v for v in size]) + ')'

    def create_parameter_node(var, size):
        dot.node(str(id(var)), size_to_str(size), fillcolor='lightblue')

    def create_func_node(var):
        dot.node(str(id(var)), str(type(var).__name__))

    def create_edge(u, v):
        dot.edge(str(id(u)), str(id(v)))

    def add_nodes(var):
        if var in visited:
            return
        visited.add(var)
        if hasattr(var, 'variable'):
            # Nodes with a `variable` attribute wrap leaf tensors (e.g. parameters);
            # draw them as lightblue boxes labelled with their size.
            create_parameter_node(var, var.variable.size())
        else:
            # All other nodes are backward functions of forward operations.
            create_func_node(var)
        # TODO: Complete the implementation.
        pass

    add_nodes(loss.grad_fn)
    return dot
# Sample test case
# Sample neural network with 2 linear layers
class TwoLayerNet(nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        self.linear1 = nn.Linear(D_in, H)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(H, D_out)

    def forward(self, x):
        x = self.linear1(x)
        x = self.relu(x)
        x = self.linear2(x)
        return x
model = TwoLayerNet(1000, 100, 10)
loss = nn.MSELoss()
# Forward propagation
x = torch.randn(64, 1000) # Random input data
y = torch.randn(64, 10) # Random target data
y_pred = model(x)
output = loss(y, y_pred)
# Visualize the gradient graph
dot = visualize_autograd_graph(output)
assert len(list(filter(lambda x: 'Backward' in x, dot.body))) == 6, \
    "Incorrect number of internal nodes"
assert len(list(filter(lambda x: 'lightblue' in x, dot.body))) == 4, \
    "Incorrect number of tensors"
assert len(list(filter(lambda x: '->' in x, dot.body))) == 9, \
    "Incorrect number of edges"
print("Sample test case passed, congratulations!")
# Run this to visualize the autograd graph
dot
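If you are running this outside a notebook, evaluating dot on its own will not display anything; you can instead write the rendered graph to a file. A small sketch, assuming the Graphviz binaries are installed on your system:
# Optional: save the visualization to disk instead of displaying it inline.
dot.render('autograd_graph', format='png', cleanup=True)  # writes autograd_graph.png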
Task 4: Visited Memory?¶
Notice that in the implementation of add_nodes above, we used the visited set to avoid visiting an identical object var multiple times. We could delete those lines and the resulting implementation would still pass the sample test case above.
However, this visited memory is necessary for correctness. Your task is to demonstrate this.
Task: Create a (minimal) neural network for which visualize_autograd_graph yields different visualizations with and without the visited memory.
class HackNet(nn.Module):
    def __init__(self):
        super(HackNet, self).__init__()
        # TODO: Initialize your neural network here
        self.linear = nn.Linear(10, 10)

    def forward(self, x):
        # TODO: Implement the forward function
        x = self.linear(x)
        return x
# TODO: You may change the following code,
# e.g. editing the dimensions of the input/target.
model = HackNet()
loss = nn.MSELoss()
x = torch.randn(64, 10)
y = torch.randn(64, 10)
y_pred = model(x)
output = loss(y, y_pred)
visualize_autograd_graph(output)
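When comparing the two variants (with and without the visited memory), it can be easier to diff the generated DOT source than to compare the rendered pictures by eye; for example:
# Optional: compare the graphs textually rather than visually.
dot_hack = visualize_autograd_graph(output)
print(len(dot_hack.body))  # number of node/edge statements in the DOT source
print(dot_hack.source)     # full DOT source, convenient for manual diffing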