Hot questions for using neural networks in autograd

Question:

I have a list of LongTensors and another list of labels. I'm new to PyTorch and RNNs, so I'm quite confused about how to implement minibatch training for the data I have. There is much more to this data, but I want to keep it simple, so that I can understand just how to implement the minibatch training part. I'm doing multiclass classification based on the final hidden state of an LSTM/GRU trained on variable-length inputs. I managed to get it working with batch size 1 (basically SGD), but I'm struggling with implementing minibatches.

Do I have to pad the sequences to the maximum length and create a new, larger tensor that holds all the elements? I mean something like this:

inputs = pad(sequences)
train = DataLoader(inputs, batch_size=batch_size, shuffle=True)
for i, data in enumerate(train):
    # do stuff using LSTM and/or GRU models

Is this the accepted way of doing minibatch training on custom data? I couldn't find any tutorials on loading custom data using DataLoader (but I assume that's the way to create batches in PyTorch?).

Another doubt I have concerns padding. The reason I'm using an LSTM/GRU is the variable length of the input. Doesn't padding defeat the purpose? Is padding necessary for minibatch training?


Answer:

Yes. The issue with minibatch training on sequences of different lengths is that you can't stack them into a single tensor.

Normally one would do something like this:

for e in range(epochs):
    sequences = shuffle(sequences)
    for mb in range(len(sequences) // mb_size):
        batch = torch.stack(sequences[mb*mb_size:(mb+1)*mb_size])

and then you apply your neural network to your batch. But because your sequences have different lengths, torch.stack will fail. So what you have to do is pad your sequences with zeros so that they all have the same length (at least within a minibatch). You have two options:

1) At the very beginning, pad all your sequences with initial zeros so that they all have the same length as the longest sequence in your data.

OR

2) On the fly, for each minibatch, before stacking the sequences together, pad all the sequences that will go into the minibatch with initial zeros so that they all have the same length as the longest sequence of the minibatch.
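
As an illustration of option 2, here is a minimal sketch that pads each minibatch on the fly with a DataLoader and a custom collate_fn built around torch.nn.utils.rnn.pad_sequence. The toy sequences, labels and batch size are made-up placeholders, and note that pad_sequence pads with trailing zeros rather than the leading zeros described above:

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# toy variable-length data (placeholders for your own LongTensors and labels)
sequences = [torch.LongTensor([1, 2, 3]), torch.LongTensor([4, 5]), torch.LongTensor([6])]
labels = [0, 1, 0]

def collate(batch):
    seqs, labs = zip(*batch)
    lengths = torch.LongTensor([len(s) for s in seqs])
    # pad every sequence in the minibatch to the length of the longest one
    padded = pad_sequence(seqs, batch_first=True)
    return padded, lengths, torch.LongTensor(labs)

train = DataLoader(list(zip(sequences, labels)), batch_size=2, shuffle=True, collate_fn=collate)

for padded, lengths, labs in train:
    # feed `padded` (optionally packed with pack_padded_sequence) to your LSTM/GRU
    pass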

Question:

In the example from the PyTorch tutorial, they use the following graph:

x = [[1, 1], [1, 1]]
y = x + 2
z = 3 * y^2
o = mean(z)  # i.e. 1/4 * z.sum()

Thus, the forward pass gets us this:

x_i = 1, y_i = 3, z_i = 27, o = 27

In code this looks like:

import torch

# define graph
x = torch.ones(2, 2, requires_grad=True)
y = x + 2
z = y * y * 3
out = z.mean()

# if we don't do this, torch will only retain gradients for leaf nodes, ie: x
y.retain_grad()
z.retain_grad()

# does a forward pass
print(z, out)

however, I get confused at the gradients computed:

# now let's run our backward prop & get gradients
out.backward()
print(f'do/dx = {x.grad[0,0]}')

which outputs:

do/dx = 4.5

By chain rule, do/dx = do/dz * dz/dy * dy/dx, where:

dy/dx = 1
dz/dy = 9/2 given x_i=1
do/dz = 1/4 given x_i=1

which means:

do/dx = 1/4 * 9/2 * 1 = 9/8

However this doesn't match the gradients returned by Torch (9/2 = 4.5). Perhaps I have a math error (something with the do/dz = 1/4 term?), or I don't understand autograd in Torch.

Any pointers?


Answer:

do/dz = 1 / 4
dz/dy = 6y = 6 * 3 = 18
dy/dx = 1

therefore, do/dx = do/dz * dz/dy * dy/dx = 1/4 * 18 * 1 = 9/2 = 4.5, which matches what Torch returns. The error in your calculation was in the dz/dy term: since z = 3y^2, dz/dy = 6y = 18 for y_i = 3, not 9/2.
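
To double-check numerically, you can rerun the snippet from the question and print all three gradients (this is just the question's code with the extra prints added):

import torch

x = torch.ones(2, 2, requires_grad=True)
y = x + 2
z = y * y * 3
out = z.mean()

# retain gradients for the non-leaf tensors so we can inspect them
y.retain_grad()
z.retain_grad()

out.backward()

print(z.grad[0, 0])  # do/dz_i = 1/4 = 0.25
print(y.grad[0, 0])  # do/dy_i = 1/4 * 6 * y_i = 4.5
print(x.grad[0, 0])  # do/dx_i = do/dy_i * dy/dx = 4.5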

Question:

I have been using PyTorch for a while now. One question I have regarding backprop is as follows:

Let's say we have a loss function for a neural network. For backprop, I have seen two different versions. One like:

optimizer.zero_grad()
autograd.backward(loss)
optimizer.step()

and the other one like:

optimizer.zero_grad()
loss.backward()
optimizer.step()

Which one should I use? Is there any difference between these two versions?

As a last question, do we need to specify requires_grad=True for the parameters of every layer of our network to make sure their gradients are computed during backprop?

For example, do I need to specify it for the layer nn.Linear(hidden_size, output_size) inside my network, or is it automatically set to True by default?


Answer:

So, just a quick answer: autograd.backward(loss) and loss.backward() are actually the same. Just look at the implementation of tensor.backward() (your loss is just a tensor): tensor.backward() internally just calls torch.autograd.backward on that tensor.
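
A tiny sketch to see the equivalence for yourself (the toy tensor here is made up):

import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
loss = (x ** 2).sum()

# the next line is interchangeable with loss.backward()
torch.autograd.backward(loss)
print(x.grad)  # tensor([4., 6.])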

As to your second question: whenever you use a prefabricated layer such as nn.Linear, or convolutions, or RNNs, etc., all of them rely on nn.Parameter attributes to store the parameter values. And, as the docs say, these default to requires_grad=True.
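
You can verify this quickly (the layer sizes here are arbitrary):

import torch.nn as nn

layer = nn.Linear(10, 5)
for name, p in layer.named_parameters():
    print(name, p.requires_grad)  # prints: weight True, bias True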

Update to a follow-up in the comments: What happens to a tensor in the backward pass depends on whether it lies on the computation path between the "output" and a leaf variable or not. If it doesn't, it is not entirely clear what backprop should compute - after all, the entire purpose is to compute gradients for parameters, i.e., leaf variables. If the tensor is on that path, all gradients will generally be computed automatically. For a more thorough discussion, see this question and this tutorial from the docs.
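
A short illustration of the leaf vs. non-leaf distinction (toy values, not from the original post):

import torch

x = torch.tensor(2.0, requires_grad=True)  # leaf tensor
y = x * 3                                  # non-leaf (intermediate) tensor
z = y ** 2
z.backward()

print(x.grad)  # tensor(36.) -- gradients of leaf tensors are kept
print(y.grad)  # None -- non-leaf gradients are freed unless y.retain_grad() was called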

Question:

I'm trying to implement a simple neural network that is supposed to learn a grayscale image. The input consists of the 2D indices of a pixel; the output should be the value of that pixel.

The net is constructed as follows: Each neuron is connected to the input (i.e. the indices of the pixel) as well as to the output of each of the previous neurons. The output is just the output of the last neuron in this sequence.

This kind of network has been very successful in learning images, as demonstrated e.g. here.

The Problem: In my implementation the loss stays between 0.2 and 0.4, depending on the number of neurons, the learning rate and the number of iterations used, which is pretty bad. Also, if you compare the output to what we've trained it on, it just looks like noise. But this is the first time I've used torch.cat within a network, so I'm not sure whether that is the culprit. Can anyone see what I'm doing wrong?

from typing import List
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.nn import Linear

class My_Net(nn.Module):
    lin: List[Linear]

    def __init__(self):
        super(My_Net, self).__init__()
        self.num_neurons = 10
        self.lin = nn.ModuleList([nn.Linear(k+2, 1) for k in range(self.num_neurons)])

    def forward(self, x):
        v = x
        recent = torch.Tensor(0)
        for k in range(self.num_neurons):
            recent = F.relu(self.lin[k](v))
            v = torch.cat([v, recent], dim=1)
        return recent

    def num_flat_features(self, x):
        size = x.size()[1:]
        num = 1
        for i in size:
            num *= i
        return num

my_net = My_Net()
print(my_net)

#define a small 3x3 image that the net is supposed to learn
my_image = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]] #represents a T-shape
my_image_flat = []    #output of the net is the value of a pixel
my_image_indices = [] #input to the net is the 2d indices of a pixel
for i in range(len(my_image)):
    for j in range(len(my_image[i])):
        my_image_flat.append(my_image[i][j])
        my_image_indices.append([i, j])

#optimization loop
for i in range(100):
    inp = torch.Tensor(my_image_indices)

    out = my_net(inp)

    target = torch.Tensor(my_image_flat)
    criterion = nn.MSELoss()
    loss = criterion(out.view(-1), target)
    print(loss)

    my_net.zero_grad()
    loss.backward()
    optimizer = optim.SGD(my_net.parameters(), lr=0.001)
    optimizer.step()

print("output of current image")
print([[my_net(torch.Tensor([[i,j]])).item() for i in range(3)] for j in range(3)])
print("output of original image")
print(my_image)

Answer:

Yes, torch.cat is backprop-able (autograd differentiates through it), so you can use it here without problems.

The problem here is that you define a new optimizer at every iteration. Instead, you should define it once, right after you define your model.

With this change the code works fine and the loss decreases continuously. I also added a printout every 5000 iterations to show the progress.

from typing import List
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.nn import Linear

class My_Net(nn.Module):
    lin: List[Linear]

    def __init__(self):
        super(My_Net, self).__init__()
        self.num_neurons = 10
        self.lin = nn.ModuleList([nn.Linear(k+2, 1) for k in range(self.num_neurons)])

    def forward(self, x):
        v = x
        recent = torch.Tensor(0)
        for k in range(self.num_neurons):
            recent = F.relu(self.lin[k](v))
            v = torch.cat([v, recent], dim=1)
        return recent

    def num_flat_features(self, x):
        size = x.size()[1:]
        num = 1
        for i in size:
            num *= i
        return num

my_net = My_Net()
print(my_net)

optimizer = optim.SGD(my_net.parameters(), lr=0.001)



#define a small 3x3 image that the net is supposed to learn
my_image = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]] #represents a T-shape
my_image_flat = []    #output of the net is the value of a pixel
my_image_indices = [] #input to the net is the 2d indices of a pixel
for i in range(len(my_image)):
    for j in range(len(my_image[i])):
        my_image_flat.append(my_image[i][j])
        my_image_indices.append([i, j])

#optimization loop
for i in range(50000):
    inp = torch.Tensor(my_image_indices)

    out = my_net(inp)

    target = torch.Tensor(my_image_flat)
    criterion = nn.MSELoss()
    loss = criterion(out.view(-1), target)
    if i % 5000 == 0:
        print('Iteration:', i, 'Loss:', loss)

    my_net.zero_grad()
    loss.backward()
    optimizer.step()
print('Iteration:', i, 'Loss:', loss)

print("output of current image")
print([[my_net(torch.Tensor([[i,j]])).item() for i in range(3)] for j in range(3)])
print("output of original image")
print(my_image)

Loss output:

Iteration: 0 Loss: tensor(0.4070)
Iteration: 5000 Loss: tensor(0.1315)
Iteration: 10000 Loss: tensor(1.00000e-02 * 8.8275)
Iteration: 15000 Loss: tensor(1.00000e-02 * 5.6190)
Iteration: 20000 Loss: tensor(1.00000e-02 * 3.2540)
Iteration: 25000 Loss: tensor(1.00000e-02 * 1.3628)
Iteration: 30000 Loss: tensor(1.00000e-03 * 4.4690)
Iteration: 35000 Loss: tensor(1.00000e-03 * 1.3582)
Iteration: 40000 Loss: tensor(1.00000e-04 * 3.4776)
Iteration: 45000 Loss: tensor(1.00000e-05 * 7.9518)
Iteration: 49999 Loss: tensor(1.00000e-05 * 1.7160)

So the loss goes down to 0.000017 in this case. I have to admit that your error surface is really ragged. Depending on the initial weights, it may also converge to a minimum of 0.17, 0.10, etc.; the local minimum it converges to can be very different. So you might try initializing your weights within a smaller range.
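
If you want to try that, here is a minimal sketch; the [-0.1, 0.1] range and the init_small helper are arbitrary choices for illustration, not part of the original code:

import torch.nn as nn

def init_small(m):
    # reinitialize only the Linear layers with small uniform weights (range chosen arbitrarily)
    if isinstance(m, nn.Linear):
        nn.init.uniform_(m.weight, -0.1, 0.1)
        nn.init.zeros_(m.bias)

my_net.apply(init_small)  # call this right after constructing My_Net(), before training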

By the way, here is the output without moving the optimizer definition out of the loop:

Iteration: 0 Loss: tensor(0.5574)
Iteration: 5000 Loss: tensor(0.5556)
Iteration: 10000 Loss: tensor(0.5556)
Iteration: 15000 Loss: tensor(0.5556)
Iteration: 20000 Loss: tensor(0.5556)
Iteration: 25000 Loss: tensor(0.5556)
Iteration: 30000 Loss: tensor(0.5556)
Iteration: 35000 Loss: tensor(0.5556)
Iteration: 40000 Loss: tensor(0.5556)
Iteration: 45000 Loss: tensor(0.5556)