## Hot questions for Using Neural networks in automatic differentiation

Question:

I'm going through the neural transfer PyTorch tutorial and am confused about the use of `retain_variables` (deprecated, now referred to as `retain_graph`). The code example shows:

    class ContentLoss(nn.Module):
        def __init__(self, target, weight):
            super(ContentLoss, self).__init__()
            self.target = target.detach() * weight
            self.weight = weight
            self.criterion = nn.MSELoss()

        def forward(self, input):
            self.loss = self.criterion(input * self.weight, self.target)
            self.output = input
            return self.output

        def backward(self, retain_variables=True):
            # Why is retain_variables True??
            self.loss.backward(retain_variables=retain_variables)
            return self.loss

From the documentation:

> `retain_graph` (bool, optional) – If `False`, the graph used to compute the grad will be freed. Note that in nearly all cases setting this option to `True` is not needed and often can be worked around in a much more efficient way. Defaults to the value of `create_graph`.

So by setting `retain_graph=True`, we're not freeing the memory allocated for the graph on the backward pass. What is the advantage of keeping this memory around, and why do we need it?

Answer:

@cleros is pretty much on point about the use of `retain_graph=True`. In essence, it retains the information needed to compute the gradients of a given variable, so that we can run a backward pass through the same graph again.

##### An illustrative example

Suppose that we have the computation graph defined by the code below. The variables `d` and `e` are the outputs, and `a` is the input. For example,

    import torch

    # Since PyTorch 0.4, tensors track gradients directly; the Variable wrapper is deprecated.
    a = torch.rand(1, 4, requires_grad=True)
    b = a**2
    c = b*2
    d = c.mean()
    e = c.sum()

When we do `d.backward()`, that is fine. After this computation, the part of the graph that calculates `d` will be freed by default to save memory. So if we then do `e.backward()`, an error message will pop up. In order to do `e.backward()`, we have to set the parameter `retain_graph` to `True` in `d.backward()`, i.e.,

    d.backward(retain_graph=True)

As long as you use `retain_graph=True` in your backward call, you can run backward again any time you want:

    d.backward(retain_graph=True)  # fine
    e.backward(retain_graph=True)  # fine
    d.backward()  # also fine
    e.backward()  # error will occur!

More useful discussion can be found here.
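The behaviour described above can be checked with a short script. This is a minimal sketch assuming a recent PyTorch release (0.4 or later); the names `grad_after_d`, `grad_after_both`, and `second_call_failed` are introduced here for illustration only. It also shows that gradients from successive backward calls accumulate in `a.grad`.

```python
import torch

a = torch.rand(1, 4, requires_grad=True)
b = a ** 2
c = b * 2
d = c.mean()
e = c.sum()

# Keep the saved buffers so the shared part of the graph can be traversed again.
d.backward(retain_graph=True)
grad_after_d = a.grad.clone()      # d = mean(2 * a^2) over 4 elements, so dd/da = a

# No retain_graph here: the buffers are freed once this call finishes.
e.backward()
grad_after_both = a.grad.clone()   # gradients accumulate: a + 4 * a = 5 * a

# A further backward pass through the freed graph raises a RuntimeError.
try:
    e.backward()
    second_call_failed = False
except RuntimeError:
    second_call_failed = True

print(grad_after_d, grad_after_both, second_call_failed)
```

Because gradients accumulate, `a.grad` should be reset (for example with `a.grad = None`) when the two gradients are needed separately.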

##### A real use case

Right now, a real use case is multi-task learning, where you have multiple losses that may be attached to different layers. Suppose that you have 2 losses, `loss1` and `loss2`, and they reside in different layers. In order to backpropagate the gradients of `loss1` and `loss2` w.r.t. the learnable weights of your network independently, you have to use `retain_graph=True` in the `backward()` call of the first back-propagated loss.

    # suppose you first back-propagate loss1, then loss2 (you can also do the reverse)
    loss1.backward(retain_graph=True)
    loss2.backward()  # now the graph is freed, and the next step of batch gradient descent is ready
    optimizer.step()  # update the network parameters
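When the two gradients only need to be accumulated before a single optimizer step, summing the losses and calling `backward()` once should give the same result without retaining the graph, which is the kind of workaround the documentation quoted earlier hints at. The sketch below assumes a hypothetical two-headed model (`trunk`, `head1`, `head2` are illustrative names, not taken from the original answer) and compares both approaches.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical model: a shared trunk feeding two task-specific heads.
trunk = nn.Linear(4, 8)
head1 = nn.Linear(8, 1)
head2 = nn.Linear(8, 1)
x = torch.randn(16, 4)
y1 = torch.randn(16, 1)
y2 = torch.randn(16, 1)
criterion = nn.MSELoss()


def compute_losses():
    """Run the forward pass and return one loss per head."""
    h = torch.relu(trunk(x))
    return criterion(head1(h), y1), criterion(head2(h), y2)


# Option 1: two backward passes over the shared graph.
loss1, loss2 = compute_losses()
loss1.backward(retain_graph=True)  # keep the trunk's buffers for the second pass
loss2.backward()                   # buffers are freed after this call
grad_two_passes = trunk.weight.grad.clone()

# Option 2: a single backward pass over the summed loss.
for module in (trunk, head1, head2):
    module.zero_grad()
loss1, loss2 = compute_losses()
(loss1 + loss2).backward()
grad_single_pass = trunk.weight.grad.clone()

print(torch.allclose(grad_two_passes, grad_single_pass, atol=1e-5))
```

Separate backward calls remain necessary when each loss's gradient has to be inspected or processed on its own before the update.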

Question:

Suppose I have an artificial neural network with 5 hidden layers. For the moment, forget about the details of the neural network model, such as biases, the activation functions used, the type of data and so on. Of course, the activation functions are differentiable.

With symbolic differentiation, the following computes the gradients of the objective function with respect to the layers' weights:

    w1_grad = T.grad(lost, [w1])
    w2_grad = T.grad(lost, [w2])
    w3_grad = T.grad(lost, [w3])
    w4_grad = T.grad(lost, [w4])
    w5_grad = T.grad(lost, [w5])
    w_output_grad = T.grad(lost, [w_output])

This way, to compute the gradients w.r.t. *w1*, the gradients w.r.t. w2, w3, w4 and w5 must first be computed. Similarly, to compute the gradients w.r.t. *w2*, the gradients w.r.t. w3, w4 and w5 must be computed first.

However, the following code also computes the gradients w.r.t. each weight matrix:

    w1_grad, w2_grad, w3_grad, w4_grad, w5_grad, w_output_grad = T.grad(
        lost, [w1, w2, w3, w4, w5, w_output])

I was wondering, is there any difference between these two methods in terms of performance? Is Theano intelligent enough to avoid re-computing the gradients using the second method? By intelligent I mean to compute w3_grad, Theano should [preferably] use the pre-computed gradients of w_output_grad, w5_grad and w4_grad instead of computing them again.

Answer:

Well, it turns out that Theano does not reuse previously computed gradients when the gradients for lower layers of a computational graph are requested through separate `T.grad` calls. However, it's **not** going to be a big deal at all, since building the gradient expressions is a once-in-a-lifetime operation unless you rebuild them on each iteration. Theano returns a symbolic expression for the derivatives as a computational graph, and from that point on you simply use the function derived by Theano to compute **numerical** values and update the weights with them. Here's a dummy example of a neural network with 3 hidden layers and an output layer:

    import time

    import numpy as np
    import theano
    import theano.tensor as T
    from theano import shared

    floatX = theano.config.floatX


    class neuralNet(object):
        def __init__(self, examples, num_features, num_classes):
            # The layer sizes are hard-coded; num_features and num_classes are not used.
            self.w = shared(np.random.random((16384, 5000)).astype(floatX), borrow=True, name='w')
            self.w2 = shared(np.random.random((5000, 3000)).astype(floatX), borrow=True, name='w2')
            self.w3 = shared(np.random.random((3000, 512)).astype(floatX), borrow=True, name='w3')
            self.w4 = shared(np.random.random((512, 40)).astype(floatX), borrow=True, name='w4')
            self.b = shared(np.ones(5000, dtype=floatX), borrow=True, name='b')
            self.b2 = shared(np.ones(3000, dtype=floatX), borrow=True, name='b2')
            self.b3 = shared(np.ones(512, dtype=floatX), borrow=True, name='b3')
            self.b4 = shared(np.ones(40, dtype=floatX), borrow=True, name='b4')
            self.x = examples

            L1 = T.nnet.sigmoid(T.dot(self.x, self.w) + self.b)
            L2 = T.nnet.sigmoid(T.dot(L1, self.w2) + self.b2)
            L3 = T.nnet.sigmoid(T.dot(L2, self.w3) + self.b3)
            L4 = T.dot(L3, self.w4) + self.b4
            self.forwardProp = T.nnet.softmax(L4)
            self.predict = T.argmax(self.forwardProp, axis=1)

        def loss(self, y):
            return -T.mean(T.log(self.forwardProp)[T.arange(y.shape[0]), y])


    x = T.matrix('x')
    y = T.ivector('y')
    nnet = neuralNet(x, 16384, 40)  # sizes match the hard-coded weight shapes
    loss = nnet.loss(y)

    # Efficient method: one T.grad call for all parameters.
    diffrentiationTime = []
    for i in range(100):
        t1 = time.time()
        gw, gw2, gw3, gw4, gb, gb2, gb3, gb4 = T.grad(
            loss, [nnet.w, nnet.w2, nnet.w3, nnet.w4, nnet.b, nnet.b2, nnet.b3, nnet.b4])
        diffrentiationTime.append(time.time() - t1)
    print('Efficient Method: Took %f seconds with std %f'
          % (np.mean(diffrentiationTime), np.std(diffrentiationTime)))

    # Inefficient method: one T.grad call per parameter.
    diffrentiationTime = []
    for i in range(100):
        t1 = time.time()
        gw = T.grad(loss, [nnet.w])
        gw2 = T.grad(loss, [nnet.w2])
        gw3 = T.grad(loss, [nnet.w3])
        gw4 = T.grad(loss, [nnet.w4])
        gb = T.grad(loss, [nnet.b])
        gb2 = T.grad(loss, [nnet.b2])
        gb3 = T.grad(loss, [nnet.b3])
        gb4 = T.grad(loss, [nnet.b4])
        diffrentiationTime.append(time.time() - t1)
    print('Inefficient Method: Took %f seconds with std %f'
          % (np.mean(diffrentiationTime), np.std(diffrentiationTime)))

This prints the following:

    Efficient Method: Took 0.061056 seconds with std 0.013217
    Inefficient Method: Took 0.305081 seconds with std 0.026024

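This shows that, within a single `T.grad` call, Theano uses a dynamic-programming approach that shares intermediate gradients across the requested weights, whereas separate calls repeat that work. Note that these timings measure the construction of the symbolic gradient expressions, not their numerical evaluation.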
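To make the difference concrete, here is a toy reverse-mode sketch in plain Python. It is not Theano's implementation, only an illustration under simplifying assumptions: a scalar chain `h_k = tanh(w_k * h_{k-1})` with the last activation as the loss, and hypothetical helper names (`grads_joint`, `grads_separate`). It counts how many backward steps each strategy needs.

```python
import math


def forward(weights, x):
    """Return the activations h_0..h_n of the chain h_k = tanh(w_k * h_{k-1})."""
    hs = [x]
    for w in weights:
        hs.append(math.tanh(w * hs[-1]))
    return hs


def grads_joint(weights, hs):
    """One backward sweep: each upstream gradient is computed once and reused."""
    steps = 0
    upstream = 1.0  # dL/dh_n, with the loss L taken to be h_n
    grads = [0.0] * len(weights)
    for k in reversed(range(len(weights))):
        local = 1.0 - hs[k + 1] ** 2               # derivative of tanh at layer k
        grads[k] = upstream * local * hs[k]        # dL/dw_k
        upstream = upstream * local * weights[k]   # gradient w.r.t. the layer's input
        steps += 1
    return grads, steps


def grads_separate(weights, hs):
    """One sweep per weight: upstream gradients of higher layers are recomputed."""
    steps = 0
    grads = []
    for target in range(len(weights)):
        upstream = 1.0
        for k in reversed(range(target, len(weights))):
            local = 1.0 - hs[k + 1] ** 2
            if k == target:
                grads.append(upstream * local * hs[k])
            upstream = upstream * local * weights[k]
            steps += 1
    return grads, steps


weights = [0.5, -1.2, 0.8, 1.5]
hs = forward(weights, x=0.3)
joint, joint_steps = grads_joint(weights, hs)
separate, separate_steps = grads_separate(weights, hs)
print(joint_steps, separate_steps)  # 4 steps versus 4 + 3 + 2 + 1 = 10 steps
```

Both strategies return the same gradients; the joint sweep grows linearly with depth while the per-weight sweeps grow quadratically, which is consistent with the timing gap reported above.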

Question:

I am trying to understand PyTorch autograd in depth; I would like to observe the gradient of a simple tensor after going through a sigmoid function as below:

    import torch
    from torch import autograd

    D = torch.arange(-8, 8, 0.1, requires_grad=True)

    with autograd.set_grad_enabled(True):
        S = D.sigmoid()
    S.backward()

My goal is to get `D.grad`, but even before accessing it I get the runtime error:

RuntimeError: grad can be implicitly created only for scalar outputs

I saw another post with a similar question, but the answer there does not apply to my case. Thanks.

Answer:

The error means you can only run `.backward` (with no arguments) on a scalar tensor, i.e. a tensor with a single element.

For example, you could do

    T = torch.sum(S)
    T.backward()

since `T` would be a scalar output.

I posted some more information on using pytorch to compute derivatives of tensors in this answer.
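Another option, sketched below under the assumption of a recent PyTorch release, is to keep `S` non-scalar and pass the upstream gradient explicitly through the `gradient` argument of `backward`. Passing a tensor of ones should give the same `D.grad` as summing first, and both should match the analytic derivative of the sigmoid, `S * (1 - S)`. The names `grad_via_sum`, `grad_via_argument`, and `expected` are illustrative.

```python
import torch

D = torch.arange(-8, 8, 0.1, requires_grad=True)

# Option 1: reduce to a scalar, then call backward with no arguments.
S = D.sigmoid()
S.sum().backward()
grad_via_sum = D.grad.clone()

# Option 2: keep S as a vector and supply the upstream gradient explicitly.
D.grad = None  # reset, since gradients accumulate across backward calls
S = D.sigmoid()
S.backward(gradient=torch.ones_like(S))
grad_via_argument = D.grad.clone()

# Analytic derivative of the sigmoid, for comparison.
expected = (S * (1 - S)).detach()

print(torch.allclose(grad_via_sum, grad_via_argument))
print(torch.allclose(grad_via_sum, expected, atol=1e-6))
```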