Hot questions on neural network initialization

Question:

Suppose I have a neural network where I use a normal distribution initialization and I want to use the mean value which is used for initialization as a parameter of the network.

I have a small example:

import torch
parameter_vector = torch.tensor(range(10), dtype=torch.float, requires_grad=True)
sigma = torch.ones(parameter_vector.size(0), dtype=torch.float)*0.1
init_result = torch.normal(parameter_vector, sigma)
print('requires_grad:', init_result.requires_grad)
print('result:       ', init_result)

This results in:

requires_grad: True
result:        tensor([ 0.1026,  0.9183,  1.9586,  3.1778,  4.0538,  4.8056,  5.9561,
         6.9501,  7.7653,  8.9583])

So the requires_grad flag was obviously inherited from the mean tensor, i.e. parameter_vector.

But does this automatically mean that parameter_vector will be updated through backward() in a larger network where init_result affects the end result?

Especially since normal() does not really seem like a normal operation, because it involves randomness.


Answer:

Thanks to @iacolippo (see comments below the question) the problem is solved now. I just wanted to supplement this by posting the code I am using now, so it may help someone else.

As presumed in the question, and also stated by @iacolippo, the code posted in the question is not backpropagable:

import torch
parameter_vector = torch.tensor(range(5), dtype=torch.float, requires_grad=True)
print('- initial parameter weights:', parameter_vector)
sigma = torch.ones(parameter_vector.size(0), dtype=torch.float)*0.1
init_result = torch.normal(parameter_vector, sigma)
print('- normal init result requires_grad:', init_result.requires_grad)
print('- normal init vector', init_result)
#print('result:       ', init_result)
sum_result = init_result.sum()
sum_result.backward()
print('- summed dummy-loss:', sum_result)
optimizer = torch.optim.SGD([parameter_vector], lr = 0.01, momentum=0.9)
optimizer.step()
print()
print('- parameter weights after update:', parameter_vector)

Out:

- initial parameter weights: tensor([0., 1., 2., 3., 4.], requires_grad=True)
- normal init result requires_grad: True
- normal init vector tensor([-0.0909,  1.1136,  2.1143,  2.8838,  3.9340], grad_fn=<NormalBackward3>)
- summed dummy-loss: tensor(9.9548, grad_fn=<SumBackward0>)

- parameter weights after update: tensor([0., 1., 2., 3., 4.], requires_grad=True)

As you can see, calling backward() does not raise an error (see the issue linked in the comments above), but the parameters don't get updated by the SGD step either.


Working Example 1

One solution is to use the reparameterization formula/trick given here: https://stats.stackexchange.com/a/342815/133099

x = μ + σ · ε,  with ε ~ N(0, 1)

To achieve this:

sigma = torch.ones(parameter_vector.size(0), dtype=torch.float)*0.1
init_result = torch.normal(parameter_vector, sigma)

Changes to:

dim = parameter_vector.size(0)
sigma = 0.1
init_result = parameter_vector + sigma*torch.normal(torch.zeros(dim), torch.ones(dim))

After changing these lines the code becomes backpropagable, and the parameter vector gets updated after calling backward() and taking an SGD step.

Output with changed lines:

- initial parameter weights: tensor([0., 1., 2., 3., 4.], requires_grad=True)
- normal init result requires_grad: True
- normal init vector tensor([-0.1802,  0.9261,  1.9482,  3.0817,  3.9773], grad_fn=<ThAddBackward>)
- summed dummy-loss: tensor(9.7532, grad_fn=<SumBackward0>)

- parameter weights after update: tensor([-0.0100,  0.9900,  1.9900,  2.9900,  3.9900], requires_grad=True)
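A self-contained sketch of why this trick works: with x = μ + σ · ε the gradient of x with respect to μ is exactly 1, so a summed dummy loss produces a gradient of all ones on the mean vector. The variable names here are illustrative, not from the code above:

```python
import torch

mu = torch.zeros(5, requires_grad=True)  # the mean acts as the trainable parameter
sigma = 0.1
eps = torch.normal(torch.zeros(5), torch.ones(5))  # noise, detached from the graph

x = mu + sigma * eps  # reparameterized sample, differentiable w.r.t. mu
x.sum().backward()

print(x.requires_grad)  # True
print(mu.grad)          # all ones: d(mu + sigma*eps)/d(mu) = 1
```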

Working Example 2

Another way would be using torch.distributions (Documentation Link).

To do so, the respective lines in the code above have to be replaced by:

i = torch.ones(parameter_vector.size(0))
sigma = 0.1
m = torch.distributions.Normal(parameter_vector, sigma*i)
init_result = m.rsample()

Output with changed lines:

- initial parameter weights: tensor([0., 1., 2., 3., 4.], requires_grad=True)
- normal init result requires_grad: True
- normal init vector tensor([-0.0767,  0.9971,  2.0448,  2.9408,  4.1321], grad_fn=<ThAddBackward>)
- summed dummy-loss: tensor(10.0381, grad_fn=<SumBackward0>)

- parameter weights after update: tensor([-0.0100,  0.9900,  1.9900,  2.9900,  3.9900], requires_grad=True)

As the output above shows, using torch.distributions also yields backpropagable code, where the parameter vector gets updated after calling backward() and taking an SGD step.
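The key detail is rsample() rather than sample(): sample() detaches the result from the graph, while rsample() applies the reparameterization trick internally and keeps gradients flowing back to the mean. A small sketch (variable names our own):

```python
import torch

mu = torch.zeros(5, requires_grad=True)
m = torch.distributions.Normal(mu, 0.1)

s = m.sample()   # detached: no gradient path back to mu
r = m.rsample()  # reparameterized: gradient path to mu is kept

print(s.requires_grad, r.requires_grad)  # False True
```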

I hope this is helpful for someone.

Question:

He / MSRA initialization, from Delving Deep into Rectifiers, seems to be a recommended weight initialization when using ReLUs.

Is there a built-in way to use this in TensorFlow? (similar to: How to do Xavier initialization on TensorFlow)?


Answer:

tf.contrib.layers.variance_scaling_initializer(dtype=tf.float32)

This will give you He / MSRA initialization. The documentation states that the default arguments for tf.contrib.layers.variance_scaling_initializer correspond to He initialization, and that changing the arguments can yield Xavier initialization (this is what is done in TF's internal implementation of Xavier initialization).

Example usage:

W1 = tf.get_variable('W1', shape=[784, 256],
       initializer=tf.contrib.layers.variance_scaling_initializer())

or

initializer = tf.contrib.layers.variance_scaling_initializer()
W1 = tf.Variable(initializer([784,256]))
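Under the hood, He initialization for a ReLU layer simply draws weights from a zero-mean normal with variance 2/fan_in (TF's default additionally truncates the normal; the plain version below is a simplification). A minimal NumPy sketch of that formula; the helper name he_normal is our own, not a TensorFlow API:

```python
import numpy as np

def he_normal(fan_in, fan_out, rng=None):
    # He et al.: std = sqrt(2 / fan_in), suited for ReLU activations
    rng = rng or np.random.default_rng(0)
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = he_normal(784, 256)
print(W.shape)  # (784, 256); sample std is close to sqrt(2/784) ≈ 0.0505
```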

Question:

I'm building a convolutional neural network for classifying data into different categories. The input data is of shape (30000, 6, 15, 1): the data has 30000 samples, 15 predictors, and 6 possible categories.

My model is defined as follows.

x = tf.placeholder("float", [None, 6,15,1])
y = tf.placeholder("float", [None, n_classes])

#Define Weights
weights = {
    'wc1': tf.get_variable('W0', shape=(3,3,1,8), initializer=tf.contrib.layers.xavier_initializer()), 
    'wc2': tf.get_variable('W1', shape=(3,3,32,12), initializer=tf.contrib.layers.xavier_initializer()), 
    'wc3': tf.get_variable('W2', shape=(3,3,64,16), initializer=tf.contrib.layers.xavier_initializer()), 
    'wc4': tf.get_variable('W3', shape=(3,3,64,20), initializer=tf.contrib.layers.xavier_initializer()),
    'wd1': tf.get_variable('W4', shape=(4*4*15,15), initializer=tf.contrib.layers.xavier_initializer()), 
    'out': tf.get_variable('W6', shape=(15,n_classes), initializer=tf.contrib.layers.xavier_initializer()), 
}

biases = {
    'bc1': tf.get_variable('B0', shape=(8), initializer=tf.contrib.layers.xavier_initializer()),
    'bc2': tf.get_variable('B1', shape=(12), initializer=tf.contrib.layers.xavier_initializer()),
    'bc3': tf.get_variable('B2', shape=(16), initializer=tf.contrib.layers.xavier_initializer()),
    'bc4': tf.get_variable('B3', shape=(20), initializer=tf.contrib.layers.xavier_initializer()),
    'bd1': tf.get_variable('B4', shape=(15), initializer=tf.contrib.layers.xavier_initializer()),
    'out': tf.get_variable('B5', shape=(6), initializer=tf.contrib.layers.xavier_initializer()),
}

#Define convolutional layer
def conv2d(x, W, b, strides=1, reuse=True):
    # Conv2D wrapper, with bias and relu activation
    x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME')
    x = tf.nn.bias_add(x, b)
    return tf.nn.relu(x)

#Define Maxpool layer
def maxpool2d(x, k=2):
    return tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1],padding='SAME')

#Define a convolutional neural network function
def conv_net(x, weights, biases):  
    conv1 = conv2d(x, weights['wc1'], biases['bc1'])
    conv1 = maxpool2d(conv1, k=2)

    conv2 = conv2d(conv1, weights['wc2'], biases['bc2'])
    conv2 = maxpool2d(conv2, k=2)

    conv3 = conv2d(conv2, weights['wc3'], biases['bc3'])
    conv3 = maxpool2d(conv3, k=2)

    conv4 = conv2d(conv3, weights['wc4'], biases['bc4'])
    conv4 = maxpool2d(conv4, k=2)


    # Fully connected layer
    # Reshape conv2 output to fit fully connected layer input
    fc1 = tf.reshape(conv4, [-1, weights['wd1'].get_shape().as_list()[0]])
    fc1 = tf.add(tf.matmul(fc1, weights['wd1']), biases['bd1'])
    fc1 = tf.nn.relu(fc1)
    # Output, class prediction 
    out = tf.add(tf.matmul(fc1, weights['out']), biases['out'])
    return out

I'm getting the error : ValueError: Dimensions must be equal, but are 8 and 32 for 'Conv2D_1' (op: 'Conv2D') with input shapes: [?,8,3,8], [3,3,32,12].

When I execute :

pred = conv_net(x, weights, biases)

I went through multiple conv2D models but most of them are for image classification and I might be missing something here that I'm not able to identify. Please help.


Answer:

The number of input channels for the weights wc2, wc3, and wc4 needs to match the number of output channels of the preceding layer. Keeping the number of output channels as you have it, they would be changed to:

    'wc1': tf.get_variable('W0', shape=(3,3,1,8), initializer=tf.contrib.layers.xavier_initializer()),
    'wc2': tf.get_variable('W1', shape=(3,3,8,12), initializer=tf.contrib.layers.xavier_initializer()),
    'wc3': tf.get_variable('W2', shape=(3,3,12,16), initializer=tf.contrib.layers.xavier_initializer()),
    'wc4': tf.get_variable('W3', shape=(3,3,16,20), initializer=tf.contrib.layers.xavier_initializer()),
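This rule can be checked mechanically: each conv weight has shape (kh, kw, in_channels, out_channels), and in_channels of every layer must equal out_channels of the layer before it. A small sketch of such a consistency check (the helper name is our own):

```python
def check_channel_chain(shapes):
    """shapes: list of conv weight shapes (kh, kw, in_ch, out_ch)."""
    for prev, cur in zip(shapes, shapes[1:]):
        if prev[3] != cur[2]:
            return False  # mismatch, like the 8-vs-32 error in the question
    return True

# corrected shapes from the answer above
print(check_channel_chain([(3,3,1,8), (3,3,8,12), (3,3,12,16), (3,3,16,20)]))   # True
# original shapes from the question
print(check_channel_chain([(3,3,1,8), (3,3,32,12), (3,3,64,16), (3,3,64,20)]))  # False
```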

Question:

I initialized the network model before the k-fold starts.

Does that mean that the model trains for the first fold and this model with the trained weights is taken for the second fold and so on? What if the last fold is bad and the whole model is bad?


Answer:

It depends on what you mean by "initialized the network", you should show some snippet of code to make people understand your problem.

In principle, k-fold cross validation is a technique used to get a better estimate of the performance of a model. The concept is simple: without k-fold you just split the dataset into train/test and use the unseen samples in the test set to estimate performance/error. But data is usually not perfect, it's a bit dirty, so it can happen that "bad" samples end up in the test set, and when you use them to estimate the performance of a model you'll get a value that does not represent the real one.

To reduce the error in this estimate, you split the dataset into k equally distributed folds, then train a NEW model k times (so the weights are initialized from scratch each time), each time testing on one of the k folds and training on the remaining samples of the dataset.

By doing so, you'll have k different estimates of the error/performance of your model.

If you want a single value as the measure, you just average the results. Of course you can use the results however you like: you can SELECT the best model, average the weights of the k models, average the "top n" model weights, and so on.

So, answering your question: NO, you don't keep your weights. Among the k models you train, it can happen that one of them is "bad", but you are using k-fold just to VALIDATE your model, not to train it better! After validation you can decide what to do. You are looking for a measure of how "good" your model is, and by doing this you are just more confident that your result is near the real value.

If you want to use the dataset to reduce other types of error (like overfitting) you should look into ensemble methods.
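The loop described above can be sketched in plain Python; the model-building line is a placeholder for whatever framework call creates a fresh, freshly initialized model:

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Shuffle indices and deal them into k disjoint, roughly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_fold_splits(n_samples=10, k=5)
for test_fold in folds:
    train_idx = [i for i in range(10) if i not in test_fold]
    # build a NEW model here each iteration (weights initialized from scratch),
    # train on train_idx, evaluate on test_fold, and collect the score
    print(sorted(test_fold), len(train_idx))
```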

I hope this was helpful.

Question:

I am trying to initialize weights of multi-layer Neural Network using the following code in tensorflow.

def initialize_parameters(layers_dims):
    parameters = {}
    tf.set_random_seed(1)                  

    L = len(layers_dims)

    for l in range(1, L):
        parameters['W' + str(l)] = tf.get_variable("parameters['W' + str(l)]", [layers_dims[l],layers_dims[l-1]], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
        parameters['b' + str(l)] = tf.get_variable("parameters['b' + str(l)]", [layers_dims[l],1], initializer = tf.zeros_initializer())
    return parameters

and the way I am calling this function is

layers_dims = [100,1]
tf.reset_default_graph()
with tf.Session() as sess:
  parameters = initialize_parameters(layers_dims)

I am getting an error which says that tf.get_variable is not able to take the name of the variable as parameters['W' + str(l)]. Instead, when I use a name like W1 or b1, it works fine.

My aim is to initialize my "L" layer neural network and store the parameters in a dictionary named parameters.

Is there any way around it?


Answer:

You placed code that Python needs to execute into a string. Just fix this problem and you'll be fine.

parameters['W' + str(l)] = tf.get_variable("parameters['W' + str(l)]", [layers_dims[l],layers_dims[l-1]], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
parameters['b' + str(l)] = tf.get_variable("parameters['b' + str(l)]", [layers_dims[l],1], initializer = tf.zeros_initializer())

Here, you're assigning to the Python variable parameters['W' + str(l)] (which Python correctly evaluates to parameters["W1"], parameters["W2"], and so on) the TensorFlow variable with name "parameters['W' + str(l)]". As you can see, the name is a constant string.

Instead, you have to make Python evaluate the 'W' + str(l) expression.

Therefore, just remove the double quotes, turning the string into a concatenation expression that the Python interpreter will execute. You also have to drop the parameters dictionary lookup from the name, because it's wrong there (you would be looking up a key in a dictionary that isn't present yet).

parameters['W' + str(l)] = tf.get_variable('W' + str(l), [layers_dims[l],layers_dims[l-1]], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
parameters['b' + str(l)] = tf.get_variable('b' + str(l), [layers_dims[l],1], initializer = tf.zeros_initializer())
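The difference is easy to see outside TensorFlow: the unquoted expression is evaluated by Python, while the quoted one is passed through literally. A plain-Python sketch:

```python
layers_dims = [100, 50, 1]

names = []
for l in range(1, len(layers_dims)):
    names.append('W' + str(l))        # evaluated: 'W1', 'W2', ...

bad_name = "parameters['W' + str(l)]"  # a constant string, never evaluated

print(names)     # ['W1', 'W2']
print(bad_name)  # parameters['W' + str(l)]
```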

Question:

I have a Multi Task Network with two similar branches and a pre-trained network with only one branch (which is also same). I want to initialize the weights of the layers in the two branches(in my multi task network) with the weights of the layers in my pre-trained network.

Now, I can initialize one of the branch correctly by using the same name for the layers as in the pre-trained network. But, I have to keep the names of the layers in the other branch different, and thus those layers won't take the pre-trained weights.

Also, I don't want to share the weights in the two branches. So, giving the same name to the weights in the corresponding layers in the two branches won't work.

Is there a nice way/hack to do this ?

PS: I would want to avoid Network Surgery, but any comments, explaining a nice way to do it, are also welcome.

Clarification : I just want to initialize the two branches with the same weights. They can learn different weights during the training phase, since they are governed by different loss layers.


Answer:

The answer by Przemak D is a nice hack to do the above.

  1. give different names to the layers in the two branches and enable weight sharing
  2. initialize the network and train for 1-2 iterations
  3. then train the original network (without weight sharing), initializing the weights with the caffemodel obtained as a result of step 2.

The above is a nice hack, but net surgery is a better way to do this.

Question:

I'm trying to create a small neural network with custom connections between neurons. The connections should exist over several layers and not be fully connected (sparse) as shown in the picture. I would also like to do the weight initialization manually and not completely randomly. My goal is to determine whether a connection is positive or negative. Is it possible to create such a neural net in tensorflow (python/js) or pytorch?


Answer:

To summarize: Can you do it? -- Yes, absolutely. Is it going to be pretty? -- No, absolutely not.

In my explanation, I will focus on PyTorch, as this is the library that I am more comfortable with, and it is especially useful if you have custom operations that you can easily express in a pythonic manner. TensorFlow also has an eager execution mode (with more serious integration from version 2, if I remember correctly), but it is traditionally done with computational graphs, which make this whole thing a little uglier than it needs to be.

As you hopefully know, backpropagation (the "learning" step in any ANN) is basically an inverse pass through the network to calculate gradients, or at least something close enough to the truth for the problem at hand. Importantly, torch functions record this "reverse" direction, which makes it trivial for the user to call backpropagation functions.

To model a simple network as described in your image, we have one major disadvantage: the available operations usually excel at what they do because they are simple and can be optimized quite heavily. In your case, you have to express the different layers as custom operations, which generally scales incredibly poorly, unless you can express the functionality as some form of matrix operation, which I do not see straight away in your example. I am further assuming that you are applying some form of non-linearity, as the network would otherwise fail on any non-linearly separable problem.

import torch
import torch.nn as nn

class CustomNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.h_1_1 = nn.Sequential(nn.Linear(1, 2), nn.ReLU())  # top node in first layer
        self.h_1_2 = nn.Sequential(nn.Linear(1, 2), nn.ReLU())  # bottom node in first layer
        # Note that these nodes have no shared weights, which is why we
        # have to initialize them separately.
        self.h_2_1 = nn.Sequential(nn.Linear(1, 1), nn.ReLU())  # top node in second layer
        self.h_2_2 = nn.Sequential(nn.Linear(1, 1), nn.ReLU())  # bottom node in second layer

        self.h_3_1 = nn.Sequential(nn.Linear(2, 1), nn.ReLU())  # top node in third layer
        self.h_3_2 = nn.Sequential(nn.Linear(2, 1), nn.ReLU())  # bottom node in third layer
        # out doesn't require an activation function due to pairing with the loss function
        self.out = nn.Linear(2, 1)

    def forward(self, x):
        # x.shape: (batch_size, 2)

        # first layer. Outputs of shape (batch_size, 2), respectively
        out_top = self.h_1_1(x[:, 0:1])
        out_bottom = self.h_1_2(x[:, 1:2])

        # second layer. Outputs of shape (batch_size, 1), respectively
        out_top_2 = self.h_2_1(out_top[:, 0:1])
        out_bottom_2 = self.h_2_2(out_bottom[:, 0:1])

        # third layer. Outputs of shape (batch_size, 1), respectively
        # additional concatenation of previous outputs required.
        out_top_3 = self.h_3_1(torch.cat([out_top_2, -1 * out_top[:, 1:2]], dim=1))
        out_bottom_3 = self.h_3_2(torch.cat([out_bottom_2, -1 * out_bottom[:, 1:2]], dim=1))
        return self.out(torch.cat([out_top_3, out_bottom_3], dim=1))

As you can see, every computational step is (in this case rather explicitly) given, and very much possible. Again, once you want to scale the number of neurons in each layer, you will have to be a little more creative in how you process things, but for-loops do very much work in PyTorch as well. Note that this will in any case be much slower than a vanilla linear layer, though. If you can live with separately trained weights, you can always just define separate linear layers of smaller size and put them together in a more convenient fashion.
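An alternative trick, not used in the code above but common for sparse connectivity, is to keep one dense weight matrix and multiply it elementwise by a fixed 0/1 connectivity mask; the same matrix also lets you hand-pick the initial weights. A sketch under those assumptions (the mask and initial values below are arbitrary examples):

```python
import torch
import torch.nn as nn

mask = torch.tensor([[1., 0.], [0., 1.], [1., 1.]])       # which connections exist
init = torch.tensor([[0.5, 0.], [0., -0.3], [0.2, 0.1]])  # manual initial weights

weight = nn.Parameter(init.clone())
x = torch.randn(4, 3)      # batch of 4 samples, 3 input features
y = x @ (weight * mask)    # masked-out connections contribute nothing

y.sum().backward()
# gradients on masked entries are exactly zero, so they never train
print(weight.grad * (1 - mask))
```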

Question:

I am creating a model in TensorFlow with all layers using relu as the activation. However, when the batch size increases to 500, I want to change the model so that the layer before the output layer uses a sigmoid activation instead.

What I am confused about is: do I need to re-initialize all the variables, since I am replacing the optimizer in the middle? Or do I keep the old variables?


Answer:

This is a very interesting question. I think it depends on your datasets and models.

Yes: perhaps you can use the weights (from before batch size 500) as pre-training, similar to what Deep Belief Networks (with RBMs) do.

No: perhaps these pre-trained weights hurt your model and may be no better than a good initializer such as the Xavier initializer: https://www.tensorflow.org/versions/r0.8/api_docs/python/contrib.layers.html#xavier_initializer

I think it's worth trying both options.

Question:

If a neural network is initialized with small random weights and run for a very large number of iterations (20k or more), can the final accuracy differ much between reruns of the same model (a difference on the order of 10e-4 is okay)?


Answer:

Yes, it can differ. It will usually not be the case, but in theory it can, and sometimes it does. This is due to the randomness in the initialization and in the feeding order during training, both of which can lead the optimization to end up in a different local minimum of your cost function each time. This is why researchers developed initialization techniques that are supposed to be better than others, such as Xavier initialization.

It's good practice, if you have the time, to train several times, just to see if your results are very different between the runs, or not.
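If you want bit-for-bit identical reruns instead, fix the random seeds before each run. The idea, sketched with Python's stdlib RNG (deep-learning frameworks expose their own seeds, e.g. torch.manual_seed or tf.random.set_seed):

```python
import random

def run_experiment(seed):
    rng = random.Random(seed)
    # stand-in for "initialize small random weights and train"
    return [rng.uniform(-0.1, 0.1) for _ in range(4)]

print(run_experiment(42) == run_experiment(42))  # True: same seed, same init
print(run_experiment(42) == run_experiment(43))  # False: different seed
```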