Hot questions for Using Neural networks in weighted average

Question:

Is it possible to load weights into a neural network in Keras through model.add? I want to set the weights using Xavier or another initializer. How can I do this in Keras?

For instance, given weight=[w1,w2,w3,w4], how could we do this in Keras?

For instance, in TF we have: initializer=tf.contrib.layers.xavier_initializer()


Answer:

Assuming xxx.h5 is your weights file, set its path:

weights_path = 'path/xxx.h5'

Then load the weights into your Keras model:

model.load_weights(weights_path, by_name=True)

Here model is your Keras model, and its architecture must match the weights you want to import. With by_name=True, weights are loaded only into layers whose names match those stored in the file.
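The question also asks about initializers and about setting explicit values such as [w1, w2, w3, w4], which load_weights does not cover. A minimal sketch of both, assuming TensorFlow's bundled Keras is installed; the layer name "dense_out", the layer sizes, and the weight values are made up for illustration:

```python
import numpy as np
from tensorflow import keras

model = keras.Sequential()
model.add(keras.Input(shape=(4,)))  # four inputs, so the kernel has four rows

# Option 1: let Keras draw the initial weights.
# "glorot_uniform" is Keras's name for Xavier uniform initialization
# (it is also the default kernel initializer for Dense layers).
model.add(keras.layers.Dense(1, kernel_initializer="glorot_uniform",
                             name="dense_out"))

# Option 2: overwrite the weights with explicit values.
# A Dense layer expects [kernel, bias]; the kernel shape is (inputs, units).
kernel = np.array([[0.1], [0.2], [0.3], [0.4]], dtype="float32")  # w1..w4
bias = np.zeros((1,), dtype="float32")
layer = model.get_layer("dense_out")
layer.set_weights([kernel, bias])

# Read the values back to confirm they were stored.
loaded_kernel, loaded_bias = layer.get_weights()
print(loaded_kernel.ravel())
```

The arrays passed to set_weights have to match the shapes the layer already has, so the layer must be built (here via the Input entry) before the call.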

Question:

I am reading about L2 regularization of neural network weights. As far as I understand, the intention is that weights get pushed towards zero the larger they become, i.e. large weights receive a high penalty while smaller ones are punished less severely.

The formula is usually:

new_weight = weight * update + lambda * sum(squared(weights))

My question: Why is this always positive? If the weight is already positive, the L2 term will never decrease it; it makes things worse and pushes the weight away from zero. This is the case in almost all formulae I have seen so far. Why is that?


Answer:

The formula you presented is very vague about what an 'update' is.

First, what is regularization? Generally speaking, the formula for L2 regularization is:

$$C = C_0 + \frac{\lambda}{2n} \sum_w w^2$$

(n is the training set size, lambda scales the influence of the L2 term)

You add an extra term to your original cost function C_0, and this term is also differentiated when computing the weight update. Intuitively, this punishes big weights, so the algorithm tries to find the best tradeoff between small weights and the chosen cost function. Small weights are associated with finding a simpler model, as the behavior of the network does not change much when given some random outlying values. This means it filters out the noise of the data and comes down to learning the simplest possible solution. In other words, it reduces overfitting.

Going towards your question, let's derive the update rule. For any weight w in the graph, we get

$$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w$$

Thus, the update formula for the weights can be written as (eta is the learning rate)

$$w \rightarrow w - \eta \frac{\partial C_0}{\partial w} - \frac{\eta \lambda}{n} w = \left(1 - \frac{\eta \lambda}{n}\right) w - \eta \frac{\partial C_0}{\partial w}$$

Considering only the first term, the weight seems to be driven towards zero regardless of what's happening. But the second term can add to the weight, if the partial derivative is negative. All in all, weights can be positive or negative, as you cannot derive a constraint from this expression. The same applies to the derivatives. Think of fitting a line with a negative slope: the weight has to be negative. To answer your question, neither the derivative of regularized cost nor the weights have to be positive all the time.
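To make the two terms concrete, here is a small sketch for a single weight. It assumes the common form of the L2-regularized gradient descent update, w <- (1 - eta * lambda / n) * w - eta * dC0/dw; the function name l2_sgd_step and all numeric values are hypothetical and chosen only for illustration.

```python
def l2_sgd_step(w, grad_c0, eta, lam, n):
    """One gradient step on C = C0 + (lam / (2 * n)) * sum(w ** 2) for one weight."""
    decay = 1.0 - eta * lam / n        # first term: rescales w towards zero
    return decay * w - eta * grad_c0   # second term: follows the data gradient


# Illustrative hyperparameters (assumptions, not taken from the question).
eta, lam, n = 0.1, 5.0, 100

# Positive weight, zero data gradient: only the decay acts, so w shrinks.
shrunk = l2_sgd_step(w=2.0, grad_c0=0.0, eta=eta, lam=lam, n=n)

# Positive weight, negative data gradient: the second term outweighs the decay.
grown = l2_sgd_step(w=2.0, grad_c0=-3.0, eta=eta, lam=lam, n=n)

# Negative weight, zero data gradient: the decay moves it up towards zero.
negative = l2_sgd_step(w=-2.0, grad_c0=0.0, eta=eta, lam=lam, n=n)

print(shrunk, grown, negative)
```

In this form the regularization term never adds a fixed positive amount to the weight. It multiplies the weight by a factor slightly below one, so positive and negative weights both move towards zero unless the data gradient pulls them elsewhere.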

If you need more clarification, leave a comment.

Question:

I would like to ask whether it is possible to set some weights in a layer of a pretrained model to a particular number or to zero.

For example, I want to download the LeCun model, set some weights in the last layer to a number, e.g. 4, and calculate the accuracy.

How can I do that?


Answer:

You are after "net surgery". In Python you can load the net and get direct access to the stored weights. Then you can tweak them as you please and save the modified net.

Question:

I recently watched Andrew Ng's video on SGDM. I understand that the momentum term updates the gradient by weighting the last gradient and using a small component of V_dw. I don't understand why momentum is also known as exponentially weighted average. Also, in Ng's video at 6:37 he says using Beta = 0.9 effectively means using an average of the last 10 gradients. Can someone explain how that works? To me, it's just a scalar weighting of 1-0.9 to all the gradients in the vector dW.

Appreciate any insight! I feel like I'm missing something fundamental.


Answer:

You just have to think about what is in your last gradient. The last gradient is already a weighted gradient, due to the momentum term.

In the first step you just do plain gradient descent. In the second step the momentum gradient is m_grad_2 = grad_2 + 0.9 m_grad_1. In the third step it is m_grad_3 = grad_3 + 0.9 m_grad_2, but m_grad_2 already contains a momentum term: 0.9 * m_grad_2 = 0.9 * (grad_2 + 0.9 grad_1) = 0.9 grad_2 + 0.81 grad_1. In general, a gradient computed k steps ago enters the current update with weight 0.9^k, so its influence decays exponentially with age, hence the name. After 10 steps the weight has dropped to 0.9^10 ≈ 0.35, roughly 1/e, which is why Beta = 0.9 is described as averaging over approximately the last 1/(1 - 0.9) = 10 gradients.
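The unrolling can be checked numerically. The sketch below assumes the form of the update mentioned in the question, v = beta * v + (1 - beta) * grad, which differs from the recursion in this answer only by the constant factor (1 - beta). The helper names and the scalar "gradients" are hypothetical.

```python
def momentum_history(grads, beta):
    """Run v = beta * v + (1 - beta) * g over a sequence of scalar gradients."""
    v = 0.0
    history = []
    for g in grads:
        v = beta * v + (1.0 - beta) * g
        history.append(v)
    return history


def ewa_weights(beta, steps):
    """Weight the current v places on the gradient from k steps ago, k = 0..steps-1."""
    return [(1.0 - beta) * beta ** k for k in range(steps)]


beta = 0.9
grads = [1.0, 2.0, 3.0]  # toy scalar gradients, oldest first

# The recursion and the explicit weighted sum give the same value.
v_recursive = momentum_history(grads, beta)[-1]
weights = ewa_weights(beta, len(grads))
v_unrolled = sum(w * g for w, g in zip(weights, reversed(grads)))

# Share of the total weight carried by the 10 most recent gradients: 1 - beta**10.
recent_share = sum(ewa_weights(beta, 10))

print(v_recursive, v_unrolled, recent_share)
```

With beta = 0.9 the ten most recent gradients carry about 65% of the total weight, and everything older shares the remaining 35%. So "an average of the last 10 gradients" is a rule of thumb for where most of the weight sits, not a hard cutoff.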