Hot questions for Using Neural networks in matrix multiplication


Yesterday I came across this question and for the first time noticed that the weights of the linear layer nn.Linear need to be transposed before applying matmul.

Code for applying the weights:

output = input.matmul(weight.t())

What is the reason for this?

Why are the weights not in the transposed shape just from the beginning, so they don't need to be transposed every time before applying the layer?


I found an answer here: Efficient forward pass in nn.Linear #2159

There seems to be no deep reason behind this layout. However, the transpose operation does not appear to slow the computation down.

According to the issue mentioned above, the transpose is (almost) free during the forward pass, whereas during the backward pass leaving it out would actually make the computation less efficient with the current implementation.
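A small NumPy sketch of why the forward-pass transpose is cheap. NumPy stands in for PyTorch here, and the sizes and variable names are made up for illustration; only the (out_features, in_features) weight layout and the input.matmul(weight.t()) pattern come from nn.Linear. In NumPy a transpose is a view that swaps strides without copying data, and as far as I know torch's t() behaves the same way.

```python
import numpy as np

rng = np.random.default_rng(0)

in_features, out_features, batch = 3, 4, 5                  # hypothetical sizes
weight = rng.standard_normal((out_features, in_features))   # nn.Linear-style layout
x = rng.standard_normal((batch, in_features))

# Forward pass as in the question: transpose, then multiply.
output = x @ weight.T                                       # shape (batch, out_features)

# The transpose is a view on the same buffer: only the strides are swapped.
weight_t = weight.T
shares_buffer = np.shares_memory(weight, weight_t)

print(output.shape, shares_buffer, weight.strides, weight_t.strides)
```

No element of the weight matrix is moved; the matrix-multiply routine is simply told to read the same memory in a different order.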

The last post in that issue sums it up quite nicely:

It's historical weight layout, changing it is backward-incompatible. Unless there is some BIG benefit in terms of speed or convenience, we wont break userland.


I'm currently writing a TensorFlow program that requires multiplying a batch of 2-D tensors (a 3-D tensor of shape [None,...]) with a 2-D matrix W. This requires turning W into a 3-D tensor, which requires knowing the batch size.

I have not been able to do this: tf.batch_matmul is no longer available, and x.get_shape().as_list()[0] returns None, which is invalid for a reshaping/tiling operation. Any suggestions? I've seen some people use config.cfg.batch_size, but I don't know what that is.


The solution is to use a combination of tf.shape (which returns the shape at runtime) and tf.tile (which accepts the dynamic shape).

import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (placeholders and sessions)

x = tf.placeholder(shape=[None, 2, 3], dtype=tf.float32)
W = tf.Variable(initial_value=np.ones([3, 4]), dtype=tf.float32)
print(x.shape)                # Static shape, batch dimension unknown: (?, 2, 3)

batch_size = tf.shape(x)[0]   # A tensor that gets the batch size at runtime
W_expand = tf.expand_dims(W, axis=0)
W_tile = tf.tile(W_expand, multiples=[batch_size, 1, 1])
result = tf.matmul(x, W_tile) # Can multiply now!

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())  # W must be initialized before it is read
  feed_dict = {x: np.ones([10, 2, 3])}
  print(sess.run(batch_size, feed_dict=feed_dict))    # 10
  print(sess.run(result, feed_dict=feed_dict).shape)  # (10, 2, 4)
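The same idea can be checked outside a TensorFlow graph. The following is a NumPy analogue, not TensorFlow code: np.tile plays the role of tf.tile, and the array sizes are the ones from the answer. It also shows two alternatives that avoid tiling altogether, broadcasting and flattening the batch into rows; whether a particular TensorFlow version offers the same broadcasting in tf.matmul is not something the answer states, so treat that part as a NumPy-only observation.

```python
import numpy as np

batch_size = 10                                   # only known at run time in the TF graph
x = np.ones((batch_size, 2, 3), dtype=np.float32)
W = np.ones((3, 4), dtype=np.float32)

# Explicit tiling, mirroring tf.expand_dims + tf.tile.
W_tile = np.tile(W[np.newaxis, :, :], (x.shape[0], 1, 1))    # (10, 3, 4)
tiled = np.matmul(x, W_tile)                                 # (10, 2, 4)

# Broadcasting: np.matmul treats W as one matrix shared by every batch item.
broadcast = np.matmul(x, W)                                  # (10, 2, 4)

# Flattening the batch into rows needs only a single 2-D product.
flat = (x.reshape(-1, 3) @ W).reshape(-1, 2, 4)              # (10, 2, 4)

print(tiled.shape, broadcast.shape, flat.shape)
```

All three results hold the same numbers; the tiled version just spends extra memory on copies of W.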


I'm implementing a neural network in Python. As part of backpropagation I need to multiply a 3-D array, call it A, of shape (200, 100, 1), by a 2-D matrix, call it W, of shape (100, 200); the result should have shape (200, 200, 1).

A is an error vector and W is a weight matrix; the product is used to calculate the updates for the previous layer.

I tried solving it using matrix_multiply (from numpy.core.umath_tests). I also tried reshaping W to (100, 200, 1) and then multiplying, but that throws

ValueError: matrix_multiply: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (m,n),(n,p)->(m,p) (size 100 is different from 1).

How can I solve this?


You could use np.tensordot and then permute the axes with swapaxes, or simply reshape.

Alternatively, since A has only one slice along its last axis, we can take that slice, do an ordinary matrix multiplication, and then extend the result back into 3-D.

Or we can use np.einsum.
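A sketch of the three options, assuming A and W have the shapes given in the question. The random values and the variable names out_swap, out_reshape, out_slice and out_einsum are placeholders chosen for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 100, 1))   # error terms, shape from the question
W = rng.standard_normal((100, 200))      # weight matrix, shape from the question

# 1) tensordot contracts A's axis 1 with W's axis 0, giving (200, 1, 200);
#    swapaxes or reshape then moves the singleton axis to the end.
out_swap = np.tensordot(A, W, axes=([1], [0])).swapaxes(1, 2)
out_reshape = np.tensordot(A, W, axes=([1], [0])).reshape(A.shape[0], W.shape[1], 1)

# 2) Take the only slice along the last axis, multiply as 2-D matrices,
#    then add the trailing axis back.
out_slice = A[:, :, 0].dot(W)[..., np.newaxis]

# 3) einsum: sum over the shared index j, keep i, l and the trailing k.
out_einsum = np.einsum('ijk,jl->ilk', A, W)

print(out_swap.shape, out_reshape.shape, out_slice.shape, out_einsum.shape)
```

The reshape variant is only safe here because the axis being moved has length 1; for a longer last axis, swapaxes or einsum would be the ones to use.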



Is this code segment:

layer_1 = self.layer_0.dot(self.weights_0_1)

The same as this one?

layer_1 = np.dot(self.layer_0, self.weights_0_1)


Yes: dot is available both as a function in the numpy module and as an instance method of an array object.
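A quick check of that equivalence. The arrays below are hypothetical stand-ins for the network's input layer and weights; the @ operator is included as a third spelling that gives the same result for 2-D arrays.

```python
import numpy as np

layer_0 = np.array([[1.0, 2.0, 3.0]])           # hypothetical input row, shape (1, 3)
weights_0_1 = np.array([[0.1, 0.2],
                        [0.3, 0.4],
                        [0.5, 0.6]])             # hypothetical weights, shape (3, 2)

as_method = layer_0.dot(weights_0_1)             # instance method of the array
as_function = np.dot(layer_0, weights_0_1)       # function in the numpy module
as_operator = layer_0 @ weights_0_1              # matrix-multiplication operator

print(as_method, as_function, as_operator)
```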


Input = rand(32,32,3);
Theta = rand(10,16);

Output = zeros(30,30,3,16); % preallocate
for i = 1:30
     for j = 1:30
          Output(i,j,:,:) = permute(cat(2,ones(1,1,3),reshape(Input(i:i+2,j:j+2,1:3),1,9,3)), [2 3 1]).'*Theta;
     end
end

Whew! I know there is a lot going on here, but maybe there is a way to speed this up. This code breaks the channels of the 32 by 32 CMY image Input down into overlapping 3 by 3 patches, reshapes each patch into a vector, prepends a 1 (the bias term) and multiplies by the matrix Theta, to get feature maps (as in convolutional neural nets) as the output.


Try changing this line:

Output(i,j,:,:) = permute(cat(2,ones(1,1,3),reshape(Input(i:i+2,j:j+2,1:3),1,9,3)), [2 3 1]).'*Theta;

To this:

Output2(i,j,:,:) = [1 1 1; reshape(Input(i:i+2,j:j+2,:),9,3,1)].'*Theta;

Averaged over a thousand runs, this speeds the code up from 16.3 ms to 6.9 ms.
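For readers following along in Python, here is a NumPy sketch of the same computation. It is a translation written for illustration, not part of the MATLAB answer: patch_features mirrors one loop iteration (with 0-based indices and order="F" to imitate MATLAB's column-major reshape), and the loop-free variant uses sliding_window_view, which assumes NumPy 1.20 or later. No timings are claimed for it.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
Input = rng.random((32, 32, 3))
Theta = rng.random((10, 16))

def patch_features(i, j):
    """One loop iteration (0-based i, j): 3x3x3 patch -> (3, 16) block."""
    patch = Input[i:i + 3, j:j + 3, :]
    cols = patch.reshape(9, 3, order="F")            # one column of 9 values per channel
    with_bias = np.vstack([np.ones((1, 3)), cols])   # prepend the row of ones -> (10, 3)
    return with_bias.T @ Theta                       # (3, 16)

# Loop-free variant: gather all 30x30 windows at once.
windows = sliding_window_view(Input, (3, 3), axis=(0, 1))        # (30, 30, 3, 3, 3)
flat = windows.transpose(0, 1, 2, 4, 3).reshape(30, 30, 3, 9)    # column-major inside each patch
ones = np.ones(flat.shape[:-1] + (1,))                           # bias entry for every patch
Output = np.concatenate([ones, flat], axis=-1) @ Theta           # (30, 30, 3, 16)

print(Output.shape)
```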


When the forward method contains only one layer, torch.add(torch.bmm(x, exp_w), self.b), my model backpropagates correctly. When I add another layer, torch.add(torch.bmm(out, exp_w2), self.b2), the gradients are no longer updated and the model isn't learning. If I change the activation function from nn.Sigmoid to nn.ReLU, it works with two layers.

I have been thinking about this for a day now and cannot figure out why it does not work with nn.Sigmoid.

I've tried different learning rates, loss functions and optimization functions, but no combination seems to work. When I sum the weights before and after training, the sums are identical.


import torch
import torch.nn as nn


class MyModel(nn.Module):

    def __init__(self, input_dim, output_dim):
        super(MyModel, self).__init__()
        self.input_dim = input_dim
        self.output_dim = output_dim
        hidden_1_dimentsions = 20
        self.w = torch.nn.Parameter(torch.empty(input_dim, hidden_1_dimentsions).uniform_(0, 1))
        self.b = torch.nn.Parameter(torch.empty(hidden_1_dimentsions).uniform_(0, 1))

        self.w2 = torch.nn.Parameter(torch.empty(hidden_1_dimentsions, output_dim).uniform_(0, 1))
        self.b2 = torch.nn.Parameter(torch.empty(output_dim).uniform_(0, 1))

    def activation(self):
        return torch.nn.Sigmoid()

    def forward(self, x):
        x = x.view((x.shape[0], 1, self.input_dim))

        exp_w = self.w.expand(x.shape[0], self.w.size(0), self.w.size(1))
        out = torch.add(torch.bmm(x, exp_w), self.b)
        exp_w2 = self.w2.expand(out.shape[0], self.w2.size(0), self.w2.size(1))
        out = torch.add(torch.bmm(out, exp_w2), self.b2)
        out = self.activation()(out)
        return out.view(x.shape[0])


Besides loss functions, activation functions and learning rates, your parameter initialisation is also important. I suggest you take a look at Xavier initialisation.

Furthermore, for a wide range of problems and network architectures, Batch Normalization, which normalises your activations to zero mean and unit standard deviation, helps.

If you are interested in the reason for this, it is mostly due to the vanishing gradient problem, which means that your gradients get so small that your weights don't get updated. It is so common that it has its own page on Wikipedia.
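A rough NumPy illustration of that effect, under assumptions that are not in the question: input_dim = 50, an output dimension of 1, inputs drawn from U(0, 1), and biases left out. With every weight drawn from U(0, 1), as in the model above, the value fed to the final sigmoid is a large positive number, so the sigmoid's derivative s(1 - s) is practically zero. With Xavier (Glorot) uniform initialisation, U(-a, a) with a = sqrt(6 / (fan_in + fan_out)), the pre-activation stays near zero and the derivative stays usable.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 50, 20, 1      # hypothetical sizes
x = rng.uniform(0.0, 1.0, size=(8, input_dim))     # hypothetical batch of inputs

def output_sigmoid_grad(w1, w2):
    """Mean derivative of the final sigmoid for two stacked linear layers."""
    z = (x @ w1) @ w2                               # biases omitted for brevity
    s = sigmoid(z)
    return float(np.mean(s * (1.0 - s)))

def xavier(fan_in, fan_out):
    """Xavier/Glorot uniform initialisation."""
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_in, fan_out))

# Initialisation used in the question: every weight from U(0, 1).
w1_uniform = rng.uniform(0.0, 1.0, size=(input_dim, hidden_dim))
w2_uniform = rng.uniform(0.0, 1.0, size=(hidden_dim, output_dim))

grad_uniform = output_sigmoid_grad(w1_uniform, w2_uniform)
grad_xavier = output_sigmoid_grad(xavier(input_dim, hidden_dim),
                                  xavier(hidden_dim, output_dim))

print(grad_uniform, grad_xavier)
```

Every weight gradient is multiplied by that sigmoid derivative during backpropagation, which is consistent with the weights not changing at all during training.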


I'm going through the "Make Your Own Neural Networks" book and following through the examples to implement my first NN. I understood the basic concepts and in particular this equation where the output is calculated doing a matrix dot product of the inputs and weights:

X = W * I

Where X is the output before applying the Sigmoid, W the link weights and I the inputs.

Now in the book, they have a function that takes this input as an array and then converts that array to a 2-dimensional one. My understanding is that the value of X is calculated from:

W = [0.1, 0.2, 0.3
     0.4, 0.5, 0.6
     0.7, 0.8, 0.9]

I = [1
     2
     3]

So if I now pass in an array for my inputs like [1,2,3], why do I need to do the following to convert it to a 2-D array, as is done in the book:

inputs = numpy.array(inputs, ndmin=2).T

Any ideas?


Your input here is a one-dimensional list (or a one-dimensional array):

I = [1, 2, 3]

The idea behind this one-dimensional array is the following: if these numbers represent the width in centimetres of a flower petal, its length, and its weight in grams: your flower petal will have a width of 1cm, a length of 2cm, and a weight of 3g.

Converting your input I to a 2-D array is necessary here for two things:

  • first, by default, converting this list to a NumPy array using numpy.array(inputs) will yield an array of shape (3,), which has only one dimension. Setting ndmin=2 gives it shape (1, 3), and the transpose .T then turns it into a (3, 1) column, which avoids NumPy-related problems, for instance when using matrix multiplication.
  • secondly, and perhaps more importantly, as I said in my comment, data in neural networks are conventionally stored in arrays this way, with each row of the array representing a different feature (so there is a unique list for each feature). In other words, it is just a conventional way to make sure you're not confusing apples and pears (in this case, length and weight).

So when you do inputs = numpy.array(inputs, ndmin=2).T, you end up with:

array([[1],    # width
       [2],    # length
       [3]])   # weight

and not:

array([1, 2, 3])
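The shapes at each step can be checked directly. This sketch uses the W and the [1, 2, 3] input from the question; the names flat, row and column are just labels for this example.

```python
import numpy as np

inputs_list = [1, 2, 3]
W = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6],
              [0.7, 0.8, 0.9]])

flat = np.array(inputs_list)                  # shape (3,): one dimension only
row = np.array(inputs_list, ndmin=2)          # shape (1, 3): ndmin adds a leading axis
column = np.array(inputs_list, ndmin=2).T     # shape (3, 1): the transpose makes it a column

X = np.dot(W, column)                         # shape (3, 1), one value per output node
X_flat = np.dot(W, flat)                      # shape (3,): same numbers, but 1-D

print(flat.shape, row.shape, column.shape, X.shape, X_flat.shape)
```

The numbers are the same either way; the 2-D column simply keeps every later matrix product and transpose explicit about which axis is which.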

Hope it made things a bit clearer!


I am learning about neural networks and, in the process, I have implemented a few fully connected nets. I usually add a column of bias units (1s) to the input matrix and an extra row of weights to the weight matrix, because that is how I learned to implement neural nets in an online course. However, in many implementations on GitHub I have found that it can also be done without inserting bias units into the matrix, by adding the bias separately: XW + b, where b is the bias.

I don't understand how it works. It seems like a better and more efficient implementation, but I don't understand it. For instance, consider the following example:

        1 2 3 4       0.5 0.5
   X =  1 4 5 6    W= 2   3     X*W = [4x2 matrix] 
        1 8 9 5       5   8
                      2   3

The first column in X is bias unit and so is the first row in W

But if the same is written without directly inserting the bias column but by adding it separately it becomes:

       2 3 4       2 3
   X=  4 5 6    W= 5 8    b = 0.5 0.5    X*W = [3x2 matrix]
       8 9 5       2 3

It can be clearly seen that X*W + b from the second expression is not equal to the first expression. Furthermore, b, a 1x2 matrix, cannot be added to X*W, which is a 3x2 matrix.

So, how can I implement biases using the second method?


The illustrated methods are the same.

Most important:

Weights are usually kept small in practice (for example between -1 and 1); the larger values used here are only for illustration.

Note: the first example also gives a 3x2 matrix.

      1 2 3 4           0.5 0.5          27.5  42.5
 X =  1 4 5 6        W= 2   3      X*W = 45.5  70.5
      1 8 9 5           5   8            71.5  111.5                                    
                        2   3

In the result matrix, each row corresponds to one set of inputs and each column to one neuron.

Adding the bias afterwards is not a problem.

Taking the second example:

       |27  42 |            |27 42 |   |0.5 0.5|
 X*W = |45  70 |    X*W+b = |45 70 | + |0.5 0.5| : Same Result.
       |71  111|            |71 111|   |0.5 0.5|                  
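This can be verified numerically with the matrices from the question. The sketch below is in NumPy, where adding the length-2 bias to the 3x2 product works through broadcasting: the same bias row is added to every row of X*W. The names X_aug, W_aug, folded and separate are made up for the example.

```python
import numpy as np

X = np.array([[2, 3, 4],
              [4, 5, 6],
              [8, 9, 5]], dtype=float)
W = np.array([[2, 3],
              [5, 8],
              [2, 3]], dtype=float)
b = np.array([0.5, 0.5])

# Method 1: fold the bias into the matrices (column of ones in X, bias row in W).
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # (3, 4)
W_aug = np.vstack([b, W])                          # (4, 2)
folded = X_aug @ W_aug                             # (3, 2)

# Method 2: keep the bias separate; broadcasting adds b to each row of X @ W.
separate = X @ W + b                               # (3, 2)

print(folded)
print(separate)
```

Both print the 3x2 matrix with rows (27.5, 42.5), (45.5, 70.5) and (71.5, 111.5).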
If the problem is here:

Take the formula at the link below: Feed_Forward formula

It assumes a neural network with 1 input, 1 hidden and 1 output neuron, and it does not involve a matrix product. It is a single feed-forward step:

sumH1 = I x w1 + b x wb;

Note: b x wb = 1 x wb = wb.

This step is then coded in the "implementation" paragraph:

z1 = x.dot(W1) + b1
a1 = np.tanh(z1)
z2 = a1.dot(W2) + b2
Or here:

b1 ∈ R^500

Here the author gives a hypothetical example with 2 input, 500 hidden and 2 output neurons, and says that w1 is one of the 2x500 connections between I and H, b1 is one of the 500 biases of H, w2 is one of the 500x2 connections between H and O, and b2 is one of the 2 biases of O.

To sum up

You can do the feed-forward step using matrices, but you have to add the bias for each connection. The first example you showed is the simplest way. Clearly, if you choose the second one, you cannot multiply the 1xN bias matrix with the 3x2 matrix. But you can add the bias when you call the activation function:

a1 = tanH(z1 + b[1]); 

Neither of the two is a better or more efficient implementation than the other.

In the second example you are splitting that matrix into 2 parts:

I*W : matrix[3x4]     and    b : vector[3] = { 1, 1, 1 }

In this case you need to add the bias at each hidden neuron too. In your first example you added the bias directly, where: matrix[0][0] = 1 x 0.5 + 2 x 2 + 3 x 5 etc.

Note: matrix[0][0] = sumH1;

In the second one you add the bias later, where: matrix[0][0] = 2 x 2 + 3 x 5 etc. and sumH1 = matrix[0][0] + B[0]

Note: by "B" we mean the weights of B; B = 1.

Maybe with the second example the code will end up a little tidier, nothing more. There is no significant change in computing performance or memory use.