Hot questions on vectorizing neural network code

Question:

I have built an RNN language model with attention, and I am creating a context vector for every element of the input by attending to all the previous hidden states (one direction only).

The most straightforward solution, in my opinion, is to use a for-loop over the RNN output so that each context vector is computed one after another.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RNN_LM(nn.Module):
    def __init__(self, hidden_size, vocab_size, embedding_dim=None, droprate=0.5):
        super().__init__()
        if not embedding_dim:
            embedding_dim = hidden_size
        self.embedding_matrix = nn.Embedding(vocab_size, embedding_dim)

        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size, batch_first=False)
        self.attn = nn.Linear(hidden_size, hidden_size)
        self.vocab_dist = nn.Linear(hidden_size, vocab_size)
        self.dropout = nn.Dropout(droprate)

    def forward(self, x):
        x = self.dropout(self.embedding_matrix(x.view(-1, 1)))
        x, states = self.lstm(x)
        #print(x.size())
        x = x.squeeze()
        content_vectors = [x[0].view(1, -1)]
        # for-loop over hidden states and attention
        for i in range(1, x.size(0)):
            prev_states = x[:i]
            current_state = x[i].view(1, -1)

            attn_prod = torch.mm(self.attn(current_state), prev_states.t())
            attn_weights = F.softmax(attn_prod, dim=1)
            context = torch.mm(attn_weights, prev_states)
            content_vectors.append(context)

        return self.vocab_dist(self.dropout(torch.cat(content_vectors)))

Note: The forward method here is only used for training.
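
For concreteness, a minimal usage sketch (the sizes and the input tensor here are made up, not part of the original code) showing what the forward method expects, a 1-D tensor of token indices, and what it returns:

import torch

model = RNN_LM(hidden_size=32, vocab_size=100)
tokens = torch.randint(0, 100, (10,))  # a single sequence of 10 token indices
logits = model(tokens)                 # shape (10, 100): one vocabulary distribution per position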

However, this solution is not very efficient: computing each context vector sequentially leaves little room for parallelism. Since the context vectors do not depend on each other, I wonder whether there is a non-sequential way of calculating them.

So, is there a way to compute the context vectors without a for-loop, so that more of the computation can be parallelized?


Answer:

Ok, for clarity: I assume we only really care about vectorizing the for loop. What is the shape of x? Assuming x is 2-dimensional, I have the following code, where v1 executes your loop and v2 is a vectorized version:

import torch
import torch.nn.functional as F

torch.manual_seed(0)

x = torch.randn(3, 6)

def v1():
    for i in range(1, x.size(0)):
        prev = x[:i]
        curr = x[i].view(1, -1)

        prod = torch.mm(curr, prev.t())
        attn = prod # same shape
        context = torch.mm(attn, prev)
        print(context)

def v2():
    # we're going to unroll the loop by vectorizing over the new,
    # 0-th dimension of `x`. We repeat it as many times as there
    # are iterations in the for loop
    repeated = x.unsqueeze(0).repeat(x.size(0), 1, 1)

    # we're looking to build a `prevs` tensor such that
    # prevs[i, x, y] == prev[x, y] at i-th iteration of the loop in v1,
    # up to 0-padding necessary to make them all the same size.
    # We need to build a higher-dimensional equivalent of torch.triu
    xs = torch.arange(x.size(0)).reshape(1, -1, 1)
    zs = torch.arange(x.size(0)).reshape(-1, 1, 1)
    prevs = torch.where(zs < xs, torch.tensor(0.), repeated)

    # this is an equivalent of the above iteration starting at 1
    prevs = prevs[:-1]
    currs = x[1:]

    # a batched matrix multiplication
    prod = torch.matmul(currs, prevs.transpose(1, 2))
    attn = prod # same shape
    context = torch.matmul(attn, prevs)
    # equivalent of a higher dimensional torch.diagonal
    contexts = torch.einsum('iij->ij', (context))
    print(contexts)

print(x)

print('\n------ v1 -------\n')
v1()
print('\n------ v2 -------\n')
v2()

which vectorizes your loop, with some caveats. First, I assume x is 2-dimensional. Secondly, I skip taking the softmax, claiming that it doesn't change the size of the input and thus doesn't affect the vectorization. That's true, but unfortunately the softmax of a 0-padded vector v is not equal to a 0-padded softmax of the unpadded v. This can be fixed with renormalization, though. Please let me know if my assumptions are correct and whether this is a good enough starting point for your work.
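
For completeness, a minimal sketch of that fix under the assumptions of v2 above (prod has shape (N-1, N-1, N), with the last dimension running over the 0-padded previous states): masking the padded logits to -inf before the softmax gives them exactly zero weight, which amounts to the same renormalization over the valid positions only.

import torch
import torch.nn.functional as F

N = 3                                        # matches x.size(0) in the snippet above
prod = torch.randn(N - 1, N - 1, N)          # stand-in for the `prod` computed in v2

# valid[k, 0, j] is True when position j belongs to x[:k+1], i.e. j <= k
steps = torch.arange(1, N).reshape(-1, 1)    # (N-1, 1)
positions = torch.arange(N).reshape(1, -1)   # (1, N)
valid = (positions < steps).unsqueeze(1)     # (N-1, 1, N), broadcasts over dim 1

# -inf logits get exactly zero probability, so the padded states are ignored
masked = prod.masked_fill(~valid, float('-inf'))
attn = F.softmax(masked, dim=-1)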

Question:

I would like to know if the following MATLAB/Octave code can be vectorized?

function grads = compute_grads(data, ann, lambda)
    [~, N] = size(data.X);
    % First propagate the data
    S = evaluate(data.X, ann);
    G = -(data.Y - S{2});

    % Second layer gradient is easy.
    l2g.W = G*S{1}';
    l2g.b = mean(G)';
    G = G' * ann{2}.W;

    [m, d] = size(ann{1}.W);
    [K, ~] = size(ann{2}.W);

    % I would like to vectorize this calculation.
    l1g.W = zeros(m, d);
    l1g.b = mean(G)';
    for i = 1:N
        x = data.X(:, i);
        g = G(i, :);
        l1 = S{1}(:, i);
        g = g * diag(l1 > 0);
        l1g.W = l1g.W + g'*x';
    end
    grads = {l1g, l2g};
    for k=1:length(grads)
        grads{k}.W = grads{k}.W/N + 2*lambda*ann{k}.W;
    end
end

The code computes the gradients for a two-layer neural network. The second layer has a softmax activation function, as shown by line 4, G = -(data.Y - S{2});. The first layer has a ReLU activation, implemented by the gunk in the for-loop which operates on one sample at a time.

As you can see, there is an explicit for-loop in the middle. Are there any array/matrix functions one can use instead to make the looping implicit?


Answer:

The loop can be reduced to:

l1g.W = (data.X * (G .* (S{1} > 0).')).';

Explanation:

When vectorizing, we should avoid unnecessary operations. For example, in

g = g * diag(l1 > 0);

we can use element-wise multiplication to achieve the same thing:

g = g .* (l1.' > 0);
%or
g = g .* (l1 > 0).';

Using that we can place some operations outside of the loop:

l1g.W = zeros(m, d);

G = G .* (S{1} > 0).';

for i = 1:N
    x = data.X(:, i);
    g = G(i, :);
    l1g.W = l1g.W + g'*x';
end

So we have something like this:

W=0;
for i = 1:N
    W = W + something(i);
end

that can be written as:

W = sum(something);

Our loop can be reduced to:

l1g.W = sum(some_structure_created_by_vectorizing(g'*x'));

We could use functions such as bsxfun to create such a structure (i.e. a 3D matrix), but such a structure often requires a large amount of memory, and then the loop may be more efficient than the vectorized version. But wait: we want a sum of products of g and x, so we can (and should always) think about using vector-matrix or matrix-matrix multiplication, because these are very fast operations. Since we are taking the outer product of g and x, matrix-matrix multiplication is the right choice.

G = G .* (S{1} > 0).';
l1g.W = (data.X * G).';

or

l1g.W = (data.X * (G .* (S{1} > 0).')).';
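
As a quick sanity check of the identity behind that reduction (a sum of per-sample outer products equals a single matrix product), here is a small sketch in NumPy with made-up sizes; X and G are stand-ins for data.X and the masked G above:

import numpy as np

rng = np.random.default_rng(0)
N, d, m = 5, 4, 3                        # samples, input dim, hidden units (made up)
X = rng.standard_normal((d, N))          # like data.X: one column per sample
G = rng.standard_normal((N, m))          # like G after the ReLU mask: one row per sample

loop_sum = np.zeros((m, d))
for i in range(N):
    loop_sum += np.outer(G[i], X[:, i])  # g' * x' in the MATLAB loop

vectorized = (X @ G).T                   # (data.X * G).'
print(np.allclose(loop_sum, vectorized)) # True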

Question:

The code below is correct, but I want to vectorize it (and possibly move it to the GPU) to increase its speed.

How can I convert it to vector form?

RF = 4;     
inhibatory = 0;    
overlap=3;   
act_funct = 'sig';
gap = RF-overlap;    
Image1 = rand(30,22);  
Image2 = rand(27,19);
size_image2 = size(Image2); % size_image2 is equal to 27x19
Image3 = rand(30,22); 
de_act_output = de_activate_Mat(Image1,act_funct); % finding derivative of the matrix. e.g. de_act_output = act_output.*(1-act_output) in case of sigmoid. 
result = zeros(size(Image1)); % pre-allocate the output
for u=1:size(Image1,1)
    for v=1:size(Image1,2)
        sum_val=0;
        iLowMax=max(ceil((u-(RF+inhibatory))/(gap-inhibatory)),1);
        iHighMax=min(floor((u-1)/(gap-inhibatory))+1, size_image2(1));
        jLowMax=max(ceil((v-(RF+inhibatory))/(gap-inhibatory)),1);
        jHighMax = min(floor((v-1)/(gap-inhibatory))+1, size_image2(2));
        sum_sens = sum(sum(Image2(iLowMax:iHighMax,jLowMax:jHighMax)));
        sum_val = sum_sens(:,:) .* Image3(u,v);
        result(u,v) = de_act_output(u,v) .* sum_val;
    end
end

Answer:

The blocks you are creating inside the nested loops with iLowMax:iHighMax,jLowMax:jHighMax have a parallelogram-like structure, which does not lend itself to easy vectorization. You could go for full vectorization of that step if performance is paramount in your case, and convolution looks like it would be of good use there. Listed here instead are some tweaks that speed up everything around that step by pre-calculating most of the other quantities, which should give an appreciable speedup. Here's the implementation -

U = 1:size(Image1,1); %// Create arrays of iteration steps
V = 1:size(Image1,2);

%// Calculate arrays of low-high row and column indices 
iLowMax=max(ceil((U-(RF+inhibatory))/(gap-inhibatory)),1);
iHighMax=min(floor((U-1)/(gap-inhibatory))+1, size_image2(1));

jLowMax=max(ceil((V-(RF+inhibatory))/(gap-inhibatory)),1);
jHighMax = min(floor((V-1)/(gap-inhibatory))+1, size_image2(2));

sens_sums(size(Image1,1),size(Image1,2)) = 0; %// Pre-allocation
for u=1:size(Image1,1)
    for v=1:size(Image1,2)
        sens = Image2(iLowMax(u):iHighMax(u),jLowMax(v):jHighMax(v));
        sens_sums(u,v) = sum(sens(:));
    end
end
result = sens_sums.*Image3.*de_act_output;