## Hot questions for Using Neural networks in vectorization

Question:

I have build a RNN language model with attention and I am creating context vector for every element of the input by attending all the previous hidden states *(only one direction)*.

The most straight forward solution in my opinion is using a *for-loop* over the RNN output, such that each context vector is computed one after another.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RNN_LM(nn.Module):
    def __init__(self, hidden_size, vocab_size, embedding_dim=None, droprate=0.5):
        super().__init__()
        if not embedding_dim:
            embedding_dim = hidden_size
        self.embedding_matrix = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(input_size=embedding_dim,
                            hidden_size=hidden_size,
                            batch_first=False)
        self.attn = nn.Linear(hidden_size, hidden_size)
        self.vocab_dist = nn.Linear(hidden_size, vocab_size)
        self.dropout = nn.Dropout(droprate)

    def forward(self, x):
        x = self.dropout(self.embedding_matrix(x.view(-1, 1)))
        x, states = self.lstm(x)
        x = x.squeeze()
        content_vectors = [x[0].view(1, -1)]
        # for-loop over hidden states and attention
        for i in range(1, x.size(0)):
            prev_states = x[:i]
            current_state = x[i].view(1, -1)
            attn_prod = torch.mm(self.attn(current_state), prev_states.t())
            attn_weights = F.softmax(attn_prod, dim=1)
            context = torch.mm(attn_weights, prev_states)
            content_vectors.append(context)
        return self.vocab_dist(self.dropout(torch.cat(content_vectors)))
```

*Note: The forward method here is only used for training.*

However, this solution is not very efficient: computing each context vector sequentially means the code does not parallelize well. But since the context vectors do not depend on each other, I wonder if there is a non-sequential way of calculating them.

**So, is there a way to compute the context vectors without a for-loop, so that more of the computation can be parallelized?**

Answer:

Ok, for clarity: I assume we only really care about vectorizing the `for` loop. What is the shape of `x`? Assuming `x` is 2-dimensional, I have the following code, where `v1` executes your loop and `v2` is a vectorized version:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

x = torch.randn(3, 6)

def v1():
    for i in range(1, x.size(0)):
        prev = x[:i]
        curr = x[i].view(1, -1)
        prod = torch.mm(curr, prev.t())
        attn = prod  # same shape
        context = torch.mm(attn, prev)
        print(context)

def v2():
    # we're going to unroll the loop by vectorizing over the new,
    # 0-th dimension of `x`. We repeat it as many times as there
    # are iterations in the for loop
    repeated = x.unsqueeze(0).repeat(x.size(0), 1, 1)

    # we're looking to build a `prevs` tensor such that
    # prevs[i, x, y] == prev[x, y] at i-th iteration of the loop in v1,
    # up to 0-padding necessary to make them all the same size.
    # We need to build a higher-dimensional equivalent of torch.triu
    xs = torch.arange(x.size(0)).reshape(1, -1, 1)
    zs = torch.arange(x.size(0)).reshape(-1, 1, 1)
    prevs = torch.where(zs < xs, torch.tensor(0.), repeated)

    # this is an equivalent of the above iteration starting at 1
    prevs = prevs[:-1]
    currs = x[1:]

    # a batched matrix multiplication
    prod = torch.matmul(currs, prevs.transpose(1, 2))
    attn = prod  # same shape
    context = torch.matmul(attn, prevs)
    # equivalent of a higher dimensional torch.diagonal
    contexts = torch.einsum('iij->ij', context)
    print(contexts)

print(x)
print('\n------ v1 -------\n')
v1()
print('\n------ v2 -------\n')
v2()
```

which vectorizes your loop, with some caveats. First, I assume `x` is 2-dimensional. Secondly, I skip taking the `softmax`, claiming it doesn't change the size of the input and thus doesn't affect vectorization. That's true, but unfortunately the softmax of a 0-padded vector `v` is not equal to a 0-padded softmax of the unpadded `v`. This can be fixed with renormalization, though. Please let me know if my assumptions are correct and whether this is a good enough starting point for your work.
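To see the caveat concretely, here is a small NumPy sketch (names are mine, not from the code above) showing that zero-padding *before* the softmax is not the same as padding *after* it, and that padding with `-inf` instead of `0` recovers the unpadded result:

```python
import numpy as np

def softmax(v):
    # numerically stable softmax of a 1-D vector
    e = np.exp(v - v.max())
    return e / e.sum()

v = np.array([1.0, 2.0])
s_then_pad = np.concatenate([softmax(v), [0.0]])   # softmax first, then 0-pad

padded = np.concatenate([v, [0.0]])                # 0-pad first, then softmax
s_padded = softmax(padded)                         # the pad contributes exp(0) = 1

masked = np.concatenate([v, [-np.inf]])            # pad with -inf instead
s_masked = softmax(masked)                         # exp(-inf) = 0: pad adds nothing

print(np.allclose(s_padded, s_then_pad))   # False
print(np.allclose(s_masked, s_then_pad))   # True
```

Masking the padded positions with `-inf` before the softmax is the standard alternative to renormalizing afterwards.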

Question:

I would like to know if the following MATLAB/Octave code can be vectorized?

```matlab
function grads = compute_grads(data, ann, lambda)
    [~, N] = size(data.X);
    % First propagate the data
    S = evaluate(data.X, ann);
    G = -(data.Y - S{2});
    % Second layer gradient is easy.
    l2g.W = G*S{1}';
    l2g.b = mean(G)';
    G = G' * ann{2}.W;
    [m, d] = size(ann{1}.W);
    [K, ~] = size(ann{2}.W);
    % I would like to vectorize this calculation.
    l1g.W = zeros(m, d);
    l1g.b = mean(G)';
    for i = 1:N
        x = data.X(:, i);
        g = G(i, :);
        l1 = S{1}(:, i);
        g = g * diag(l1 > 0);
        l1g.W = l1g.W + g'*x';
    end
    grads = {l1g, l2g};
    for k = 1:length(grads)
        grads{k}.W = grads{k}.W/N + 2*lambda*ann{k}.W;
    end
end
```

The code computes the gradients for a two-layer neural network. The second layer has a softmax activation function, as shown by the line `G = -(data.Y - S{2});`. The first layer has a ReLU activation, implemented by the gunk in the `for`-loop, which operates on one sample at a time.

As you can see, there is an explicit `for`-loop in the middle. Are there any array/matrix functions one can use instead to make the looping implicit?

Answer:

The loop can be reduced to:

```matlab
l1g.W = (data.X * (G .* (S{1} > 0).')).';
```

*Explanation:*

In vectorization, we should avoid unnecessary operations. For example, in

```matlab
g = g * diag(l1 > 0);
```

we can use element-wise multiplication to achieve the same thing:

```matlab
g = g .* (l1.' > 0);   % or: g = g .* (l1 > 0).';
```
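The same equivalence, sketched in NumPy rather than Octave (the variable names mirror the loop above):

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.standard_normal((1, 5))     # a gradient row, like g in the loop
l1 = rng.standard_normal(5)         # first-layer activations
mask = (l1 > 0).astype(float)

via_diag = g @ np.diag(mask)        # the g * diag(l1 > 0) form
via_elementwise = g * mask          # the cheaper element-wise form

print(np.allclose(via_diag, via_elementwise))  # True
```

Multiplying by a diagonal matrix builds and multiplies an n-by-n matrix just to scale columns; the element-wise form does the same scaling directly.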

Using that we can place some operations outside of the loop:

```matlab
l1g.W = zeros(m, d);
G = G .* (S{1} > 0).';
for i = 1:N
    x = data.X(:, i);
    g = G(i, :);
    l1g.W = l1g.W + g'*x';
end
```

So we have something like this:

```matlab
W = 0;
for i = 1:N
    W = W + something(i);
end
```

that can be written as:

```matlab
W = sum(something);
```

Our loop can be reduced to:

```matlab
l1g.W = sum(some_structure_created_by_vectorizing(g'*x'));
```

We can use functions such as `bsxfun` to create such a structure (i.e. a 3D matrix), but such a structure often requires a large amount of memory, and the loop may then be more efficient than the vectorization. But wait: we want the sum of *products* of `g` and `x`, so we can (and should always) think about using vector-matrix or matrix-matrix multiplication, because these are very fast operations. Since we are performing an outer product of `g` and `x`, matrix-matrix multiplication is the right choice:

```matlab
G = G .* (S{1} > 0).';
l1g.W = (data.X * G).';
```

or

```matlab
l1g.W = (data.X * (G .* (S{1} > 0).')).';
```
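The identity at work here, a sum of per-sample outer products collapsing into a single matrix product, can be sanity-checked with a small NumPy sketch (the dimensions are made up; `X` plays the role of `data.X`, `G` the role of the already-masked gradient matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, N = 4, 3, 10
X = rng.standard_normal((d, N))   # one column per sample, like data.X
G = rng.standard_normal((N, m))   # one gradient row per sample

# Loop form: accumulate the outer product g' * x' per sample
W_loop = np.zeros((m, d))
for i in range(N):
    W_loop += np.outer(G[i], X[:, i])

# Vectorized form: the sum of outer products is one matrix product
W_vec = (X @ G).T

print(np.allclose(W_loop, W_vec))  # True
```

The matrix product computes exactly the same sum of N outer products, but in one highly optimized BLAS call and without materializing a 3D intermediate.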

Question:

The code below is correct, but I want to vectorize it (and possibly move it to the GPU) to increase its speed.

How can I convert it to vector form?

```matlab
RF = 4;
inhibatory = 0;
overlap = 3;
act_funct = 'sig';
gap = RF - overlap;

Image1 = rand(30,22);
Image2 = rand(27,19);   % size_image2 is equal to 27x19
Image3 = rand(30,22);

% finding derivative of the matrix, e.g.
% de_act_output = act_output.*(1-act_output) in case of sigmoid
de_act_output = de_activate_Mat(Image1, act_funct);

for u = 1:size(Image1,1)
    for v = 1:size(Image1,2)
        sum_val = 0;
        iLowMax = max(ceil((u-(RF+inhibatory))/(gap-inhibatory)), 1);
        iHighMax = min(floor((u-1)/(gap-inhibatory))+1, size_image2(1));
        jLowMax = max(ceil((v-(RF+inhibatory))/(gap-inhibatory)), 1);
        jHighMax = min(floor((v-1)/(gap-inhibatory))+1, size_image2(2));
        sum_sens = sum(sum(Image2(iLowMax:iHighMax, jLowMax:jHighMax)));
        sum_val = sum_sens .* Image3(u,v);
        result(u,v) = de_act_output(u,v) .* sum_val;
    end
end
```

Answer:

The `iLowMax:iHighMax, jLowMax:jHighMax` indexing inside the nested loops creates a parallelogram-like structure of blocks, which won't lead to any easily vectorizable code. You could go full-throttle on vectorization if performance is paramount in your case, and `convolution` looks like a good fit there. Listed here instead are some tweaks that speed up everything around that step by pre-calculating most of the other pieces, which should give an appreciable speedup. Here's the implementation:

```matlab
%// Create arrays of iteration steps
U = 1:size(Image1,1);
V = 1:size(Image1,2);

%// Calculate arrays of low-high row and column indices
iLowMax = max(ceil((U-(RF+inhibatory))/(gap-inhibatory)), 1);
iHighMax = min(floor((U-1)/(gap-inhibatory))+1, size_image2(1));
jLowMax = max(ceil((V-(RF+inhibatory))/(gap-inhibatory)), 1);
jHighMax = min(floor((V-1)/(gap-inhibatory))+1, size_image2(2));

sens_sums(size(Image1,1), size(Image1,2)) = 0;  %// Pre-allocation
for u = 1:size(Image1,1)
    for v = 1:size(Image1,2)
        sens = Image2(iLowMax(u):iHighMax(u), jLowMax(v):jHighMax(v));
        sens_sums(u,v) = sum(sens(:));
    end
end

result = sens_sums .* Image3 .* de_act_output;
```
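As a sketch of the block-sum idea mentioned above: if the window size were fixed (here a hypothetical `k`-by-`k` window; the real windows in this code vary near the borders), every rectangular block sum of `Image2` could be read off a summed-area table in four lookups, with no inner loop at all. In NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
Image2 = rng.random((27, 19))
k = 2  # hypothetical fixed window size, standing in for iHighMax-iLowMax+1

# Summed-area table: S[i, j] = sum of Image2[:i, :j]
S = np.zeros((Image2.shape[0] + 1, Image2.shape[1] + 1))
S[1:, 1:] = Image2.cumsum(axis=0).cumsum(axis=1)

# Sum of every k-by-k block via four table lookups:
# block_sums[u, v] = sum of Image2[u:u+k, v:v+k]
block_sums = S[k:, k:] - S[:-k, k:] - S[k:, :-k] + S[:-k, :-k]

# spot-check against an explicit slice sum
print(np.allclose(block_sums[3, 5], Image2[3:3+k, 5:5+k].sum()))  # True
```

A 2D convolution with a `k`-by-`k` kernel of ones computes the same block sums; the summed-area table just does it with plain cumulative sums and no convolution routine.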