## Hot questions on using gated recurrent units (GRUs) in neural networks


Question:

The following code from TensorFlow's `GRUCell` shows the typical operations used to compute the updated hidden state, given the previous hidden state and the current input in the sequence.

```
def __call__(self, inputs, state, scope=None):
    """Gated recurrent unit (GRU) with nunits cells."""
    with vs.variable_scope(scope or type(self).__name__):  # "GRUCell"
        with vs.variable_scope("Gates"):  # Reset gate and update gate.
            # We start with bias of 1.0 to not reset and not update.
            r, u = array_ops.split(1, 2, _linear([inputs, state],
                                                 2 * self._num_units, True, 1.0))
            r, u = sigmoid(r), sigmoid(u)
        with vs.variable_scope("Candidate"):
            c = self._activation(_linear([inputs, r * state],
                                         self._num_units, True))
        new_h = u * state + (1 - u) * c
    return new_h, new_h
```

But I don't see any weights or biases here. For example, my understanding was that computing `r` and `u` would require weights and biases to be multiplied with the current input and/or hidden state to get the updated hidden state.

I have written a GRU unit as follows:

```
def gru_unit(previous_hidden_state, x):
    r = tf.sigmoid(tf.matmul(x, Wr) + br)
    z = tf.sigmoid(tf.matmul(x, Wz) + bz)
    h_ = tf.tanh(tf.matmul(x, Wx) + tf.matmul(previous_hidden_state, Wh) * r)
    current_hidden_state = tf.mul((1 - z), h_) + tf.mul(previous_hidden_state, z)
    return current_hidden_state
```

Here I explicitly make use of weights `Wx, Wr, Wz, Wh` and biases `br, bz`, etc. to get the updated hidden state. These weights and biases are what get learned/tuned during training.

How can I make use of Tensorflow's built-in `GRUCell` to achieve the same result as above?

Answer:

They are there; you just don't see them in that code because the `_linear` function creates the weight matrix and the bias internally.

```
r, u = array_ops.split(1, 2, _linear([inputs, state],
                                     2 * self._num_units, True, 1.0))
```

...

```
def _linear(args, output_size, bias, bias_start=0.0, scope=None):
    """Linear map: sum_i(args[i] * W[i]), where W[i] is a variable.

    Args:
      args: a 2D Tensor or a list of 2D, batch x n, Tensors.
      output_size: int, second dimension of W[i].
      bias: boolean, whether to add a bias term or not.
      bias_start: starting value to initialize the bias; 0 by default.
      scope: VariableScope for the created subgraph; defaults to "Linear".

    Returns:
      A 2D Tensor with shape [batch x output_size] equal to
      sum_i(args[i] * W[i]), where W[i]s are newly created matrices.

    Raises:
      ValueError: if some of the arguments has unspecified or wrong shape.
    """
    if args is None or (nest.is_sequence(args) and not args):
        raise ValueError("`args` must be specified")
    if not nest.is_sequence(args):
        args = [args]

    # Calculate the total size of arguments on dimension 1.
    total_arg_size = 0
    shapes = [a.get_shape().as_list() for a in args]
    for shape in shapes:
        if len(shape) != 2:
            raise ValueError("Linear is expecting 2D arguments: %s" % str(shapes))
        if not shape[1]:
            raise ValueError("Linear expects shape[1] of arguments: %s" % str(shapes))
        else:
            total_arg_size += shape[1]

    # Now the computation.
    with vs.variable_scope(scope or "Linear"):
        matrix = vs.get_variable("Matrix", [total_arg_size, output_size])
        if len(args) == 1:
            res = math_ops.matmul(args[0], matrix)
        else:
            res = math_ops.matmul(array_ops.concat(1, args), matrix)
        if not bias:
            return res
        bias_term = vs.get_variable(
            "Bias", [output_size],
            initializer=init_ops.constant_initializer(bias_start))
    return res + bias_term
```
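The trick `_linear` relies on can be checked numerically: concatenating `[inputs, state]` and multiplying by one stacked matrix is exactly the same as keeping separate `Wx`, `Wh` matrices as in the hand-written `gru_unit`. A minimal NumPy sketch (all names and sizes here are illustrative, not from the TensorFlow source):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_in, n_units = 2, 3, 4

x = rng.normal(size=(batch, n_in))       # current input
h = rng.normal(size=(batch, n_units))    # previous hidden state

# Separate weight matrices, as in the hand-written gru_unit
Wx = rng.normal(size=(n_in, n_units))
Wh = rng.normal(size=(n_units, n_units))
b = np.ones(n_units)                     # GRUCell's gates start with bias 1.0

# _linear's trick: stack the matrices and concatenate the inputs
W = np.concatenate([Wx, Wh], axis=0)     # shape (n_in + n_units, n_units)
combined = np.concatenate([x, h], axis=1) @ W + b

separate = x @ Wx + h @ Wh + b
assert np.allclose(combined, separate)
```

So `GRUCell` does learn the same kind of weights and biases as the explicit version; they simply live in single variables named `Gates/Linear/Matrix` and `Gates/Linear/Bias` rather than as separate `Wx`, `Wh`, `b` tensors.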

Question:

Based on the LSTM code provided in the official Theano tutorial (http://deeplearning.net/tutorial/code/lstm.py), I changed the LSTM layer code (i.e. the functions `lstm_layer()` and `param_init_lstm()`) to perform a GRU instead.

The provided LSTM code trains well, but not the GRU I coded: the accuracy on the training set with the LSTM goes up to 1 (train cost = 0), while with the GRU it stagnates at 0.7 (train cost = 0.3).

Below is the code I use for the GRU. I kept the same function names as in the tutorial, so that the code can be copy-pasted directly into it. What could explain the poor performance of the GRU?

```
import numpy as np

def param_init_lstm(options, params, prefix='lstm'):
    """
    GRU
    """
    W = np.concatenate([ortho_weight(options['dim_proj']),  # input weights, reset gate
                        ortho_weight(options['dim_proj']),
                        ortho_weight(options['dim_proj'])], # input weights, update gate
                       axis=1)
    params[_p(prefix, 'W')] = W

    U = np.concatenate([ortho_weight(options['dim_proj']),  # recurrent weights, reset gate
                        ortho_weight(options['dim_proj']),
                        ortho_weight(options['dim_proj'])], # recurrent weights, update gate
                       axis=1)
    params[_p(prefix, 'U')] = U

    b = np.zeros((3 * options['dim_proj'],))  # biases for the reset and update gates
    params[_p(prefix, 'b')] = b.astype(config.floatX)
    return params

def lstm_layer(tparams, state_below, options, prefix='lstm', mask=None):
    nsteps = state_below.shape[0]
    if state_below.ndim == 3:
        n_samples = state_below.shape[1]
    else:
        n_samples = 1

    def _slice(_x, n, dim):
        if _x.ndim == 3:
            return _x[:, :, n * dim:(n + 1) * dim]
        return _x[:, n * dim:(n + 1) * dim]

    def _step(m_, x_, h_):
        preact = tensor.dot(h_, tparams[_p(prefix, 'U')])
        preact += x_

        r = tensor.nnet.sigmoid(_slice(preact, 0, options['dim_proj']))  # reset gate
        u = tensor.nnet.sigmoid(_slice(preact, 1, options['dim_proj']))  # update gate

        U_h_t = _slice(tparams[_p(prefix, 'U')], 2, options['dim_proj'])
        x_h_t = _slice(x_, 2, options['dim_proj'])

        h_t_temp = tensor.tanh(tensor.dot(r * h_, U_h_t) + x_h_t)
        h = (1. - u) * h_ + u * h_t_temp
        h = m_[:, None] * h + (1. - m_)[:, None] * h_

        return h

    state_below = (tensor.dot(state_below, tparams[_p(prefix, 'W')]) +
                   tparams[_p(prefix, 'b')])

    dim_proj = options['dim_proj']
    rval, updates = theano.scan(_step,
                                sequences=[mask, state_below],
                                outputs_info=[tensor.alloc(numpy_floatX(0.),
                                                           n_samples,
                                                           dim_proj)],
                                name=_p(prefix, '_layers'),
                                n_steps=nsteps)

    return rval[0]
```

Answer:

The issue comes from the last line, `return rval[0]`: it should instead be `return rval`.

The LSTM code provided in the official Theano tutorial (http://deeplearning.net/tutorial/code/lstm.py) uses `return rval[0]` because its `outputs_info` contains 2 elements:

```
rval, updates = theano.scan(_step,
                            sequences=[mask, state_below],
                            outputs_info=[tensor.alloc(numpy_floatX(0.),
                                                       n_samples,
                                                       dim_proj),
                                          tensor.alloc(numpy_floatX(0.),
                                                       n_samples,
                                                       dim_proj)],
                            name=_p(prefix, '_layers'),
                            n_steps=nsteps)
return rval[0]
```

In the GRU, `outputs_info` contains just one element:

```
outputs_info=[tensor.alloc(numpy_floatX(0.),
                           n_samples,
                           dim_proj)],
```

and despite the brackets, `scan` won't return a list of Theano variables representing the outputs, but directly a single Theano variable.

`rval` is then fed to a pooling layer (in this case, a mean pooling layer). Since in the GRU code `rval` is a Theano variable and not a list of Theano variables, `rval[0]` no longer selects the first output of `scan` but the first timestep of the hidden-state sequence, which means you tried to perform the sentence classification using just the first word.
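The bug is easy to reproduce outside Theano: `scan` stacks its outputs along the time axis, so `rval` has shape `(nsteps, n_samples, dim)`, and `rval[0]` silently has the same shape as the pooled result. A NumPy illustration (the shapes below are arbitrary stand-ins):

```python
import numpy as np

nsteps, n_samples, dim = 5, 2, 3
# stand-in for scan's stacked hidden states, shape (nsteps, n_samples, dim)
rval = np.arange(nsteps * n_samples * dim, dtype=float).reshape(nsteps, n_samples, dim)

# Correct: mean-pool the hidden states over all timesteps
pooled_all = rval.mean(axis=0)   # shape (n_samples, dim)

# Buggy: rval[0] is only the first timestep's hidden state
pooled_first = rval[0]           # shape (n_samples, dim)

assert pooled_all.shape == pooled_first.shape == (n_samples, dim)
assert not np.allclose(pooled_all, pooled_first)
```

Because both tensors have the same shape, the rest of the network runs without error; the model simply never sees words past the first one, which matches the accuracy stagnating around 0.7.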

Another GRU implementation that can be plugged into the LSTM tutorial:

```
# weight initializer, normal by default
def norm_weight(nin, nout=None, scale=0.01, ortho=True):
    if nout is None:
        nout = nin
    if nout == nin and ortho:
        W = ortho_weight(nin)
    else:
        W = scale * numpy.random.randn(nin, nout)
    return W.astype('float32')

def param_init_lstm(options, params, prefix='lstm'):
    """
    GRU. Source: https://github.com/kyunghyuncho/dl4mt-material/blob/master/session0/lm.py
    """
    nin = options['dim_proj']
    dim = options['dim_proj']
    # embedding to gates transformation weights, biases
    W = numpy.concatenate([norm_weight(nin, dim),
                           norm_weight(nin, dim)], axis=1)
    params[_p(prefix, 'W')] = W
    params[_p(prefix, 'b')] = numpy.zeros((2 * dim,)).astype('float32')

    # recurrent transformation weights for gates
    U = numpy.concatenate([ortho_weight(dim),
                           ortho_weight(dim)], axis=1)
    params[_p(prefix, 'U')] = U

    # embedding to hidden state proposal weights, biases
    Wx = norm_weight(nin, dim)
    params[_p(prefix, 'Wx')] = Wx
    params[_p(prefix, 'bx')] = numpy.zeros((dim,)).astype('float32')

    # recurrent transformation weights for hidden state proposal
    Ux = ortho_weight(dim)
    params[_p(prefix, 'Ux')] = Ux
    return params

def lstm_layer(tparams, state_below, options, prefix='lstm', mask=None):

    nsteps = state_below.shape[0]

    if state_below.ndim == 3:
        n_samples = state_below.shape[1]
    else:
        n_samples = state_below.shape[0]

    dim = tparams[_p(prefix, 'Ux')].shape[1]

    if mask is None:
        mask = tensor.alloc(1., state_below.shape[0], 1)

    # utility function to slice a tensor
    def _slice(_x, n, dim):
        if _x.ndim == 3:
            return _x[:, :, n*dim:(n+1)*dim]
        return _x[:, n*dim:(n+1)*dim]

    # state_below is the input word embeddings
    # input to the gates, concatenated
    state_below_ = tensor.dot(state_below, tparams[_p(prefix, 'W')]) + \
        tparams[_p(prefix, 'b')]
    # input to compute the hidden state proposal
    state_belowx = tensor.dot(state_below, tparams[_p(prefix, 'Wx')]) + \
        tparams[_p(prefix, 'bx')]

    # step function to be used by scan
    # arguments    | sequences  | outputs-info | non-seqs
    def _step_slice(m_, x_, xx_,  h_,            U, Ux):
        preact = tensor.dot(h_, U)
        preact += x_

        # reset and update gates
        r = tensor.nnet.sigmoid(_slice(preact, 0, dim))
        u = tensor.nnet.sigmoid(_slice(preact, 1, dim))

        # compute the hidden state proposal
        preactx = tensor.dot(h_, Ux)
        preactx = preactx * r
        preactx = preactx + xx_

        # hidden state proposal
        h = tensor.tanh(preactx)

        # leaky integrate and obtain next hidden state
        h = u * h_ + (1. - u) * h
        h = m_[:, None] * h + (1. - m_)[:, None] * h_

        return h

    # prepare scan arguments
    seqs = [mask, state_below_, state_belowx]
    _step = _step_slice
    shared_vars = [tparams[_p(prefix, 'U')],
                   tparams[_p(prefix, 'Ux')]]

    init_state = tensor.unbroadcast(tensor.alloc(0., n_samples, dim), 0)

    rval, updates = theano.scan(_step,
                                sequences=seqs,
                                outputs_info=[init_state],
                                non_sequences=shared_vars,
                                name=_p(prefix, '_layers'),
                                n_steps=nsteps,
                                strict=True)
    return rval
```
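What `theano.scan` does in `lstm_layer` can be emulated with a plain NumPy loop; in particular, the `m_[:, None]` line carries the previous hidden state forward wherever the mask is 0 (padding). A sketch of the step loop, with random arrays standing in for the trained parameters and precomputed inputs (all sizes and names here are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
nsteps, n_samples, dim = 4, 2, 3

x_gates = rng.normal(size=(nsteps, n_samples, 2 * dim))  # plays state_below_ (W x + b)
x_prop = rng.normal(size=(nsteps, n_samples, dim))       # plays state_belowx (Wx x + bx)
mask = np.array([[1, 1], [1, 1], [1, 0], [1, 0]], float)  # second sequence is shorter
U = rng.normal(size=(dim, 2 * dim))
Ux = rng.normal(size=(dim, dim))

h = np.zeros((n_samples, dim))
states = []
for t in range(nsteps):
    preact = h @ U + x_gates[t]
    r = sigmoid(preact[:, :dim])                 # reset gate
    u = sigmoid(preact[:, dim:])                 # update gate
    h_prop = np.tanh((h @ Ux) * r + x_prop[t])   # hidden state proposal
    h_new = u * h + (1.0 - u) * h_prop
    # masked (padded) positions keep their previous hidden state
    h = mask[t][:, None] * h_new + (1.0 - mask[t])[:, None] * h
    states.append(h)
```

Running this, the second sample's state stops changing once its mask turns 0 at step 2, while the first sample keeps updating; that is exactly the behavior the `m_[:, None]` line provides for variable-length minibatches.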

As a side note, Keras fixed this issue as follows:

```
results, _ = theano.scan(
    _step,
    sequences=inputs,
    outputs_info=[None] + initial_states,
    go_backwards=go_backwards)

# deal with Theano API inconsistency
if type(results) is list:
    outputs = results[0]
    states = results[1:]
else:
    outputs = results
    states = []
```

Question:

Could somebody explain the similarities and differences between the Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures? I know the definitions of each, and that GRUs lack an output gate and therefore have fewer parameters. Could somebody please give an intuitive explanation or analogy?

Answer:

I'm not very familiar with the GRU architecture. However, this paper compares the LSTM and GRU architectures; I think it's exactly what you need:

https://arxiv.org/pdf/1412.3555v1.pdf
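One concrete difference, following from the question's observation that the GRU has fewer gates: under the standard formulations, an LSTM layer has four weight blocks (input, forget, and output gates plus the cell candidate) while a GRU has three (reset and update gates plus the hidden-state candidate), so for input size m and hidden size n the parameter counts are 4(n(n+m)+n) versus 3(n(n+m)+n). A quick arithmetic check (sizes chosen arbitrarily):

```python
def lstm_params(m, n):
    # input, forget, output gates + cell candidate:
    # each block has n*(n+m) weights and n biases
    return 4 * (n * (n + m) + n)

def gru_params(m, n):
    # reset, update gates + hidden-state candidate
    return 3 * (n * (n + m) + n)

m, n = 128, 256
print(lstm_params(m, n), gru_params(m, n))  # 394240 295680
```

For the same input and hidden sizes, a GRU layer is thus 25% smaller than the corresponding LSTM layer.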

Question:

I know that applying `TimeDistributed(Dense)` applies the same dense layer over all timesteps, but I want to know how to apply a different dense layer to each timestep. The number of timesteps is not variable.

P.S.: I have seen the following link and can't seem to find an answer

Answer:

You can use a `LocallyConnected1D` layer.

The locally connected layer works as a Dense layer connected to each window of `kernel_size` timesteps (1 in this case).

```
from tensorflow import keras
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model

sequence_length = 10
n_features = 4

def make_model():
    inp = Input((sequence_length, n_features))
    h1 = LocallyConnected1D(8, 1, 1)(inp)
    out = Flatten()(h1)
    model = Model(inp, out)
    model.compile('adam', 'mse')
    return model

model = make_model()
model.summary()
```

Per the summary, the number of variables used by the LocallyConnected layer is `(output_dims * (input_dims + bias)) * time_steps`, or `(8 * (4 + 1)) * 10 = 400`.

Wording it another way: the locally connected layer above behaves as 10 different Dense layers, each connected to its own timestep (because we chose `kernel_size` as 1). Each of these blocks of 40 variables is a weight matrix of shape (input_dims, output_dims) plus a bias vector of size (output_dims).
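The per-block and total counts can be checked by hand, since each timestep owns its own (input_dims × output_dims) kernel plus a bias of size output_dims:

```python
sequence_length, n_features, output_dims = 10, 4, 8

per_step = n_features * output_dims + output_dims  # 32 weights + 8 biases = 40
total = per_step * sequence_length

print(per_step, total)  # 40 400
```

This matches the 400 trainable parameters reported by `model.summary()` for the layer above.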

Also note that given an input_shape of (sequence_len, n_features), `Dense(output_dims)` and `Conv1D(output_dims, 1, 1)` are equivalent.

i.e. this model:

```
def make_model():
    inp = Input((sequence_length, n_features))
    h1 = Conv1D(8, 1, 1)(inp)
    out = Flatten()(h1)
    model = Model(inp, out)
```

and this model:

```
def make_model():
    inp = Input((sequence_length, n_features))
    h1 = Dense(8)(inp)
    out = Flatten()(h1)
    model = Model(inp, out)
```

are the same.
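The Dense/Conv1D equivalence claimed above can be checked with NumPy: a kernel-size-1 convolution applies the same (n_features × output_dims) matrix at every timestep, which is exactly what Dense does when applied to a 3D input. A sketch with a single sample (shapes chosen to match the models above):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, n_feat, out_dims = 10, 4, 8

x = rng.normal(size=(seq_len, n_feat))   # one sample: (timesteps, features)
W = rng.normal(size=(n_feat, out_dims))  # shared kernel
b = rng.normal(size=(out_dims,))         # shared bias

dense = x @ W + b  # Dense on a 3D input: matrix applied per timestep
conv = np.stack([x[t] @ W + b for t in range(seq_len)])  # 1-wide conv window

assert np.allclose(dense, conv)
```

This shared-kernel behavior is precisely what distinguishes both layers from `LocallyConnected1D`, which learns a separate `W` and `b` per timestep.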