## Hot questions on using neural networks with gated recurrent units

Question:

The following code from TensorFlow's `GRUCell` shows the typical operations used to compute an updated hidden state, given the previous hidden state and the current input in the sequence:

```python
def __call__(self, inputs, state, scope=None):
  """Gated recurrent unit (GRU) with nunits cells."""
  with vs.variable_scope(scope or type(self).__name__):  # "GRUCell"
    with vs.variable_scope("Gates"):  # Reset gate and update gate.
      # We start with bias of 1.0 to not reset and not update.
      r, u = array_ops.split(1, 2, _linear([inputs, state],
                                           2 * self._num_units, True, 1.0))
      r, u = sigmoid(r), sigmoid(u)
    with vs.variable_scope("Candidate"):
      c = self._activation(_linear([inputs, r * state], self._num_units, True))
    new_h = u * state + (1 - u) * c
  return new_h, new_h
```

But I don't see any `weights` or `biases` here. For example, my understanding was that computing `r` and `u` requires weights and biases to be multiplied with the current input and/or hidden state in order to get an updated hidden state.

I have written a gru unit as follows:

```python
def gru_unit(previous_hidden_state, x):
    r = tf.sigmoid(tf.matmul(x, Wr) + br)
    z = tf.sigmoid(tf.matmul(x, Wz) + bz)
    h_ = tf.tanh(tf.matmul(x, Wx) + tf.matmul(previous_hidden_state, Wh) * r)
    current_hidden_state = tf.mul((1 - z), h_) + tf.mul(previous_hidden_state, z)
    return current_hidden_state
```

Here I explicitly make use of the weights `Wr, Wz, Wx, Wh` and biases `br, bz`, etc., to get the updated hidden state. These weights and biases are what get learned/tuned during training.

How can I make use of TensorFlow's built-in `GRUCell` to achieve the same result as above?

Answer:

They are there; you just don't see them in that code because the `_linear` function creates and applies the weights and biases.

```python
r, u = array_ops.split(1, 2, _linear([inputs, state], 2 * self._num_units, True, 1.0))
```

...

```python
def _linear(args, output_size, bias, bias_start=0.0, scope=None):
  """Linear map: sum_i(args[i] * W[i]), where W[i] is a variable.

  Args:
    args: a 2D Tensor or a list of 2D, batch x n, Tensors.
    output_size: int, second dimension of W[i].
    bias: boolean, whether to add a bias term or not.
    bias_start: starting value to initialize the bias; 0 by default.
    scope: VariableScope for the created subgraph; defaults to "Linear".

  Returns:
    A 2D Tensor with shape [batch x output_size] equal to
    sum_i(args[i] * W[i]), where W[i]s are newly created matrices.

  Raises:
    ValueError: if some of the arguments has unspecified or wrong shape.
  """
  if args is None or (nest.is_sequence(args) and not args):
    raise ValueError("`args` must be specified")
  if not nest.is_sequence(args):
    args = [args]

  # Calculate the total size of arguments on dimension 1.
  total_arg_size = 0
  shapes = [a.get_shape().as_list() for a in args]
  for shape in shapes:
    if len(shape) != 2:
      raise ValueError("Linear is expecting 2D arguments: %s" % str(shapes))
    if not shape[1]:
      raise ValueError("Linear expects shape[1] of arguments: %s" % str(shapes))
    else:
      total_arg_size += shape[1]

  # Now the computation.
  with vs.variable_scope(scope or "Linear"):
    matrix = vs.get_variable("Matrix", [total_arg_size, output_size])
    if len(args) == 1:
      res = math_ops.matmul(args[0], matrix)
    else:
      res = math_ops.matmul(array_ops.concat(1, args), matrix)
    if not bias:
      return res
    bias_term = vs.get_variable(
        "Bias", [output_size],
        initializer=init_ops.constant_initializer(bias_start))
  return res + bias_term
```
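To see how this corresponds to the explicit per-gate weights from the question, here is a minimal NumPy sketch (made-up shapes and weights, nothing TensorFlow-specific): the single `Matrix` variable that `_linear` creates for `[inputs, state]` plays the role of the separate input and recurrent matrices of both gates stacked into one block, and the split afterwards recovers the per-gate pre-activations.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_input, n_units = 2, 4, 3

x = rng.standard_normal((batch, n_input))   # current input
h = rng.standard_normal((batch, n_units))   # previous hidden state

# Explicit per-gate weights, as in the question's gru_unit
# (Ur, Uz are the recurrent counterparts of Wr, Wz).
Wr = rng.standard_normal((n_input, n_units))
Wz = rng.standard_normal((n_input, n_units))
Ur = rng.standard_normal((n_units, n_units))
Uz = rng.standard_normal((n_units, n_units))

# The equivalent single matrix that _linear would learn:
# rows = input dims then state dims, cols = both gates side by side.
M = np.block([[Wr, Wz],
              [Ur, Uz]])                    # (n_input + n_units, 2 * n_units)

pre = np.concatenate([x, h], axis=1) @ M    # one matmul for both gates
r_pre, z_pre = np.split(pre, 2, axis=1)     # what array_ops.split does

assert np.allclose(r_pre, x @ Wr + h @ Ur)  # reset-gate pre-activation
assert np.allclose(z_pre, x @ Wz + h @ Uz)  # update-gate pre-activation
```

So the per-gate weights and biases still exist; they are just stored as one fused variable per scope, which is why they don't appear as named tensors in `GRUCell` itself.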

Question:

Based on the LSTM code provided in the official Theano tutorial (http://deeplearning.net/tutorial/code/lstm.py), I changed the LSTM layer code (i.e. the functions `lstm_layer()` and `param_init_lstm()`) to implement a GRU instead.

The provided LSTM code trains well, but not the GRU I coded: the accuracy on the training set with the LSTM goes up to 1 (train cost = 0), while with the GRU it stagnates at 0.7 (train cost = 0.3).

Below is the code I use for the GRU. I kept the same function names as in the tutorial, so that one can copy-paste the code directly into it. What could explain the poor performance of the GRU?

```python
import numpy as np

def param_init_lstm(options, params, prefix='lstm'):
    """ GRU """
    W = np.concatenate([ortho_weight(options['dim_proj']),   # weight matrix for the input in the reset gate
                        ortho_weight(options['dim_proj']),
                        ortho_weight(options['dim_proj'])],  # weight matrix for the input in the update gate
                       axis=1)
    params[_p(prefix, 'W')] = W

    U = np.concatenate([ortho_weight(options['dim_proj']),   # weight matrix for the previous hidden state in the reset gate
                        ortho_weight(options['dim_proj']),
                        ortho_weight(options['dim_proj'])],  # weight matrix for the previous hidden state in the update gate
                       axis=1)
    params[_p(prefix, 'U')] = U

    b = np.zeros((3 * options['dim_proj'],))  # biases for the reset gate and the update gate
    params[_p(prefix, 'b')] = b.astype(config.floatX)
    return params

def lstm_layer(tparams, state_below, options, prefix='lstm', mask=None):
    nsteps = state_below.shape[0]
    if state_below.ndim == 3:
        n_samples = state_below.shape[1]
    else:
        n_samples = 1

    def _slice(_x, n, dim):
        if _x.ndim == 3:
            return _x[:, :, n * dim:(n + 1) * dim]
        return _x[:, n * dim:(n + 1) * dim]

    def _step(m_, x_, h_):
        preact = tensor.dot(h_, tparams[_p(prefix, 'U')])
        preact += x_

        r = tensor.nnet.sigmoid(_slice(preact, 0, options['dim_proj']))  # reset gate
        u = tensor.nnet.sigmoid(_slice(preact, 1, options['dim_proj']))  # update gate

        U_h_t = _slice(tparams[_p(prefix, 'U')], 2, options['dim_proj'])
        x_h_t = _slice(x_, 2, options['dim_proj'])

        h_t_temp = tensor.tanh(tensor.dot(r * h_, U_h_t) + x_h_t)
        h = (1. - u) * h_ + u * h_t_temp
        h = m_[:, None] * h + (1. - m_)[:, None] * h_
        return h

    state_below = (tensor.dot(state_below, tparams[_p(prefix, 'W')]) +
                   tparams[_p(prefix, 'b')])

    dim_proj = options['dim_proj']
    rval, updates = theano.scan(_step,
                                sequences=[mask, state_below],
                                outputs_info=[tensor.alloc(numpy_floatX(0.),
                                                           n_samples,
                                                           dim_proj)],
                                name=_p(prefix, '_layers'),
                                n_steps=nsteps)
    return rval[0]
```

Answer:

The issue comes from the last line, `return rval[0]`: it should instead be `return rval`.

The LSTM code provided in the official Theano tutorial (http://deeplearning.net/tutorial/code/lstm.py) uses `return rval[0]` because its `outputs_info` contains two elements (the hidden state and the cell state):

```python
rval, updates = theano.scan(_step,
                            sequences=[mask, state_below],
                            outputs_info=[tensor.alloc(numpy_floatX(0.),
                                                       n_samples, dim_proj),
                                          tensor.alloc(numpy_floatX(0.),
                                                       n_samples, dim_proj)],
                            name=_p(prefix, '_layers'),
                            n_steps=nsteps)
return rval[0]
```

In the GRU, `outputs_info` contains just one element:

```python
outputs_info=[tensor.alloc(numpy_floatX(0.), n_samples, dim_proj)],
```

and despite the brackets, `theano.scan` then returns not a list of Theano variables representing the outputs of the scan, but a single Theano variable directly.

The `rval` is then fed to a pooling layer (in this case, a mean pooling layer over all timesteps). Since in the GRU code `rval` is a single Theano variable rather than a list of Theano variables, taking `rval[0]` keeps only the hidden state of the first timestep and discards the rest of the sequence, which means you tried to perform the sentence classification using just the first word.
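This API inconsistency can be mimicked in a few lines of plain Python (a hypothetical `toy_scan`, not the real `theano.scan`, with plain numbers instead of tensors): with two recurrent outputs the result is a list of per-output sequences, while with one recurrent output it is the sequence itself, so indexing with `[0]` means two different things.

```python
def toy_scan(step, sequences, outputs_info):
    """Apply `step` over `sequences`, threading recurrent state like scan."""
    states = list(outputs_info)
    history = [[] for _ in states]          # one output sequence per state
    for x in sequences:
        out = step(x, *states)
        states = list(out) if isinstance(out, tuple) else [out]
        for hist, s in zip(history, states):
            hist.append(s)
    # The inconsistency: one output -> bare sequence, several -> list of them.
    return history[0] if len(history) == 1 else history

# LSTM-like step: two recurrent outputs (hidden state, cell state).
lstm_rval = toy_scan(lambda x, h, c: (h + x, c + 2 * x),
                     sequences=[1, 2, 3], outputs_info=[0, 0])
assert lstm_rval[0] == [1, 3, 6]   # rval[0] is the whole hidden-state sequence

# GRU-like step: a single recurrent output.
gru_rval = toy_scan(lambda x, h: h + x,
                    sequences=[1, 2, 3], outputs_info=[0])
assert gru_rval == [1, 3, 6]       # rval IS the sequence...
assert gru_rval[0] == 1            # ...so rval[0] is only the first step
```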

Another GRU implementation that can be plugged into the LSTM tutorial:

```python
import numpy

# weight initializer, normal by default
def norm_weight(nin, nout=None, scale=0.01, ortho=True):
    if nout is None:
        nout = nin
    if nout == nin and ortho:
        W = ortho_weight(nin)
    else:
        W = scale * numpy.random.randn(nin, nout)
    return W.astype('float32')

def param_init_lstm(options, params, prefix='lstm'):
    """
    GRU. Source: https://github.com/kyunghyuncho/dl4mt-material/blob/master/session0/lm.py
    """
    nin = options['dim_proj']
    dim = options['dim_proj']

    # embedding to gates transformation weights, biases
    W = numpy.concatenate([norm_weight(nin, dim),
                           norm_weight(nin, dim)], axis=1)
    params[_p(prefix, 'W')] = W
    params[_p(prefix, 'b')] = numpy.zeros((2 * dim,)).astype('float32')

    # recurrent transformation weights for gates
    U = numpy.concatenate([ortho_weight(dim),
                           ortho_weight(dim)], axis=1)
    params[_p(prefix, 'U')] = U

    # embedding to hidden state proposal weights, biases
    Wx = norm_weight(nin, dim)
    params[_p(prefix, 'Wx')] = Wx
    params[_p(prefix, 'bx')] = numpy.zeros((dim,)).astype('float32')

    # recurrent transformation weights for hidden state proposal
    Ux = ortho_weight(dim)
    params[_p(prefix, 'Ux')] = Ux

    return params

def lstm_layer(tparams, state_below, options, prefix='lstm', mask=None):
    nsteps = state_below.shape[0]
    if state_below.ndim == 3:
        n_samples = state_below.shape[1]
    else:
        n_samples = state_below.shape[0]

    dim = tparams[_p(prefix, 'Ux')].shape[1]

    if mask is None:
        mask = tensor.alloc(1., state_below.shape[0], 1)

    # utility function to slice a tensor
    def _slice(_x, n, dim):
        if _x.ndim == 3:
            return _x[:, :, n * dim:(n + 1) * dim]
        return _x[:, n * dim:(n + 1) * dim]

    # state_below is the input word embeddings
    # input to the gates, concatenated
    state_below_ = tensor.dot(state_below, tparams[_p(prefix, 'W')]) + \
        tparams[_p(prefix, 'b')]
    # input to compute the hidden state proposal
    state_belowx = tensor.dot(state_below, tparams[_p(prefix, 'Wx')]) + \
        tparams[_p(prefix, 'bx')]

    # step function to be used by scan
    # arguments    | sequences  |outputs-info| non-seqs
    def _step_slice(m_, x_, xx_, h_,          U, Ux):
        preact = tensor.dot(h_, U)
        preact += x_

        # reset and update gates
        r = tensor.nnet.sigmoid(_slice(preact, 0, dim))
        u = tensor.nnet.sigmoid(_slice(preact, 1, dim))

        # compute the hidden state proposal
        preactx = tensor.dot(h_, Ux)
        preactx = preactx * r
        preactx = preactx + xx_

        # hidden state proposal
        h = tensor.tanh(preactx)

        # leaky integrate and obtain next hidden state
        h = u * h_ + (1. - u) * h
        h = m_[:, None] * h + (1. - m_)[:, None] * h_

        return h

    # prepare scan arguments
    seqs = [mask, state_below_, state_belowx]
    _step = _step_slice
    shared_vars = [tparams[_p(prefix, 'U')],
                   tparams[_p(prefix, 'Ux')]]

    init_state = tensor.unbroadcast(tensor.alloc(0., n_samples, dim), 0)

    rval, updates = theano.scan(_step,
                                sequences=seqs,
                                outputs_info=[init_state],
                                non_sequences=shared_vars,
                                name=_p(prefix, '_layers'),
                                n_steps=nsteps,
                                strict=True)
    return rval
```

As a side note, Keras works around this Theano API inconsistency as follows:

```python
results, _ = theano.scan(
    _step,
    sequences=inputs,
    outputs_info=[None] + initial_states,
    go_backwards=go_backwards)

# deal with Theano API inconsistency
if type(results) is list:
    outputs = results[0]
    states = results[1:]
else:
    outputs = results
    states = []
```

Question:

Could somebody explain the similarities and dissimilarities between the Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures? I know the definitions of each, and that a GRU lacks an output gate and therefore has fewer parameters. Could somebody please give an intuitive explanation or analogy?

Answer:

I'm not familiar with the GRU architecture myself. However, this paper compares the LSTM and GRU architectures, and I think it's exactly what you need.
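For reference, writing out the standard update equations makes the structural difference concrete (this is the common textbook formulation; bias terms are omitted, and some libraries swap which side of the update gate multiplies the old state):

```latex
\begin{aligned}
\textbf{LSTM:}\quad
& i_t = \sigma(W_i x_t + U_i h_{t-1}), \qquad
  f_t = \sigma(W_f x_t + U_f h_{t-1}), \qquad
  o_t = \sigma(W_o x_t + U_o h_{t-1}) \\
& c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1}), \qquad
  h_t = o_t \odot \tanh(c_t) \\[6pt]
\textbf{GRU:}\quad
& r_t = \sigma(W_r x_t + U_r h_{t-1}), \qquad
  z_t = \sigma(W_z x_t + U_z h_{t-1}) \\
& \tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1})), \qquad
  h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

Intuitively, the GRU merges the LSTM's input and forget gates into the single update gate $z_t$, drops the separate cell state $c_t$ and the output gate $o_t$, and exposes its entire state $h_t$ directly; that is where the parameter savings come from.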

Question:

I know that applying `TimeDistributed(Dense)` applies the same Dense layer over all the timesteps, but I wanted to know how to apply a different Dense layer for each timestep. The number of timesteps is not variable.

P.S.: I have seen the following link and can't seem to find an answer

Answer:

You can use a LocallyConnected layer.

The LocallyConnected layer works as a separate Dense layer connected to each window of `kernel_size` timesteps (1 in this case).

```python
from tensorflow import keras
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model

sequence_length = 10
n_features = 4

def make_model():
    inp = Input((sequence_length, n_features))
    h1 = LocallyConnected1D(8, 1, 1)(inp)
    out = Flatten()(h1)

    model = Model(inp, out)
    model.compile('adam', 'mse')
    return model

model = make_model()
model.summary()
```

Per the summary, the number of variables used by the LocallyConnected layer is `(output_dims * (input_dims + bias)) * time_steps`, or (8 * (4 + 1)) * 10 = 400.

Wording it another way: the locally connected layer above behaves as 10 different Dense layers, each connected to its own timestep (because we chose a kernel_size of 1). Each of these blocks of 40 variables is a weights matrix of shape (input_dims, output_dims) plus a bias vector of size (output_dims).
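A NumPy sketch of the two cases (hypothetical, randomly initialized weights; shapes as in the model above) may make the distinction concrete: `Dense` applied over timesteps shares one weight matrix, while `LocallyConnected1D` with kernel size 1 keeps one weight matrix per timestep, and the shared case is just the local case with all per-timestep weights tied.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, n_feat, n_out = 10, 4, 8
x = rng.standard_normal((seq_len, n_feat))          # one sample, 10 timesteps

# Dense / Conv1D(kernel_size=1): ONE weight matrix shared by every timestep.
W = rng.standard_normal((n_feat, n_out))
b = rng.standard_normal(n_out)
dense_out = x @ W + b                               # (10, 8); 4*8 + 8 = 40 params

# LocallyConnected1D(kernel_size=1): a DIFFERENT weight matrix per timestep.
W_t = rng.standard_normal((seq_len, n_feat, n_out))
b_t = rng.standard_normal((seq_len, n_out))
local_out = np.einsum('tf,tfo->to', x, W_t) + b_t   # (10, 8); 40 * 10 = 400 params

# Tying all per-timestep weights to the same matrix recovers the Dense case.
tied = np.einsum('tf,tfo->to', x, np.broadcast_to(W, W_t.shape)) + b
assert np.allclose(tied, dense_out)
assert local_out.shape == dense_out.shape == (seq_len, n_out)
```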

Also note that, given an input shape of (sequence_len, n_features), `Dense(output_dims)` and `Conv1D(output_dims, 1, 1)` are equivalent.

i.e. this model:

```python
def make_model():
    inp = Input((sequence_length, n_features))
    h1 = Conv1D(8, 1, 1)(inp)
    out = Flatten()(h1)
    model = Model(inp, out)
```

and this model:

```python
def make_model():
    inp = Input((sequence_length, n_features))
    h1 = Dense(8)(inp)
    out = Flatten()(h1)
    model = Model(inp, out)
```

are the same.