why set return_sequences=True and stateful=True for tf.keras.layers.LSTM?

I am learning TensorFlow 2.0 and following the tutorial. In the RNN example, I found this code:

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, 
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.LSTM(rnn_units, 
                        return_sequences=True, 
                        stateful=True, 
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model

My question is: why does the code set the arguments return_sequences=True and stateful=True? What would happen with the default arguments?

The example in the tutorial is about text generation. For each batch, the model produces predictions with this shape:

(64, 100, 65) # (batch_size, sequence_length, vocab_size)

  1. return_sequences=True

Since the intention is to predict a character at every time step, i.e. for every character in the input sequence, the next character needs to be predicted.

So return_sequences=True is set in order to get an output of shape (64, 100, 65). If this argument were set to False, only the last output would be returned, so for a batch of 64 the output would be (64, 65), i.e. for every sequence of 100 characters, only the prediction for the last character would be returned. Both arguments are illustrated in the short sketch after the next point.

  2. stateful=True

From the documentation, "If True, the last state for each sample at index i in a batch will be used as initial state for the sample of index i in the following batch."

In the diagram from the tutorial (not reproduced here), you can see that setting stateful=True helps the LSTM make better predictions by carrying the context over from the previous batch.
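
To make both points concrete, here is a minimal sketch. It assumes the tutorial's hyperparameters (vocab_size=65, embedding_dim=256, rnn_units=1024, batch_size=64) and reuses the build_model function from the question; the dummy batch of character ids is made up for illustration.

import numpy as np
import tensorflow as tf

# Assumed tutorial values; build_model is the function quoted in the question.
model = build_model(vocab_size=65, embedding_dim=256, rnn_units=1024, batch_size=64)

dummy_batch = np.random.randint(0, 65, size=(64, 100))  # made-up character ids
print(model(dummy_batch).shape)   # (64, 100, 65): one prediction per time step
                                  # with return_sequences=False it would be (64, 65)

# Because stateful=True, the final LSTM states of one call seed the next call,
# so consecutive batches are treated as continuations of the same text.
out_1 = model(dummy_batch)
out_2 = model(dummy_batch)        # same input, but different initial states
print(np.allclose(out_1, out_2))  # False

model.reset_states()              # clears the carried-over states, e.g. before
                                  # starting a fresh text-generation run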

Return Sequences

Let's look at typical model architectures built using LSTMs.

Sequence to sequence models:

We feed in a sequence of inputs (x's), one batch at a time, and each LSTM cell returns an output (y_i). So if your input has shape batch_size x time_steps x input_size, the LSTM output will have shape batch_size x time_steps x output_size. This is called a sequence-to-sequence model because an input sequence is converted into an output sequence. Typical uses of this model are taggers (POS tagger, NER tagger). In Keras this is achieved by setting return_sequences=True.
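
For instance, a hypothetical word-level tagger could look like the sketch below; the vocabulary and tag-set sizes are made up, but the point is that return_sequences=True yields one tag distribution per time step.

import tensorflow as tf

num_words, num_tags = 10000, 17  # made-up vocabulary and tag-set sizes
tagger = tf.keras.Sequential([
    tf.keras.layers.Embedding(num_words, 64),
    tf.keras.layers.LSTM(128, return_sequences=True),   # one output per time step
    tf.keras.layers.Dense(num_tags, activation="softmax"),
])

dummy_sentences = tf.random.uniform((32, 20), maxval=num_words, dtype=tf.int32)
print(tagger(dummy_sentences).shape)  # (32, 20, 17): a tag distribution per token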

Sequence classification - Many to one Architecture

In a many-to-one architecture we use the output of only the last LSTM cell. This kind of architecture is normally used for classification problems, such as predicting whether a movie review (represented as a sequence of words) is positive or negative. In Keras, if we set return_sequences=False, the model returns the output of only the last LSTM cell.
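
A hypothetical sentiment classifier following this many-to-one pattern could look like this; the sizes are made up, and return_sequences is left at its default of False so only the last output feeds the classifier head.

import tensorflow as tf

classifier = tf.keras.Sequential([
    tf.keras.layers.Embedding(10000, 64),
    tf.keras.layers.LSTM(128),                       # return_sequences=False (default)
    tf.keras.layers.Dense(1, activation="sigmoid"),  # one score per review
])

dummy_reviews = tf.random.uniform((32, 200), maxval=10000, dtype=tf.int32)
print(classifier(dummy_reviews).shape)  # (32, 1): one prediction per sequence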

Stateful

An LSTM cell is composed of many gates, as shown in the figure from this blog post. The states/gates of the previous cell are used to calculate the state of the current cell. In Keras, if stateful=False, the states are reset after each batch. If stateful=True, the state from the previous batch for index i will be used as the initial state for index i in the next batch, so state information gets propagated between batches. Check this link for an explanation of the usefulness of statefulness, with an example.
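
The sketch below (toy shapes, not from the linked post) shows the bookkeeping that stateful=True implies: the layer keeps one state per sample index, consecutive calls continue from those states, and reset_states() zeroes them again.

import numpy as np
import tensorflow as tf

lstm = tf.keras.layers.LSTM(4, stateful=True)    # states persist across calls
batch_1 = np.random.rand(2, 5, 3).astype(np.float32)
batch_2 = np.random.rand(2, 5, 3).astype(np.float32)

lstm(batch_1)
print(lstm.states[0].shape)   # (2, 4): one hidden state per sample index
lstm(batch_2)                 # sample i of batch_2 starts from sample i's state
lstm.reset_states()           # back to zero states, e.g. when a new text begins
print(np.allclose(lstm.states[0].numpy(), 0.0))  # True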

Let's see the differences when playing around with the arguments:

import numpy as np
import tensorflow as tf

tf.keras.backend.clear_session()
tf.random.set_seed(42)  # was tf.set_random_seed in TF 1.x
X = np.array([[[1,2,3],[4,5,6],[7,8,9]],[[1,2,3],[4,5,6],[0,0,0]]], dtype=np.float32)
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(4, return_sequences=True, stateful=True,
                         recurrent_initializer='glorot_uniform')])
print(tf.keras.backend.get_value(model(X)).shape)
# (2, 3, 4)
print(tf.keras.backend.get_value(model(X)))
# [[[-0.16141939  0.05600287  0.15932009  0.15656665]
#   [-0.10788933  0.          0.23865232  0.13983202]
#   [-0.          0.          0.23865232  0.0057992 ]]
#
#  [[-0.16141939  0.05600287  0.15932009  0.15656665]
#   [-0.10788933  0.          0.23865232  0.13983202]
#   [-0.07900514  0.07872108  0.06463861  0.29855606]]]

So, if return_sequences is set to True, the model returns the full sequence of predictions.

tf.keras.backend.clear_session()
tf.random.set_seed(42)
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(4, return_sequences=False, stateful=True,
                         recurrent_initializer='glorot_uniform')])
print(tf.keras.backend.get_value(model(X)).shape)
# (2, 4)
print(tf.keras.backend.get_value(model(X)))
# [[-0.          0.          0.23865232  0.0057992 ]
#  [-0.07900514  0.07872108  0.06463861  0.29855606]]

So, as the documentation states, if return_sequences is set to False, the model returns only the last output.

As for stateful, it is a bit harder to dive into. But essentially, when there are multiple batches of inputs, the last cell state of batch i will be the initial state of batch i+1. However, I think you will be more than fine going with the default settings.
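
To see the carry-over with the toy model above (a sketch; the True/False results hold regardless of the random weights), consecutive calls on the same input no longer give the same result, and reset_states() makes the model start from zero states again:

model.reset_states()        # start from zero states
out_a = model(X)
out_b = model(X)            # continues from out_a's final states
model.reset_states()
out_c = model(X)            # zero states again

print(np.allclose(out_a, out_b))  # False: the carried-over state changed the output
print(np.allclose(out_a, out_c))  # True: resetting reproduces the first result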

Comments
  • Can you cite the tutorial from which the image is taken? My understanding is that stateful=True shares the context across batches, not the predictions
  • It is from the same tutorial mentioned in the question: tensorflow.org/tutorials/text/…
  • @rtrtrt well, each prediction is made in a separate batch; e.g. the first batch will have the first character for, say, 4000 samples, and the next batch will have the second character for those 4000 samples
  • What does it mean "for index i"?
  • @rtrtrt it refers to the sample at position i within the batch