## why set return_sequences=True and stateful=True for tf.keras.layers.LSTM?

I am learning TensorFlow 2.0 and following the tutorial. In the `rnn` example, I found this code:

```python
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.LSTM(rnn_units,
                             return_sequences=True,
                             stateful=True,
                             recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model
```

My question is: why does the code set the arguments `return_sequences=True` and `stateful=True`? What about using the default arguments?

The example in the tutorial is about text generation. Each batch fed to the network consists of 64 sequences of 100 character ids, and for every position the model outputs logits over the vocabulary, giving an output of shape:

```
(64, 100, 65)  # (batch_size, sequence_length, vocab_size)
```

`return_sequences=True`

Since the intention is to predict a character at every time step, i.e. for every character in the sequence the next character needs to be predicted, `return_sequences=True` is set to get an output of shape `(64, 100, 65)`. If this argument were set to `False`, only the last output would be returned, so for a batch of 64 the output would be `(64, 65)`, i.e. for every sequence of 100 characters, only the last predicted character would be returned.
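The effect of `return_sequences` can be sketched with a toy NumPy RNN loop (a hypothetical simplification, not the Keras implementation): collecting the hidden state at every step corresponds to `return_sequences=True`, while keeping only the final state corresponds to the default `False`.

```python
import numpy as np

def toy_rnn(x, Wx, Wh, return_sequences=False):
    """Minimal tanh RNN over x of shape (batch, time, features)."""
    batch, time, _ = x.shape
    units = Wh.shape[0]
    h = np.zeros((batch, units))
    outputs = []
    for t in range(time):
        h = np.tanh(x[:, t] @ Wx + h @ Wh)  # one recurrent step
        outputs.append(h)
    if return_sequences:
        return np.stack(outputs, axis=1)   # (batch, time, units)
    return h                               # (batch, units) - last step only

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 100, 65))         # toy stand-in for a batch
Wx = rng.normal(size=(65, 4)) * 0.1
Wh = rng.normal(size=(4, 4)) * 0.1

print(toy_rnn(x, Wx, Wh, return_sequences=True).shape)   # (64, 100, 4)
print(toy_rnn(x, Wx, Wh, return_sequences=False).shape)  # (64, 4)
```

Note that the last slice of the full sequence output equals the `return_sequences=False` result; the flag only controls how much of the computation is returned, not what is computed.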

`stateful=True`

From the documentation:
*"If True, the last state for each sample at index i in a batch will be used as initial state for the sample of index i in the following batch."*

In the diagram from the tutorial, you can see that setting `stateful=True` helps the LSTM make better predictions by carrying over the context from the previous batch.


##### Return Sequences

Let's look at typical model architectures built using LSTMs.

##### Sequence to sequence models:

We feed in a sequence of inputs (x's), one batch at a time, and each LSTM cell returns an output (y_i). So if your input is of size `batch_size x time_steps x input_size`, then the LSTM output will be `batch_size x time_steps x output_size`. This is called a sequence-to-sequence model because an input sequence is converted into an output sequence. Typical usages of this model are taggers (POS tagger, NER tagger). In Keras this is achieved by setting `return_sequences=True`.
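A tagger head illustrates why the full sequence output is needed: one shared projection is applied at every timestep. Here is a minimal NumPy sketch (the array stands in for the per-timestep LSTM outputs you would get with `return_sequences=True`; the weights are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
batch, time, units, tags = 4, 10, 8, 5

# Pretend these are the per-timestep LSTM outputs (return_sequences=True).
seq_out = rng.normal(size=(batch, time, units))

# A POS/NER-style head applies one shared projection per timestep;
# broadcasting over the time axis does this in a single matmul.
W, b = rng.normal(size=(units, tags)), np.zeros(tags)
tag_logits = seq_out @ W + b          # one tag distribution per token

print(tag_logits.shape)  # (4, 10, 5)
```

This is exactly what stacking a `Dense` layer on top of an LSTM with `return_sequences=True` does in Keras: the dense weights are shared across timesteps.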

##### Sequence classification - Many to one Architecture

In a many-to-one architecture we use the output state of only the last LSTM cell. This kind of architecture is normally used for classification problems, like predicting whether a movie review (represented as a sequence of words) is positive or negative. In Keras, if we set `return_sequences=False`, the model returns the output state of only the last LSTM cell.
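The many-to-one case can be sketched the same way: the classifier sees only the last timestep's state, i.e. what Keras returns with `return_sequences=False` (the arrays and weights below are illustrative placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
batch, time, units, classes = 8, 20, 16, 2

# Pretend these are the per-timestep LSTM outputs.
seq_out = rng.normal(size=(batch, time, units))

# Many-to-one: classify from the LAST timestep's state only.
last = seq_out[:, -1, :]              # (batch, units)
W, b = rng.normal(size=(units, classes)), np.zeros(classes)
logits = last @ W + b                 # one prediction per sequence

print(logits.shape)  # (8, 2)
```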

##### Stateful

An LSTM cell is composed of many gates, as shown in the figure below from this blog post. The states/gates of the previous cell are used to calculate the state of the current cell. In Keras, if `stateful=False`, the states are reset after each batch. If `stateful=True`, the states from the previous batch for index `i` will be used as the initial state for index `i` in the next batch, so state information gets propagated between batches. Check this link for an explanation of the usefulness of statefulness with an example.
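The carry-over semantics can be demonstrated with a toy NumPy recurrence (again a simplification of an LSTM, using a plain tanh cell): feeding two half-batches while reusing the final state, as `stateful=True` does, produces exactly the same final state as one run over the full sequence.

```python
import numpy as np

def rnn_steps(x, h, Wx, Wh):
    """Run a tanh RNN over x (batch, time, features) from initial state h."""
    for t in range(x.shape[1]):
        h = np.tanh(x[:, t] @ Wx + h @ Wh)
    return h

rng = np.random.default_rng(2)
Wx, Wh = rng.normal(size=(3, 4)) * 0.1, rng.normal(size=(4, 4)) * 0.1
x = rng.normal(size=(2, 6, 3))          # one long sequence per sample
h0 = np.zeros((2, 4))

# One pass over the whole sequence.
h_full = rnn_steps(x, h0, Wx, Wh)

# Two batches: the final state of batch i seeds sample i in batch i+1,
# which is what stateful=True does in Keras.
h_mid = rnn_steps(x[:, :3], h0, Wx, Wh)         # first batch
h_carried = rnn_steps(x[:, 3:], h_mid, Wx, Wh)  # second batch, state carried

print(np.allclose(h_full, h_carried))  # True
```

With `stateful=False`, the second batch would instead restart from zeros and the context of the first three timesteps would be lost.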


Let's see the differences when playing around with the arguments:

```python
tf.keras.backend.clear_session()
tf.random.set_seed(42)  # tf.set_random_seed in TF 1.x

X = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
              [[1, 2, 3], [4, 5, 6], [0, 0, 0]]], dtype=np.float32)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(4, return_sequences=True, stateful=True,
                         recurrent_initializer='glorot_uniform')])

print(tf.keras.backend.get_value(model(X)).shape)
# (2, 3, 4)
print(tf.keras.backend.get_value(model(X)))
# [[[-0.16141939  0.05600287  0.15932009  0.15656665]
#   [-0.10788933  0.          0.23865232  0.13983202]
#   [-0.          0.          0.23865232  0.0057992 ]]
#  [[-0.16141939  0.05600287  0.15932009  0.15656665]
#   [-0.10788933  0.          0.23865232  0.13983202]
#   [-0.07900514  0.07872108  0.06463861  0.29855606]]]
```

So, if `return_sequences` is set to `True`, the model returns the full sequence it predicts.

```python
tf.keras.backend.clear_session()
tf.random.set_seed(42)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(4, return_sequences=False, stateful=True,
                         recurrent_initializer='glorot_uniform')])

print(tf.keras.backend.get_value(model(X)).shape)
# (2, 4)
print(tf.keras.backend.get_value(model(X)))
# [[-0.          0.          0.23865232  0.0057992 ]
#  [-0.07900514  0.07872108  0.06463861  0.29855606]]
```

So, as the documentation states, if `return_sequences` is set to `False`, the model returns only the last output.

As for `stateful`, it is a bit harder to dive into. But essentially, when you have multiple batches of inputs, the last cell state at batch `i` will be the initial state at batch `i+1`. However, I think you will be more than fine going with the default settings.


##### Comments

- Can you cite the tutorial from which the image is taken? My understanding is that `stateful=True` shares the context across batches, not the predictions.
- It is from the same tutorial mentioned in the question: tensorflow.org/tutorials/text/…
- @rtrtrt well, each prediction is made in separate batches: the first batch will have the first character for, e.g., 4000 samples, and the next batch will have the second character for those 4000 samples.
- What does it mean "for index i"?
- @rtrtrt it means the LSTM cell unwrapped at time step `i`