Hot questions for Using Neural networks in seq2seq


I'm trying to understand the seq2seq models defined in TensorFlow. I use bits of code copied from the example that comes with TensorFlow. I keep getting the same error and really do not understand where it comes from.

A minimal code example to reproduce the error:

import tensorflow as tf
from tensorflow.models.rnn import rnn_cell
from tensorflow.models.rnn import seq2seq

encoder_inputs = []
decoder_inputs = []
for i in xrange(350):
    encoder_inputs.append(tf.placeholder(tf.int32, shape=[None],
                                         name="encoder{0}".format(i)))

for i in xrange(45):
    decoder_inputs.append(tf.placeholder(tf.int32, shape=[None],
                                         name="decoder{0}".format(i)))

model = seq2seq.basic_rnn_seq2seq(encoder_inputs,
                                  decoder_inputs, rnn_cell.BasicLSTMCell(512))

The error I get when evaluating the last line (I evaluated it interactively in the python interpreter):

    >>>  Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/tmp/py1053173el", line 12, in <module>
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/models/rnn/", line 82, in basic_rnn_seq2seq
        _, enc_states = rnn.rnn(cell, encoder_inputs, dtype=dtype)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/models/rnn/", line 85, in rnn
        output_state = cell(input_, state)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/models/rnn/", line 161, in __call__
        concat = linear.linear([inputs, h], 4 * self._num_units, True)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/models/rnn/", line 32, in linear
        raise ValueError("Linear is expecting 2D arguments: %s" % str(shapes))
    ValueError: Linear is expecting 2D arguments: [[None], [None, 512]]

I suspect the error comes from my side :) On a side note, the documentation and the tutorials are really great, but the example code for the sequence-to-sequence model (the English-to-French translation example) is quite dense. You also have to jump between files a lot to understand what's going on. I, at least, got lost several times in the code.

A minimal example (perhaps on some toy data) of constructing and training a basic seq2seq model would really be helpful here. Does anybody know if this already exists somewhere?

EDIT: I have fixed the code above according to @Ishamael's suggestions (meaning it no longer returns errors; see below), but some things are still not clear in this fixed version. My input is a sequence of real-valued vectors of length 2, and my output is a sequence of binary vectors of length 22. Should my tf.placeholder code not look like the following? (EDIT: yes)

tf.placeholder(tf.float32, shape=[None,2],name="encoder{0}".format(i))
tf.placeholder(tf.float32, shape=[None,22],name="encoder{0}".format(i))

I also had to change tf.int32 to tf.float32 above. Since my output is binary, should this not be tf.int32 for the tf.placeholder of my decoder? TensorFlow complains again if I do this, and I'm not sure what the reasoning behind it is.

The size of my hidden layer is 512 here.

The complete fixed code:

import tensorflow as tf
from tensorflow.models.rnn import rnn_cell
from tensorflow.models.rnn import seq2seq

encoder_inputs = []
decoder_inputs = []
for i in xrange(350):
    encoder_inputs.append(tf.placeholder(tf.float32, shape=[None,512],
                                         name="encoder{0}".format(i)))

for i in xrange(45):
    decoder_inputs.append(tf.placeholder(tf.float32, shape=[None,512],
                                         name="decoder{0}".format(i)))

model = seq2seq.basic_rnn_seq2seq(encoder_inputs,
                                  decoder_inputs, rnn_cell.BasicLSTMCell(512))


Most models (seq2seq is no exception) expect their input in batches, so if the shape of your logical input is [n], then the shape of the tensor you feed to the model should be [batch_size x n]. In practice the first dimension of the shape is usually given as None and inferred to be the batch size at runtime.

Since the logical input to seq2seq at each step is a vector of numbers, the actual tensor shape should be [None, input_sequence_length]. So the fixed code would look along the lines of:

input_sequence_length = 2  # the length of one vector in your input sequence

for i in xrange(350):
    encoder_inputs.append(tf.placeholder(tf.int32, shape=[None, input_sequence_length],
                                         name="encoder{0}".format(i)))

(and then the same for the decoder)
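
The shape requirement can be illustrated without TensorFlow. The sketch below uses NumPy (chosen only for illustration; it is not the library's internal code) to mimic what a recurrent cell's linear layer does at each time step: it joins the step input with the hidden state along the feature axis, which only works when both are 2-D arrays of shape [batch_size, features]. The helper name concat_step and the sizes are assumptions.

```python
import numpy as np

batch_size, input_dim, num_units = 4, 2, 512


def concat_step(step_input, hidden):
    """Join one time step's input with the hidden state along the feature axis."""
    if step_input.ndim != 2 or hidden.ndim != 2:
        raise ValueError("expected 2D arguments: %s" % [step_input.shape, hidden.shape])
    return np.concatenate([step_input, hidden], axis=1)


hidden = np.zeros((batch_size, num_units))

# One array per time step, each of shape [batch_size, input_dim].
good_steps = [np.zeros((batch_size, input_dim)) for _ in range(3)]
print(concat_step(good_steps[0], hidden).shape)  # (4, 514)

# A 1-D input of shape [batch_size] is rejected, much like in the question.
bad_step = np.zeros((batch_size,))
try:
    concat_step(bad_step, hidden)
except ValueError as err:
    print("rejected:", err)
```

With per-step inputs of shape [None, 2], the same concatenation succeeds for any batch size.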


I am trying to implement a seq2seq model in PyTorch and I am having some problems with the batching. For example, I have a batch of data whose dimensions are

[batch_size, sequence_lengths, encoding_dimension]

where the sequence lengths are different for each example in the batch.

Now, I managed to do the encoding part by padding each element in the batch to the length of the longest sequence.

This way if I give as input to my net a batch with the same shape as said, I get the following outputs:

output, of shape [batch_size, sequence_lengths, hidden_layer_dimension]

hidden state, of shape [batch_size, hidden_layer_dimension]

cell state, of shape [batch_size, hidden_layer_dimension]

Now, from the output, I take for each sequence the last relevant element, that is the element along the sequence_lengths dimension corresponding to the last non padded element of the sequence. Thus the final output I get is of shape [batch_size, hidden_layer_dimension].

But now I have the problem of decoding from this vector. How do I handle decoding sequences of different lengths in the same batch? I tried to google it and found this, but it doesn't seem to address the problem. I thought of decoding element by element for the whole batch, but then I have the problem of passing the initial hidden states, given that the ones from the encoder will be of shape [batch_size, hidden_layer_dimension], while the ones from the decoder will be of shape [1, hidden_layer_dimension].

Am I missing something? Thanks for the help!


You are not missing anything. I can help, since I have worked on several sequence-to-sequence applications using PyTorch. Here is a simple example.

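import torch
import torch.nn as nn

# EmbeddingLayer, Encoder, Decoder and helper are defined elsewhere in the answerer's project.
class Seq2Seq(nn.Module):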
    """A Seq2seq network trained on predicting the next query."""

    def __init__(self, dictionary, embedding_index, args):
        super(Seq2Seq, self).__init__()

        self.config = args
        self.num_directions = 2 if self.config.bidirection else 1

        self.embedding = EmbeddingLayer(len(dictionary), self.config)
        self.embedding.init_embedding_weights(dictionary, embedding_index, self.config.emsize)

        self.encoder = Encoder(self.config.emsize, self.config.nhid_enc, self.config.bidirection, self.config)
        self.decoder = Decoder(self.config.emsize, self.config.nhid_enc * self.num_directions, len(dictionary),
                               self.config)

    @staticmethod
    def compute_decoding_loss(logits, target, seq_idx, length):
        # pick out the (negated) score of the target token at this decoding step
        losses = -torch.gather(logits, dim=1, index=target.unsqueeze(1)).squeeze()
        mask = helper.mask(length, seq_idx)  # mask: batch x 1
        losses = losses * mask.float()  # zero out the loss at padded positions
        num_non_zero_elem = torch.nonzero(mask.data).size()
        if not num_non_zero_elem:
            return losses.sum(), 0
        return losses.sum(), num_non_zero_elem[0]

    def forward(self, q1_var, q1_len, q2_var, q2_len):
        # encode the query
        embedded_q1 = self.embedding(q1_var)
        encoded_q1, hidden = self.encoder(embedded_q1, q1_len)

        if self.config.bidirection:
            # concatenate the final forward and backward states along the feature axis
            if self.config.model == 'LSTM':
                h_t, c_t = hidden[0][-2:], hidden[1][-2:]
                decoder_hidden = torch.cat((h_t[0].unsqueeze(0), h_t[1].unsqueeze(0)), 2), torch.cat(
                    (c_t[0].unsqueeze(0), c_t[1].unsqueeze(0)), 2)
            else:
                h_t = hidden[0][-2:]
                decoder_hidden = torch.cat((h_t[0].unsqueeze(0), h_t[1].unsqueeze(0)), 2)
        else:
            if self.config.model == 'LSTM':
                decoder_hidden = hidden[0][-1], hidden[1][-1]
            else:
                decoder_hidden = hidden[-1]

        decoding_loss, total_local_decoding_loss_element = 0, 0
        for idx in range(q2_var.size(1) - 1):
            input_variable = q2_var[:, idx]
            embedded_decoder_input = self.embedding(input_variable).unsqueeze(1)
            decoder_output, decoder_hidden = self.decoder(embedded_decoder_input, decoder_hidden)
            local_loss, num_local_loss = self.compute_decoding_loss(decoder_output, q2_var[:, idx + 1], idx, q2_len)
            decoding_loss += local_loss
            total_local_decoding_loss_element += num_local_loss

        if total_local_decoding_loss_element > 0:
            decoding_loss = decoding_loss / total_local_decoding_loss_element

        return decoding_loss

You can see the complete source code here. This application is about predicting users' next web-search query given the current web-search query.

The answer to your question:

How do I handle a decoding of sequences of different lengths in the same batch?

You have padded sequences, so you can treat all the sequences as having the same length. But when you compute the loss, you need to ignore the loss for the padded positions by using masking.

I have used a masking technique to achieve the same in the above example.

Also, you are absolutely correct that you need to decode element by element for the mini-batches. The initial decoder state of shape [batch_size, hidden_layer_dimension] is also fine. You just need to unsqueeze it at dimension 0 to make it [1, batch_size, hidden_layer_dimension].

Please note, you do not need to loop over each example in the batch, you can execute the whole batch at a time, but you need to loop over the elements of the sequences.
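
The points above can be sketched in a small self-contained form. This is not the answerer's code: the layer sizes, the use of index 0 for padding, the random stand-in for the encoder output and the zero initial cell state are all assumptions made for illustration. It shows the unsqueeze at dimension 0, the loop over time steps rather than over examples, and a per-step mask that drops padded targets from the loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, max_len, vocab_size, emb_dim, hidden_dim = 3, 5, 11, 8, 16

embedding = nn.Embedding(vocab_size, emb_dim)
decoder_rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
project = nn.Linear(hidden_dim, vocab_size)

# Padded target batch (index 0 is the assumed padding index) and true lengths.
targets = torch.randint(1, vocab_size, (batch_size, max_len))
lengths = torch.tensor([5, 3, 2])
for row, n in enumerate(lengths.tolist()):
    targets[row, n:] = 0

# Stand-in for the encoder summary of shape [batch_size, hidden_dim].
encoder_last = torch.randn(batch_size, hidden_dim)
# nn.LSTM expects (h_0, c_0), each of shape [num_layers, batch_size, hidden_dim].
hidden = (encoder_last.unsqueeze(0), torch.zeros(1, batch_size, hidden_dim))

total_loss, total_tokens = 0.0, 0
for t in range(max_len - 1):                        # loop over time, not over examples
    step_input = embedding(targets[:, t]).unsqueeze(1)    # [batch, 1, emb_dim]
    output, hidden = decoder_rnn(step_input, hidden)       # [batch, 1, hidden_dim]
    logits = project(output.squeeze(1))                    # [batch, vocab_size]
    step_loss = F.cross_entropy(logits, targets[:, t + 1], reduction="none")
    mask = (lengths > t + 1).float()                # 1 where the target token is real
    total_loss = total_loss + (step_loss * mask).sum()
    total_tokens += int(mask.sum().item())

loss = total_loss / max(total_tokens, 1)
print(loss.item(), total_tokens)
```

Dividing by the number of unmasked tokens keeps the loss comparable across batches with different amounts of padding.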


I built a seq2seq model using the library provided with TensorFlow. Before training anything, I wanted to visualize the graph of my untrained model in TensorBoard, but it does not display.

Below is a minimal example to reproduce my problem. Does anybody have an idea why this does not work? Can you only visualize the graph of a model after it has been trained?

import tensorflow as tf
import numpy as np
from tensorflow.models.rnn import rnn_cell
from tensorflow.models.rnn import seq2seq

encoder_inputs = []
decoder_inputs = []

for i in xrange(350):
    encoder_inputs.append(tf.placeholder(tf.float32, shape=[None,2],
                                         name="encoder{0}".format(i)))

for i in xrange(45):
    decoder_inputs.append(tf.placeholder(tf.float32, shape=[None,22],
                                         name="decoder{0}".format(i)))

size = 512 # number of hidden units
num_layers = 2 # Number of LSTMs
single_cell = rnn_cell.BasicLSTMCell(size)
cell = rnn_cell.MultiRNNCell([single_cell] * num_layers)
model = seq2seq.basic_rnn_seq2seq(encoder_inputs, decoder_inputs,cell)

sess = tf.Session()
summary_writer = tf.train.SummaryWriter('/path/to/log', graph_def = sess.graph_def)


It looks like this might be related to a bug where the graph visualization does not work in the Firefox browser. Try using Chrome or Safari if possible.


Suppose I have a string, say "abc", and a target that is the same string reversed, say "cba".

Can a neural network, in particular an encoder-decoder model, learn this mapping? If so, what is the best model to accomplish this?

I ask because this is a structural translation rather than a simple character mapping as in normal machine translation.


I doubt that a NN will learn the abstract structural transformation. Since the string is of unbounded input length, the finite NN won't have the info necessary. NLP processes generally work with identifying small blocks and simple context-sensitive shifts. I don't think they'd identify the end-to-end swaps needed.

However, I expect that an image processor, adapted to a single dimension, would learn this quite quickly. Some can learn how to rotate a sub-image.
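
Whether a given model learns the mapping is an empirical question, and checking it requires training pairs. The following is a minimal sketch of generating such pairs, assuming a lowercase ASCII alphabet and a bounded maximum length; the function name make_reversal_pairs and the 900/100 split are hypothetical choices, not part of the question.

```python
import random


def make_reversal_pairs(num_pairs, max_len=8,
                        alphabet="abcdefghijklmnopqrstuvwxyz", seed=0):
    """Return (source, target) string pairs where target is source reversed."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        length = rng.randint(1, max_len)
        source = "".join(rng.choice(alphabet) for _ in range(length))
        pairs.append((source, source[::-1]))
    return pairs


pairs = make_reversal_pairs(1000)
train, held_out = pairs[:900], pairs[900:]  # keep some pairs unseen for evaluation
print(train[0])
```

Evaluating on the held-out pairs, or on strings longer than max_len, would show how far any learned behavior generalizes.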


My tgt tensor has shape [12, 32, 1], which is (sequence_length, batch_size, token_idx).

What is the best way to create a mask which has ones for entries with <eos> and before in sequence, and zeros afterwards?

Currently I'm calculating my mask like this, which simply puts zeros where <blank> is, ones otherwise.

mask = torch.zeros_like(tgt).masked_scatter_((tgt != tgt_padding), torch.ones_like(tgt))

The problem is that my tgt can contain <blank> before <eos> as well, in which case I don't want to mask it out.

My temporary solution:

mask = torch.ones_like(tgt)
for eos_token in (tgt == tgt_eos).nonzero():
    mask[eos_token[0]+1:,eos_token[1]] = 0


I guess you are trying to create a mask for the PAD tokens. There are several ways; one of them is as follows.

# tensor is of shape [seq_len, batch_size, 1]
tensor = tensor.mul(tensor.ne(PAD).float())

Here, PAD stands for the index of the PAD_TOKEN. tensor.ne(PAD) will create a byte tensor where 0 is assigned at PAD_TOKEN positions and 1 elsewhere.

If you have examples like "<s> I think <pad> so </s> <pad> <pad>", then I would suggest using different PAD tokens before and after </s>.

Or, if you have the length information for each sentence (in the above example, the sentence length is 6), then you can create the mask using the following function.

def sequence_mask(lengths, max_len=None):
    """
    Creates a boolean mask from sequence lengths.
    :param lengths: 1d tensor [batch_size]
    :param max_len: int
    """
    batch_size = lengths.numel()
    max_len = max_len or lengths.max()
    return (torch.arange(0, max_len, device=lengths.device)  # (0 for pad positions)
            .type_as(lengths)
            .repeat(batch_size, 1)
            .lt(lengths.unsqueeze(1)))
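
For the case in the question, where <blank> may also occur before <eos>, the mask can be derived from the <eos> positions directly instead of from the padding index. The sketch below is one possible loop-free approach using a cumulative sum along the time axis; the function name mask_through_eos and the token indices are hypothetical.

```python
import torch


def mask_through_eos(tgt, eos_idx):
    """1 up to and including the first <eos> of each sequence, 0 afterwards.

    tgt: LongTensor of shape [seq_len, batch_size, 1].
    """
    is_eos = (tgt == eos_idx).long()
    # Number of <eos> tokens seen strictly before each position along the time axis.
    eos_before = is_eos.cumsum(dim=0) - is_eos
    return (eos_before == 0).long()


EOS, BLANK = 2, 0  # hypothetical token indices
tgt = torch.tensor([[5, 0, 7, 2, 0, 0],
                    [4, 2, 0, 0, 0, 0]]).t().unsqueeze(-1)  # shape [6, 2, 1]
mask = mask_through_eos(tgt, EOS)
print(mask.squeeze(-1).t())
# tensor([[1, 1, 1, 1, 0, 0],
#         [1, 1, 0, 0, 0, 0]])
```

In the first sequence the <blank> at position 1 stays unmasked because it precedes <eos>. A sequence with no <eos> at all is left fully unmasked.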


I am building a model for topic classification and trying to use seq2seq for the model layer, but when I implement this it raises a ValueError:

"ValueError: Error when checking input: expected input_4 to have 3 dimensions, but got array with shape (160980, 15)".

Does anyone know what causes it? I only have two-dimensional input data (201225, 15) and labels (201225, 41), so I don't know why it needs three dimensions. Here is the code:

from keras.models import Sequential, save_model
from keras.layers import (Dense, Input, Flatten, Embedding, Dropout, Conv1D,
                          MaxPooling1D, GlobalMaxPooling1D, LSTM)
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping
from keras.utils import to_categorical
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import keras.backend as K
from keras.utils import plot_model
from keras.layers.wrappers import TimeDistributed

from sklearn.model_selection import train_test_split

import numpy as np
import pandas as pd
import seaborn as sns
from pandas import Series

import seq2seq
from seq2seq.models import SimpleSeq2Seq

# load data
texts = open('c:\\Users/KW198/Documents/topic_model/keywords.txt', 
all_labels = open('c:\\Users/KW198/Documents/topic_model/topics.txt', 

# Tokenize data
tok = Tokenizer()
tok.fit_on_texts(texts)  # build the vocabulary before converting texts to sequences
sequences = tok.texts_to_sequences(texts)
word_index = tok.word_index
print('Found %s unique tokens.' % len(word_index))
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(all_labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

Found 341826 unique tokens. Shape of data tensor: (201225, 15) Shape of label tensor: (201225, 41)


model = SimpleSeq2Seq(input_dim=15, hidden_dim=10, output_length=41, 

#plot_model(model, to_file='model.png',show_shapes=True)

checkporint = EarlyStopping(monitor='val_acc', patience=5, mode='max',
                            min_delta=0.003)
model.fit(x_train, y_train, epochs=13, batch_size=128, verbose=1,
          validation_split=0.2, callbacks=[checkporint])

score = model.evaluate(x_test, y_test, batch_size=128, verbose=1)
print('Test score:', score[0])
print('Test accuracy:', score[1])

Here is the error message:

ValueError                                Traceback (most recent call last)
<ipython-input-31-fbd441ff95e2> in <module>()
      6 checkporint = EarlyStopping(monitor='val_acc', patience=5, 
      mode='max',  min_delta=0.003)
----> 7 model.fit(x_train, y_train, epochs=13, batch_size=128, verbose=1, 
validation_split=0.2, callbacks=[checkporint])

~\Anaconda3\envs\ztdl\lib\site-packages\keras\engine\ in 
fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, 
validation_data, shuffle, class_weight, sample_weight, initial_epoch, 
   1427             class_weight=class_weight,
   1428             check_batch_axis=False,
-> 1429             batch_size=batch_size)
   1430         # Prepare validation data.
   1431         if validation_data:

~\Anaconda3\envs\ztdl\lib\site-packages\keras\engine\ in 
_standardize_user_data(self, x, y, sample_weight, class_weight, 
check_batch_axis, batch_size)
   1303                                     self._feed_input_shapes,
   1304                                     check_batch_axis=False,
-> 1305                                     exception_prefix='input')
   1306         y = _standardize_input_data(y, self._feed_output_names,
   1307                                     output_shapes,

~\Anaconda3\envs\ztdl\lib\site-packages\keras\engine\ in 
_standardize_input_data(data, names, shapes, check_batch_axis, 
    125                                  ' to have ' + str(len(shapes[i])) +
    126                                  ' dimensions, but got array with 
shape ' +
--> 127                                  str(array.shape))
    128             for j, (dim, ref_dim) in enumerate(zip(array.shape, 
    129                 if not j and not check_batch_axis:

ValueError: Error when checking input: expected input_4 to have 3 
dimensions, but got array with shape (160980, 15)


The input should be a 3-dimensional tensor whose dimensions represent (batch_size, input_length, input_dim).

So in your case, if the length of your sequence is 15 and the input dimension is 1, you should reshape your inputs to (?, 15, 1).

If your sequences of words have a fixed length (e.g. 15), then you should use the input_length=15 argument.
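
A minimal sketch of that reshape with NumPy, using a small zero array as a stand-in for the padded token matrix from the question (the 8 rows are an arbitrary assumption):

```python
import numpy as np

# Stand-in for the padded token matrix: (num_samples, sequence_length).
data = np.zeros((8, 15), dtype="float32")

# Add a trailing feature axis so each time step carries a 1-dimensional input.
data_3d = data[..., np.newaxis]          # equivalent to data.reshape(-1, 15, 1)
print(data.shape, "->", data_3d.shape)   # (8, 15) -> (8, 15, 1)
```

With this layout the model would presumably be built with input_dim=1 and input_length=15, matching the advice above.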


As I understand it, the first input to the decoder in a seq2seq model is the start token. But when I read the code of TrainingHelper in tensorflow/contrib/seq2seq/python/ops/, I found that it just returns the first token of the target tokens as the first input:

  def initialize(self, name=None):
    with ops.name_scope(name, "TrainingHelperInitialize"):
      finished = math_ops.equal(0, self._sequence_length)
      all_finished = math_ops.reduce_all(finished)
      next_inputs = control_flow_ops.cond(
          all_finished, lambda: self._zero_inputs,
          lambda: nest.map_structure(lambda inp: inp.read(0), self._input_tas))
      return (finished, next_inputs)

Is that right?


Hmm... I have worked on NLP many times, including seq2seq translation, but I have never heard of a start token, only an end token (EOF).

My seq2seq task worked well without anything like a start token, so I'm not sure whether it is a new technique. If it is, thanks for letting me know.
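
For reference, the helper quoted in the question takes its step-0 input from whatever sequence it was given, so whether decoding begins with a start token depends on how the caller prepares that sequence. One common convention (an assumption here, not something the quoted code enforces) is to prepend a start marker to the decoder inputs and append an end marker to the decoder targets. The marker strings and the function name make_decoder_pair below are hypothetical.

```python
SOS, EOS = "<s>", "</s>"  # hypothetical marker strings


def make_decoder_pair(target_tokens):
    """Build teacher-forcing input/output sequences for one target sentence."""
    decoder_input = [SOS] + list(target_tokens)    # what the decoder is fed at each step
    decoder_output = list(target_tokens) + [EOS]   # what it should predict at each step
    return decoder_input, decoder_output


dec_in, dec_out = make_decoder_pair(["je", "suis", "ici"])
print(dec_in)   # ['<s>', 'je', 'suis', 'ici']
print(dec_out)  # ['je', 'suis', 'ici', '</s>']
```

Under this convention the first element read by the helper is the start marker, which is consistent with the behavior observed in the question.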