Hot questions for using neural networks in Transformers

Question:

I have been searching the web for a couple of days for a text generation model that uses only attention mechanisms.

The Transformer architecture that made waves in the context of seq-to-seq models is based solely on attention mechanisms, but it is mainly designed and used for translation or chatbot tasks, so it doesn't fit my purpose directly, even though the principle does.

My question is:

Does anyone know of, or has anyone heard of, a text generation model based solely on attention, without any recurrence?

Thanks a lot!

P.S. I'm familiar with PyTorch.


Answer:

Building a character-level self-attentive model is a challenging task. Character-level models are usually based on RNNs. Whereas in a word/subword model it is clear from the beginning which units carry meaning (and therefore which units the attention mechanism can attend to), a character-level model needs to learn word meaning in its later layers. This makes it quite difficult for the model to learn.

Text generation models are nothing more than conditional language models. Google AI recently published a paper on a Transformer character-level language model, but it is the only such work I know of.

Anyway, you should consider either using subword units (such as BPE or SentencePiece) or, if you really need to go character-level, using RNNs instead.
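For the attention-only generation part, the standard approach is a decoder-only language model: a Transformer stack with a causal mask, trained to predict the next token. Here is a minimal PyTorch sketch (the class name and hyperparameters are illustrative, not taken from any particular paper):

import torch
import torch.nn as nn


class TinyAttentionLM(nn.Module):
    # Decoder-only language model: token + position embeddings,
    # a stack of self-attention layers with a causal mask, and a
    # linear projection back to the vocabulary.
    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=4, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # tokens: (seq_len, batch) of token ids
        seq_len = tokens.size(0)
        positions = torch.arange(seq_len, device=tokens.device).unsqueeze(1)
        x = self.tok_emb(tokens) + self.pos_emb(positions)
        # Causal mask: position i may only attend to positions <= i.
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=tokens.device),
            diagonal=1,
        )
        hidden = self.blocks(x, mask=mask)
        return self.lm_head(hidden)  # logits over the vocabulary

Train it with cross-entropy on next-token prediction over your (sub)tokenized corpus, and sample from it autoregressively at generation time.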

Question:

I'm currently trying to extend a model that is based on FairSeq/PyTorch. During training I need to train two encoders: one with the target sample, and the original one with the source sample.

So the current forward function looks like this:

def forward(self, src_tokens=None, src_lengths=None, prev_output_tokens=None, **kwargs):
    encoder_out = self.encoder(src_tokens, src_lengths=src_lengths, **kwargs)
    decoder_out = self.decoder(prev_output_tokens, encoder_out=encoder_out, **kwargs)
    return decoder_out

And based on this idea, I want something like this:

def forward_test(self, src_tokens=None, src_lengths=None, prev_output_tokens=None, **kwargs):
    encoder_out = self.encoder(src_tokens, src_lengths=src_lengths, **kwargs)
    decoder_out = self.decoder(prev_output_tokens, encoder_out=encoder_out, **kwargs)
    return decoder_out

def forward_train(self, src_tokens=None, src_lengths=None, prev_output_tokens=None, tgt_tokens=None, **kwargs):
    encoder_out = self.encoder(src_tokens, src_lengths=src_lengths, **kwargs)
    autoencoder_out = self.encoder(tgt_tokens, src_lengths=src_lengths, **kwargs)
    concat = some_concatination_func(encoder_out, autoencoder_out)
    decoder_out = self.decoder(prev_output_tokens, encoder_out=concat, **kwargs)
    return decoder_out

Is there any way to do this?

Edit: These are the constraints that I have, since I need to extend FairseqEncoderDecoderModel:

@register_model('transformer_mass')
class TransformerMASSModel(FairseqEncoderDecoderModel):
    def __init__(self, encoder, decoder):
        super().__init__(encoder, decoder) 

Edit 2: The parameters passed to the forward function in Fairseq can be altered by implementing your own Criterion; see for example CrossEntropyCriterion, where sample['net_input'] is passed to the model's __call__ method, which invokes the forward method.
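For illustration, a custom criterion along these lines can inject extra tensors into the model call. This is a rough sketch modeled on fairseq's CrossEntropyCriterion; the criterion name is hypothetical and exact signatures can vary between fairseq versions:

from fairseq.criterions import register_criterion
from fairseq.criterions.cross_entropy import CrossEntropyCriterion


@register_criterion('cross_entropy_with_target')  # hypothetical name
class CrossEntropyWithTargetCriterion(CrossEntropyCriterion):
    def forward(self, model, sample, reduce=True):
        # sample['net_input'] is normally unpacked into the model's forward;
        # passing the target tokens here makes them arrive as tgt_tokens.
        net_output = model(**sample['net_input'], tgt_tokens=sample['target'])
        loss, _ = self.compute_loss(model, net_output, sample, reduce=reduce)
        sample_size = sample['target'].size(0) if self.sentence_avg else sample['ntokens']
        logging_output = {
            'loss': loss.data,
            'ntokens': sample['ntokens'],
            'nsentences': sample['target'].size(0),
            'sample_size': sample_size,
        }
        return loss, sample_size, logging_output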


Answer:

First of all, you should always define and use forward, not some other method that you call on the torch.nn.Module instance.

Definitely do not overload eval() as suggested by trsvchn, as it is the evaluation method defined by PyTorch (see here). That method puts the layers inside your model into evaluation mode (e.g. layer-specific behavior changes such as inference mode for Dropout or BatchNorm).

Furthermore, you should invoke the model via the __call__ magic method, i.e. model(inputs) rather than model.forward(inputs). Why? Because hooks and other PyTorch-specific machinery are registered and run properly that way.

Secondly, do not use some external mode string variable as suggested by @Anant Mittal. That's what the training attribute in PyTorch is for; it is the standard way to tell whether the model is in eval or train mode.
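Concretely (a minimal usage sketch; the tensor names are placeholders):

model = Network()
model.train()   # sets model.training = True on the model and all submodules
out = model(src_tokens, src_lengths=src_lengths, prev_output_tokens=prev)  # __call__, not .forward()

model.eval()    # sets model.training = False
with torch.no_grad():  # typically paired with eval() for inference
    out = model(src_tokens, src_lengths=src_lengths, prev_output_tokens=prev)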

That being said, you are best off doing it like this:

import torch


class Network(torch.nn.Module):
    def __init__(self):
        super().__init__()
        ...

    # You could split it into two functions but both should be called by forward
    def forward(
        self, src_tokens=None, src_lengths=None, prev_output_tokens=None,
        tgt_tokens=None, **kwargs
    ):
        encoder_out = self.encoder(src_tokens, src_lengths=src_lengths, **kwargs)
        # self.training is the flag toggled by model.train()/model.eval();
        # note that self.train is a method and would always be truthy.
        if self.training:
            # Training: run the second encoder pass over the target
            # sample and fuse both encoder outputs before decoding.
            autoencoder_out = self.encoder(tgt_tokens, src_lengths=src_lengths, **kwargs)
            concat = some_concatination_func(encoder_out, autoencoder_out)
            return self.decoder(prev_output_tokens, encoder_out=concat, **kwargs)
        # Evaluation: decode from the source encoder output alone.
        return self.decoder(prev_output_tokens, encoder_out=encoder_out, **kwargs)

You could (and arguably should) split the above into two separate methods, but that's not too bad, as the function is rather short and readable this way. Just stick to PyTorch's way of handling things where easily possible, rather than ad-hoc solutions. And no, there will be no problem with backpropagation; why would there be one?
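If you do decide to split it, one way that keeps everything routed through forward (the helper names _train_forward and _eval_forward are illustrative, not a PyTorch convention):

    def forward(
        self, src_tokens=None, src_lengths=None, prev_output_tokens=None,
        tgt_tokens=None, **kwargs
    ):
        encoder_out = self.encoder(src_tokens, src_lengths=src_lengths, **kwargs)
        if self.training:
            return self._train_forward(
                encoder_out, prev_output_tokens, tgt_tokens, src_lengths, **kwargs
            )
        return self._eval_forward(encoder_out, prev_output_tokens, **kwargs)

    def _train_forward(self, encoder_out, prev_output_tokens, tgt_tokens, src_lengths, **kwargs):
        # Second encoder pass over the target sample, then fuse both outputs.
        autoencoder_out = self.encoder(tgt_tokens, src_lengths=src_lengths, **kwargs)
        concat = some_concatination_func(encoder_out, autoencoder_out)
        return self.decoder(prev_output_tokens, encoder_out=concat, **kwargs)

    def _eval_forward(self, encoder_out, prev_output_tokens, **kwargs):
        return self.decoder(prev_output_tokens, encoder_out=encoder_out, **kwargs)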