Hot questions for Using Neural networks in machine translation


Question:

First of all, I know this question is kind of off-topic, but I have already tried to ask elsewhere but got no response.

Adding a UNK token to the vocabulary is a conventional way to handle out-of-vocabulary (OOV) words in NLP tasks. It is totally understandable to have it on the encoder side, but what is the point of having it on the decoder side? I mean, you would never expect your decoder to generate a UNK token during prediction, right?


Answer:

Depending on how you preprocess your training data, you might need the UNK token during training. Even if you use BPE or another subword segmentation, OOVs can still appear in the training data: usually odd UTF-8 sequences, fragments of alphabets you are not interested in at all, and so on.

For example, if you take the WMT training data for English-German translation, run BPE and extract the vocabulary, your vocabulary will contain thousands of Chinese characters that occur exactly once in the training data. Even if you keep them in the vocabulary, the model has no chance to learn anything about them, not even how to copy them. It makes sense to represent them as UNKs.
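A minimal sketch of that preprocessing step, assuming a simple frequency threshold and an `<unk>` string (both of these are my own illustration, not something taken from the answer):

```python
from collections import Counter

UNK = "<unk>"
MIN_COUNT = 2  # assumed threshold: tokens seen fewer times are mapped to UNK

def build_vocab(tokenized_corpus):
    """Keep only tokens that occur at least MIN_COUNT times in training."""
    counts = Counter(tok for sent in tokenized_corpus for tok in sent)
    return {tok for tok, c in counts.items() if c >= MIN_COUNT}

def replace_rare(sentence, vocab):
    """Map anything outside the kept vocabulary to the UNK token."""
    return [tok if tok in vocab else UNK for tok in sentence]

corpus = [["ein", "Haus", "好"], ["ein", "Auto"]]
vocab = build_vocab(corpus)
print(replace_rare(corpus[0], vocab))  # ['ein', '<unk>', '<unk>']
```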

Of course, what you usually do at inference time is prevent the model from predicting UNK tokens, because UNK is always an incorrect output.
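As a rough sketch of one common way to do this, you can mask out the UNK logit before picking the next token; the UNK_ID index and the greedy step below are assumptions for illustration, not part of the original answer:

```python
import numpy as np

UNK_ID = 0  # assumed index of the UNK token in the output vocabulary

def greedy_step(logits):
    # logits: [batch_size, vocab_size] scores for the next token
    masked = logits.copy()
    masked[:, UNK_ID] = -np.inf  # UNK can never win the argmax
    return masked.argmax(axis=1)

logits = np.array([[5.0, 1.2, 3.4]])  # UNK (index 0) has the highest raw score
print(greedy_step(logits))            # [2] -> best non-UNK token instead
```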

Question:

This code from the TensorFlow translate.py example confused me. The copied code is:

  # This is a greedy decoder - outputs are just argmaxes of output_logits.
  outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits]

Why does the argmax work?

The shape of output_logits is [bucket_length, batch_size, embedding_size].


Answer:

For each logit (that is, the vector of activations for one output word), they take the index at which the activation has the highest value.

For the argmax: take a look at the numpy examples on this page: https://docs.scipy.org/doc/numpy/reference/generated/numpy.argmax.html

>>> a = np.array([[0, 1, 2],
...               [3, 4, 5]])
>>> np.argmax(a)
5
>>> np.argmax(a, axis=0)
array([1, 1, 1])
>>> np.argmax(a, axis=1)
array([2, 2])

So what the outputs line does is:

  • for each output word (bucket_length of them),
    • take the index of the maximum activation along the embedding_size dimension.

Look at the shape of the resulting outputs array: because batch_size is 1, the int(...) conversion works and it all falls into place, as the sketch below shows.
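A small sketch of the shapes involved, using random logits and made-up sizes, showing why the int(np.argmax(...)) pattern only works when batch_size is 1:

```python
import numpy as np

# Hypothetical sizes for illustration: a bucket of 4 decoder steps,
# batch_size 1, and a logit vector of size 6 per step.
bucket_length, batch_size, logit_size = 4, 1, 6
output_logits = [np.random.randn(batch_size, logit_size)
                 for _ in range(bucket_length)]

# Same expression as in translate.py: one argmax per decoder step.
outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits]
print(outputs)  # e.g. [3, 0, 5, 1] -- one token id per output position

# With batch_size > 1, np.argmax(logit, axis=1) returns one index per batch
# element, so int(...) would fail; the snippet only works for batch_size == 1.
```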

Let me know if this helps you!