Hot questions for Using Neural networks in speech recognition

Question:

I've read a few papers about speech recognition based on neural networks, Gaussian mixture models and hidden Markov models. In my research, I came across the paper "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition" by George E. Dahl, Dong Yu, et al. I think I understand most of the presented idea, but I still have trouble with some details. I would really appreciate it if someone could enlighten me.

As I understand it, the procedure consists of three elements:

  1. Input The audio stream gets split into frames of 10 ms, and each frame is processed by MFCC, which outputs a feature vector.

  2. DNN The neural network gets the feature vector as input and processes the features, so that each frame (phone) is distinguishable, or rather gives a representation of the phone in context.

  3. HMM The HMM is a state model, in which each state represents a tri-phone. Each state has transition probabilities to all the other states. Now the output layer of the DNN produces a feature vector that tells the current state which state it has to change to next.

What I don't get: How are the features of the output layer (DNN) mapped to the probabilities of the states? And how is the HMM created in the first place? Where do I get all the information about the probabilities?

I don't need to understand every detail; the basic concept is sufficient for my purpose. I just need to make sure that my basic thinking about the process is right.


Answer:

In my research, I came across the paper "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition" by George E. Dahl, Dong Yu, et al. I think I understand most of the presented idea, but I still have trouble with some details.

It is better to read a textbook than a research paper.

so that each frame (phone) is distinguishable, or rather gives a representation of the phone in context.

This sentence does not have a clear meaning, which suggests you are not quite sure about it yourself. The DNN takes a frame's features and produces the probabilities for the states.
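As a rough sketch of what "produces the probabilities for the states" means, here is a softmax output layer over a tiny state inventory. The layer sizes, weights, and input are made up for illustration; real systems use trained weights and thousands of tied states.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 39   # e.g. 13 MFCCs + deltas + delta-deltas
n_states = 5      # toy state inventory; real systems use thousands of senones

W = rng.standard_normal((n_states, n_features))  # hypothetical output-layer weights
b = np.zeros(n_states)                           # output-layer biases

frame = rng.standard_normal(n_features)          # one frame's feature vector

logits = W @ frame + b
probs = np.exp(logits - logits.max())            # numerically stable softmax
probs /= probs.sum()

print(probs)        # one probability per state, summing to 1
```

The point is just the shape of the output: one distribution over states per frame, not another feature vector.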

HMM The HMM is a state model, in which each state represents a tri-phone.

Not necessarily a triphone. Usually the triphones are tied, which means several triphones correspond to a certain state.

Now the output layer of the DNN produces a feature vector

No, the DNN produces state probabilities for the current frame; it does not produce a feature vector.

that tells the current state which state it has to change to next.

No, the next state is selected by the HMM Viterbi algorithm based on the current state and the DNN probabilities. The DNN alone does not decide the next state.
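A minimal sketch of that selection, using made-up DNN probabilities and a made-up transition matrix for a tiny three-state left-to-right model. Viterbi picks the whole best path jointly, not one state at a time:

```python
import numpy as np

# Hypothetical DNN output: P(state | frame) for 4 frames and 3 states
dnn_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.7, 0.2],
    [0.1, 0.2, 0.7],
])

# HMM transition matrix: trans[i, j] = P(next state j | current state i)
trans = np.array([
    [0.6, 0.3, 0.1],
    [0.0, 0.6, 0.4],
    [0.0, 0.0, 1.0],
])

n_frames, n_states = dnn_probs.shape
log_delta = np.full((n_frames, n_states), -np.inf)
backptr = np.zeros((n_frames, n_states), dtype=int)

with np.errstate(divide="ignore"):
    log_trans = np.log(trans)
    # Force the path to start in state 0 (tiny prior on the other states)
    log_delta[0] = np.log(dnn_probs[0]) + np.log([1.0, 1e-10, 1e-10])

    for t in range(1, n_frames):
        for j in range(n_states):
            # Best predecessor = transition score + accumulated path score
            scores = log_delta[t - 1] + log_trans[:, j]
            backptr[t, j] = np.argmax(scores)
            log_delta[t, j] = scores[backptr[t, j]] + np.log(dnn_probs[t, j])

# Backtrace: the state sequence is decided jointly over all frames
path = [int(np.argmax(log_delta[-1]))]
for t in range(n_frames - 1, 0, -1):
    path.append(int(backptr[t, path[-1]]))
path.reverse()
print(path)  # prints [0, 0, 1, 2] for these numbers
```

Note how the DNN scores and the transition scores are combined at every step; neither alone determines the path.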

What I don't get: How are the features of the output layer (DNN) mapped to the probabilities of the states?

The output layer produces probabilities. It says that phone A at this frame is probable with probability 0.9 and phone B at this frame is probable with probability 0.1.

And how is the HMM created in the first place?

Unlike in end-to-end systems, which do not use an HMM, here the HMM is usually trained as a GMM/HMM system with the Baum-Welch algorithm before the DNN is initialized. So you first train the GMM/HMM with Baum-Welch, then you train the DNN to improve on the GMM.
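To illustrate what that first Baum-Welch stage does, here is a toy re-estimation loop for a two-state HMM with discrete observations. All numbers are made up; a real recognizer uses GMM emissions over MFCC features, but the forward-backward re-estimation has the same shape:

```python
import numpy as np

obs = np.array([0, 0, 1, 1, 1, 0, 1, 1])   # toy discrete observation sequence
n_states, n_symbols = 2, 2

A = np.array([[0.7, 0.3], [0.4, 0.6]])     # initial transition guess
B = np.array([[0.6, 0.4], [0.3, 0.7]])     # initial emission guess
pi = np.array([0.5, 0.5])

T = len(obs)
for iteration in range(5):                  # a few Baum-Welch iterations
    # Forward pass
    alpha = np.zeros((T, n_states))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    # Backward pass
    beta = np.ones((T, n_states))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()
    # State and transition posteriors
    gamma = alpha * beta / likelihood
    xi = np.zeros((n_states, n_states))
    for t in range(T - 1):
        xi += (alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]) / likelihood
    # Re-estimate the parameters from the posteriors
    A = xi / gamma[:-1].sum(axis=0)[:, None]
    for k in range(n_symbols):
        B[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    pi = gamma[0]

print(A)  # re-estimated transition probabilities; rows sum to 1
```

These re-estimated transition probabilities (and the GMM/HMM alignments) are what the DNN stage later builds on.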

Where do I get all the information about the probabilities?

It is hard to understand your last question.

Question:

I am building a speech-to-text system with N sample sentences, using hidden Markov models for re-estimation. In the context of neural networks, I understand that the concept of an epoch refers to a complete training cycle. I assume this means "feeding the same data to the same network, which is updated and has different weights and biases every time". Correct me if I am wrong.

Would the same logic work while performing re-estimation (i.e. training) of HMMs from the same sentences? In other words, if I have N sentences, can I repeat the input samples 10 times each to generate 10 * N samples? Does this mean I am performing 10 epochs on the HMMs? Furthermore, does this actually help obtain better results?

From this paper, I get the impression that epoch in the context of HMMs refers to a unit of time:

Counts represent a device-specific numeric quantity which is generated by an accelerometer for a specific time unit (epoch) (e.g. 1 to 60 sec).

If not a unit of time, epoch at the very least sounds different. In the end, I would like to know:

  • What is epoch in the context of HMMs?
  • How is it different from epoch in Neural Networks?
  • Considering the definition of epoch as a training cycle, would multiple epochs improve re-estimation of HMMs?

Answer:

What is epoch in the context of HMMs?

Same as in neural networks: a round of processing the whole dataset.

How is it different from epoch in Neural Networks?

There is no difference, except that the term "epoch" is not very widely used for HMMs. People just call it an "iteration".

From this paper, I get the impression that epoch in the context of HMMs refers to a unit of time

"Epoch" in this paper is not related to HMM context at all, it is a separate idea specific to that paper, you should not generalize the term usage from the paper.

Considering the definition of epoch as training cycles, would multiple epochs improve re-estimation of HMMs?

It is not the case that multiple epochs always improve re-estimation, neither for neural networks nor for HMMs. Each epoch improves the accuracy up to a certain point; then overtraining happens: the validation error starts to grow while the training error continues toward zero. There is an optimal number of iterations, usually depending on the model architecture. An HMM usually has fewer parameters and is less prone to overtraining, so extra epochs are not that harmful. Still, there is a number of epochs you need to perform for optimal results.

In speech recognition it is usually 6-7 iterations of the Baum-Welch algorithm. Fewer epochs give you a less accurate model; more epochs could lead to overtraining or simply not improve anything.
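The stopping logic described above can be sketched as simple early stopping on a validation set. The error values here are made up to show the typical curve shape (training error keeps falling while validation error bottoms out and then grows):

```python
# Hypothetical per-epoch error curves, illustrative numbers only
train_err = [0.50, 0.30, 0.20, 0.14, 0.10, 0.07, 0.05, 0.03, 0.02, 0.01]
val_err   = [0.52, 0.35, 0.27, 0.23, 0.21, 0.20, 0.21, 0.23, 0.26, 0.30]

best_epoch, best_val = 0, float("inf")
patience, bad_epochs = 2, 0

for epoch, v in enumerate(val_err):
    if v < best_val:
        best_val, best_epoch = v, epoch
        bad_epochs = 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:   # validation stopped improving: stop training
            break

print(best_epoch)  # prints 5: training error still falls after epoch 5,
                   # but validation error has already bottomed out there
```

The optimal iteration count (6-7 Baum-Welch passes in practice) is exactly this kind of validation-driven choice.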