Hot questions for using neural networks with MFCCs

Question:

Using librosa, I created MFCCs for my audio file as follows:

import librosa
y, sr = librosa.load('myfile.wav')
print(y)
print(sr)
mfcc = librosa.feature.mfcc(y=y, sr=sr)

I also have a text file that contains manual annotations [start, stop, tag] corresponding to the audio, as follows:

0.0 2.0 sound1
2.0 4.0 sound2
4.0 6.0 silence
6.0 8.0 sound1

QUESTION: How do I combine the MFCCs generated by librosa with the annotations from the text file?

The final goal is to pair each MFCC with its corresponding label and pass the pairs to a neural network, so the network has the MFCCs and corresponding labels as training data.

If it were one-dimensional, I could have N columns with N values and a final column Y with a class label. But I'm confused about how to proceed, since the mfcc has a shape like (16, X) or (20, Y), so I don't know how to combine the two.

My sample mfcc's are here : https://gist.github.com/manbharae/0a53f8dfef6055feef1d8912044e1418

Please help, thank you.

Update: The objective is to train a neural network so that it can identify a new sound when it encounters one in the future.

I googled and found that MFCCs are very good for speech. However, my audio has speech, but I want to identify non-speech. Are there any other recommended audio features for a general-purpose audio classification/recognition task?


Answer:

Try the following. The explanation is included in the code.

import numpy
import librosa

# The following function returns a label index for a point in time (tp)
# this is pseudocode for you to complete
def getLabelIndexForTime(tp):
    # search the loaded annotations for the label that corresponds to the given time
    # convert the label to an index that represents its unique value in the set
    # e.g. 'sound1' = 0, 'sound2' = 1, ...
    # print(tp)  # for debugging
    label_index = 0  # replace with the logic described above
    return label_index


if __name__ == '__main__':
    # Load the waveforms samples and convert to mfcc
    raw_samples, sample_rate = librosa.load('Front_Right.wav')
    mfcc  = librosa.feature.mfcc(y=raw_samples, sr=sample_rate)
    print('Wave duration is %4.2f seconds' % (len(raw_samples) / float(sample_rate)))

    # Create the network's input training data, X
    # mfcc is organized (feature, sample) but the net needs (sample, feature)
    # X is mfcc reorganized to (sample, feature)
    X     = numpy.moveaxis(mfcc, 1, 0)
    print('mfcc.shape:', mfcc.shape)
    print('X.shape:   ', X.shape)

    # Note that 512 samples is the default 'hop_length' used in calculating 
    # the mfcc so each mfcc spans 512/sample_rate seconds.
    mfcc_samples = mfcc.shape[1]
    mfcc_span    = 512/float(sample_rate)
    print('MFCC calculated duration is %4.2f seconds' % (mfcc_span * mfcc_samples))

    # for 'n' network input samples, calculate the time point where they occur
    # and get the appropriate label index for them.
    # Use +0.5 to get the midpoint in time of each mfcc frame.
    Y = []
    for sample_num in range(mfcc_samples):
        time_point = (sample_num + 0.5) * mfcc_span
        label_index = getLabelIndexForTime(time_point)
        Y.append(label_index)
    Y = numpy.array(Y)

    # Y now contains the network's output training values
    # !Note for some nets you may need to convert this to one-hot format
    print('Y.shape:   ', Y.shape)
    assert Y.shape[0] == X.shape[0] # X and Y have the same number of samples

    # Train the net with something like...
    # model.fit(X, Y, ...)   # e.g. for a Keras NN model
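
The getLabelIndexForTime() stub above is left for you to complete. For reference, here is one minimal way to fill it in, assuming the annotation file holds one "start stop tag" triple per line as in the question (the file name annotations.txt is made up):

annotations = []      # list of (start, stop, label_index) tuples
label_to_index = {}   # maps each unique tag to a small integer

with open('annotations.txt') as f:   # hypothetical file name
    for line in f:
        start, stop, tag = line.split()
        if tag not in label_to_index:
            label_to_index[tag] = len(label_to_index)
        annotations.append((float(start), float(stop), label_to_index[tag]))

def getLabelIndexForTime(tp):
    # linear search for the annotation interval containing time tp
    for start, stop, label_index in annotations:
        if start <= tp < stop:
            return label_index
    return 0  # fallback for gaps in the annotations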

I should mention that here the Y data is intended for a network with a softmax output that can be trained with integer label data. Keras models accept this with the sparse_categorical_crossentropy loss function (I believe the loss function internally converts the labels to one-hot encoding). Other frameworks require the Y training labels to be delivered already in one-hot format, which is more common. There are lots of examples of how to do the conversion. For your case you could do something like...

Yoh = numpy.zeros(shape=(Y.shape[0], num_label_types), dtype='float32')  # num_label_types = number of distinct labels
for i, val in enumerate(Y):
    Yoh[i, val] = 1.0
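
Alternatively, current TensorFlow/Keras ships a helper that does the same conversion; a one-line equivalent (assuming the same Y and num_label_types as above) would be:

from tensorflow.keras.utils import to_categorical
Yoh = to_categorical(Y, num_classes=num_label_types)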

As for MFCCs being acceptable for classifying non-speech, I would expect them to work, but you may want to try modifying their parameters; e.g. librosa lets you pass n_mfcc=40 so you get 40 features instead of just 20. For fun, you might try replacing the mfcc with a simple FFT of the same size (512 samples) and see which works best.
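
As a quick sketch of that, reusing raw_samples and sample_rate from the listing above:

mfcc40 = librosa.feature.mfcc(y=raw_samples, sr=sample_rate, n_mfcc=40)
print(mfcc40.shape)  # (40, X) instead of the default (20, X)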

Question:

I'm trying to follow a tutorial on sound classification with neural networks, and I've found 3 different versions of the same tutorial, all of which work, but they all hit a snag at this point in the code, where I get the error "AttributeError: 'Series' object has no attribute 'label'". I'm not particularly au fait with either NNs or Python, so apologies if this is something trivial like a deprecation error, but I can't seem to figure it out myself.

def parser(row):
    # function to load files and extract features
    file_name = os.path.join(os.path.abspath(data_dir), 'Train/train', str(row.ID) + '.wav')

    # handle exception to check if there isn't a file which is corrupted
    try:
        # here kaiser_fast is a technique used for faster extraction
        X, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
        # we extract the mfcc feature from the data
        mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
    except Exception as e:
        print("Error encountered while parsing file: ", file_name)
        return None, None

    feature = mfccs
    label = row.Class

    return [feature, label]

temp = train.apply(parser, axis=1)
temp.columns = ['feature', 'label']

Answer:

Your current implementation of the parser(row) method returns a list for each row of data from the train DataFrame, but the result is then collected as a pandas.Series object.

So your temp is actually a Series object, and the following line doesn't have any effect:

temp.columns = ['feature', 'label']

Since temp is a Series, it does not have any columns, so temp.feature and temp.label don't exist, hence the error.

Change your parser() method as follows:

def parser(row):
    ...
    ...
    ...

    # Return pandas.Series instead of List
    return pd.Series([feature, label])

By doing this, the apply call in temp = train.apply(parser, axis=1) will return a DataFrame, so the rest of your code will work.
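
To see the difference on a toy DataFrame (the data and column values here are made up):

import pandas as pd

df = pd.DataFrame({'ID': [1, 2], 'Class': ['sound1', 'sound2']})

# Returning a list gives back a Series of lists...
as_series = df.apply(lambda row: [row.ID, row.Class], axis=1)
print(type(as_series))  # <class 'pandas.core.series.Series'>

# ...while returning a pd.Series expands into DataFrame columns.
as_frame = df.apply(lambda row: pd.Series([row.ID, row.Class]), axis=1)
as_frame.columns = ['feature', 'label']
print(type(as_frame))   # <class 'pandas.core.frame.DataFrame'>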

I can't speak to the tutorials you are following; maybe they were written against an older version of pandas which automatically converted the returned list to a DataFrame.

Question:

I'm doing the final project for my undergraduate degree. In short, I'm taking 772 training sound files, each of which has 327 sound feature coefficients called MFCCs, so my x_train input is 772*327.

I asked for a recommendation on which model to use, and the answer was:

Try CNN on MFCC (add 4 or so CNN layers followed by Max Pooling) -> Flatten -> Dense Layers. This is a very generic architecture that works for most tasks of this nature – Iordanis 2 days ago

So I tried to create it using TensorFlow:

model = tf.keras.models.Sequential([
     tf.keras.layers.Conv2D(filters=64, kernel_size=3, activation='relu', input_shape=(x_train.shape[1:])),
     tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
     tf.keras.layers.Flatten(),
     tf.keras.layers.Dense(32, activation='relu')
])

(the integer values are completely arbitrary)

where x_train.shape[1:] is (327,) (the number of MFCC coefficients in each sound file),

but unfortunately it didn't work for me, and it printed:

ValueError: Input 0 of layer conv2d is incompatible with the layer: expected ndim=4, found ndim=2. Full shape received: [None, 312]

I tried changing the convolution layer to 1D, but that didn't work either (it just changed the error to expected ndim=3 instead of ndim=4).

Does anyone know what I should do?

Sorry for my English, and sorry if it's a stupid question; I'm pretty new to TensorFlow :)

Edit:

I did the following, but now it gives me:

TypeError: Error converting shape to a TensorShape: int() argument must be a string, a bytes-like object or a number, not 'tuple' on the dense layer

x_train.reshape((-1, 1))
x_test.reshape((-1, 1))
model = tf.keras.models.Sequential([
     tf.keras.layers.Conv1D(filters=64, kernel_size=3, activation='relu', input_shape=(x_train.shape[1:], 1)),
     tf.keras.layers.MaxPooling1D(pool_size=2),
     tf.keras.layers.Flatten(),
     tf.keras.layers.Dense(32, activation='relu'),
])

I also tried this:

x_train.reshape((-1, 1))
x_test.reshape((-1, 1))
model = tf.keras.models.Sequential([
     tf.keras.layers.Conv1D(filters=64, kernel_size=3, activation='relu', input_shape=(x_train.shape[1:])),
     tf.keras.layers.MaxPooling1D(pool_size=2),
     tf.keras.layers.Flatten(),
     tf.keras.layers.Dense(32, activation='relu'),
])

but got the same error as before:

ValueError: Input 0 of layer conv1d is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: [None, 312]


Answer:

Since your training data has only one feature dimension, use Conv1D instead of Conv2D. Then your input needs a 3D shape like (?, x, 1), where the first dimension is the batch size, the second is the features, and the last holds the values themselves. So reshape your data first via

x_train = x_train.reshape(np.append(x_train.shape, 1))

and input_shape=(x_train.shape[1:]) should work fine.

Please note that you also have to change your pooling to MaxPooling1D afterwards!
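
Putting the whole answer together, a minimal runnable sketch (the random data and num_classes = 10 are placeholders standing in for your real 772*327 MFCC matrix and actual label count):

import numpy as np
import tensorflow as tf

# placeholder data: 772 files x 327 MFCC coefficients
x_train = np.random.rand(772, 327).astype('float32')

# add a channel dimension: (772, 327) -> (772, 327, 1)
x_train = x_train.reshape(np.append(x_train.shape, 1))

num_classes = 10  # placeholder; use your real number of labels

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv1D(filters=64, kernel_size=3, activation='relu',
                           input_shape=x_train.shape[1:]),  # (327, 1)
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(num_classes, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.summary()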