Hot questions on using neural networks with MFCCs


Using librosa, I created MFCCs for my audio file as follows:

import librosa
y, sr = librosa.load('myfile.wav')
print(y)
print(sr)
mfcc = librosa.feature.mfcc(y=y, sr=sr)

I also have a text file that contains manual annotations [start, stop, tag] corresponding to the audio, as follows:

0.0 2.0 sound1
2.0 4.0 sound2
4.0 6.0 silence
6.0 8.0 sound1

QUESTION: How do I combine the MFCCs generated by librosa with the annotations from the text file?

The final goal is to pair each MFCC with its corresponding label and pass that to a neural network, so the network has the MFCCs and matching labels as training data.

If the features were one-dimensional, I could have N columns with N values and a final column Y with a class label. But I'm confused about how to proceed, since the MFCC matrix has a shape like (16, X) or (20, Y), so I don't know how to combine the two.

My sample MFCCs are here:

Please help, thank you.

Update: The objective is to train a neural network so that it can identify a new sound when it encounters it in the future.

I googled and found that MFCCs are very good for speech. However, my audio contains speech but I want to identify non-speech sounds. Are there any other recommended audio features for a general-purpose audio classification/recognition task?


Try the following. The explanation is included in the code.

import numpy
import librosa

# The following function returns a label index for a point in time (tp).
# This is pseudo-code for you to complete.
def getLabelIndexForTime(tp):
    # search the loaded annotations for the label corresponding to the given time
    # convert the label to an index that represents its unique value in the set
    # ie.. 'sound1' = 0, 'sound2' = 1, ...
    # print(tp)  # for debug
    label_index = 0  # replace with the logic described above
    return label_index

if __name__ == '__main__':
    # Load the waveform samples and convert to mfcc
    raw_samples, sample_rate = librosa.load('Front_Right.wav')
    mfcc = librosa.feature.mfcc(y=raw_samples, sr=sample_rate)
    print('Wave duration is %4.2f seconds' % (len(raw_samples)/float(sample_rate)))

    # Create the network's input training data, X
    # mfcc is organized (feature, sample) but the net needs (sample, feature)
    # X is mfcc reorganized to (sample, feature)
    X = numpy.moveaxis(mfcc, 1, 0)
    print('mfcc.shape:', mfcc.shape)
    print('X.shape:   ', X.shape)

    # Note that 512 samples is the default 'hop_length' used in calculating
    # the mfcc, so each mfcc spans 512/sample_rate seconds.
    mfcc_samples = mfcc.shape[1]
    mfcc_span    = 512/float(sample_rate)
    print('MFCC calculated duration is %4.2f seconds' % (mfcc_span*mfcc_samples))

    # For the 'n' network input samples, calculate the time point where they
    # occur and get the appropriate label index for each.
    # Use +0.5 to get the middle of the mfcc's point in time.
    Y = []
    for sample_num in range(mfcc_samples):
        time_point = (sample_num + 0.5) * mfcc_span
        label_index = getLabelIndexForTime(time_point)
        Y.append(label_index)
    Y = numpy.array(Y)

    # Y now contains the network's output training values
    # !Note for some nets you may need to convert this to one-hot format
    print('Y.shape:   ', Y.shape)
    assert Y.shape[0] == X.shape[0]  # X and Y have the same number of samples

    # Train the net with something like...
    # model.fit(X, Y, ...)  # ie.. for a Keras NN model
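
For reference, the getLabelIndexForTime() stub could be completed along these lines. This is only a sketch: it assumes the annotations from the question's text file have already been parsed into (start, stop, tag) triples, and that gaps between annotations should fall back to 'silence'.

```python
# A sketch of getLabelIndexForTime, assuming the annotations are already
# loaded as (start, stop, tag) triples like those in the question's file.
annotations = [
    (0.0, 2.0, 'sound1'),
    (2.0, 4.0, 'sound2'),
    (4.0, 6.0, 'silence'),
    (6.0, 8.0, 'sound1'),
]

# Map each unique tag to a stable integer index, e.g. 'silence' -> 0
label_to_index = {tag: i
                  for i, tag in enumerate(sorted({t for _, _, t in annotations}))}

def getLabelIndexForTime(tp):
    # Find the annotation interval that contains the time point tp
    for start, stop, tag in annotations:
        if start <= tp < stop:
            return label_to_index[tag]
    # Fallback for time points not covered by any annotation
    return label_to_index['silence']

print(getLabelIndexForTime(1.0))  # a time inside the first 'sound1' interval
```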

I should mention that here the Y data is intended for a network with a softmax output that can be trained on integer label data. Keras models accept this with the sparse_categorical_crossentropy loss function (I believe the loss function internally converts the labels to one-hot encoding). Other frameworks require the Y training labels to be delivered already in one-hot encoded format, which is more common. There are lots of examples of how to do the conversion. For your case you could do something like...

# num_label_types is the number of unique labels
# (e.g. 3 for sound1, sound2, silence)
Yoh = numpy.zeros(shape=(Y.shape[0], num_label_types), dtype='float32')
for i, val in enumerate(Y):
    Yoh[i, val] = 1.0

As for MFCCs being suitable for classifying non-speech audio, I would expect them to work, but you may want to try modifying their parameters; e.g., librosa lets you pass n_mfcc=40 so you get 40 features instead of the default 20. For fun, you might try replacing the MFCC with a simple FFT of the same size (512 samples) and see which works best.
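
If you want to try that FFT comparison, here is a minimal sketch of my own (not a librosa API): it splits the signal into non-overlapping 512-sample frames, mirroring the 512-sample hop used above, and takes the magnitude of the real FFT of each frame.

```python
import numpy

def fft_features(samples, frame_len=512):
    # Split the signal into non-overlapping frames and take the magnitude
    # of the real FFT of each one; each row then lines up with one
    # analysis window, analogous to one mfcc column.
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    return numpy.abs(numpy.fft.rfft(frames, axis=1))

# One second of a stand-in signal at librosa's default 22050 Hz rate
signal = numpy.random.randn(22050).astype('float32')
X_fft = fft_features(signal)
print(X_fft.shape)  # (43, 257): 43 frames, 257 magnitude bins each
```

Each row of X_fft can then replace the corresponding MFCC row as network input, with the same time-to-label mapping as above.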


I'm trying to follow a tutorial on sound classification in neural networks, and I've found 3 different versions of the same tutorial, all of which work, but they all reach a snag at this point in the code, where I get the "AttributeError: 'Series' object has no attribute 'label'" issue. I'm not particularly au fait with either NNs or Python, so apologies if this is something trivial like a deprecation error, but I can't seem to figure it out myself.

def parser(row):
   # function to load files and extract features
   file_name = os.path.join(os.path.abspath(data_dir), 'Train/train', str(row.ID) + '.wav')

   # handle exception to check if there isn't a file which is corrupted
   try:
      # here kaiser_fast is a technique used for faster extraction
      X, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
      # we extract mfcc feature from data
      mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
   except Exception as e:
      print("Error encountered while parsing file: ", file_name)
      return None, None
   feature = mfccs
   label = row.Class
   return [feature, label]

temp = train.apply(parser, axis=1)
temp.columns = ['feature', 'label']


Your current implementation of the parser(row) method returns a list for each row of data from the train DataFrame, but apply collects those results into a pandas.Series object.

So your temp is actually a Series object, and the following line doesn't have any effect:

temp.columns = ['feature', 'label']

Since temp is a Series, it does not have any columns, so temp.feature and temp.label don't exist, hence the error.

Change your parser() method as follows:

def parser(row):
    # ... load the file and extract feature and label as before ...

    # Return a pandas.Series instead of a list
    return pd.Series([feature, label])

By doing this, the apply method in temp = train.apply(parser, axis=1) will return a DataFrame, so the rest of your code will work.

I cannot say about the tutorials you are following; maybe they used an older version of pandas that allowed a list to be automatically converted to a DataFrame.
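
The difference is easy to see with a toy DataFrame (made-up data; the same pattern applies to the train frame in the question):

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2], 'Class': ['dog', 'cat']})

# Returning a list: apply collapses everything into a Series of lists
as_list = df.apply(lambda row: [row.ID * 10, row.Class], axis=1)
print(type(as_list))  # <class 'pandas.core.series.Series'>

# Returning a pd.Series: apply expands it into DataFrame columns
as_frame = df.apply(lambda row: pd.Series([row.ID * 10, row.Class]), axis=1)
as_frame.columns = ['feature', 'label']
print(type(as_frame))           # <class 'pandas.core.frame.DataFrame'>
print(as_frame.label.tolist())  # ['dog', 'cat']
```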


I'm doing the final project of my bachelor's degree. In short, I have 772 training sound files, each with 327 MFCC coefficients as sound features, so my x_train input is 772*327.

I asked for a recommendation on what model to use and was answered:

Try CNN on MFCC (add 4 or so CNN layers followed by Max Pooling) -> Flatten -> Dense Layers This is a very generic architecture that works for most tasks of this nature – Iordanis 2 days ago

So I tried to create it using TensorFlow:

model = tf.keras.models.Sequential([
     tf.keras.layers.Conv2D(filters=64, kernel_size=3, activation='relu', input_shape=(x_train.shape[1:])),
     tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
     tf.keras.layers.Dense(32, activation='relu')
])

(the integer values are completely arbitrary)

where x_train.shape[1:] is (327,) (the number of MFCC coefficients in each sound file),

but unfortunately it didn't work for me; it raises:

ValueError: Input 0 of layer conv2d is incompatible with the layer: expected ndim=4, found ndim=2. Full shape received: [None,

I tried changing the convolution layer to 1D, but that didn't work either (it just changed the error to expect ndim=3 instead of ndim=4).

Does anyone know what I should do?

Sorry for my English, and sorry if it's a stupid question; I'm pretty new to TensorFlow :)

Edit:

I did the following, but now it tells me:

TypeError: Error converting shape to a TensorShape: int() argument must be a string, a bytes-like object or a number, not 'tuple' on the dense layer

x_train.reshape((-1, 1))
x_test.reshape((-1, 1))
model = tf.keras.models.Sequential([
     tf.keras.layers.Conv1D(filters=64, kernel_size=3, activation='relu', input_shape=(x_train.shape[1:], 1)),
     tf.keras.layers.Dense(32, activation='relu'),
])

I also tried:

x_train.reshape((-1, 1))
x_test.reshape((-1, 1))
model = tf.keras.models.Sequential([
     tf.keras.layers.Conv1D(filters=64, kernel_size=3, activation='relu', input_shape=(x_train.shape[1:])),
     tf.keras.layers.Dense(32, activation='relu'),
])

but got the earlier error:

ValueError: Input 0 of layer conv1d is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: [None, 312]


Since your training data has only one feature dimension, use Conv1D instead of Conv2D. Your input then needs a 3D shape like (?, x, 1), where the first dimension is the batch size, the second is the features, and the last holds the values themselves. So try to reshape your data first via

x_train = x_train.reshape(np.append(x_train.shape, 1))

and input_shape=(x_train.shape[1:]) should work fine.

Please note that you also have to change your pooling to MaxPooling1D afterwards!
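
Putting the reshape together, here is a runnable sketch; random data stands in for the real 772x327 MFCC matrix, and the layer sizes in the commented-out model are placeholders, not recommendations:

```python
import numpy as np

# Stand-in for the real data: 772 sound files x 327 MFCC coefficients
x_train = np.random.randn(772, 327).astype('float32')

# Conv1D expects (batch, steps, channels), so append a channel axis of size 1
x_train = x_train.reshape(np.append(x_train.shape, 1))
print(x_train.shape)      # (772, 327, 1)
print(x_train.shape[1:])  # (327, 1) -> pass this as input_shape

# With that shape, the model would look something like:
# model = tf.keras.models.Sequential([
#     tf.keras.layers.Conv1D(64, kernel_size=3, activation='relu',
#                            input_shape=x_train.shape[1:]),
#     tf.keras.layers.MaxPooling1D(pool_size=2),
#     tf.keras.layers.Flatten(),
#     tf.keras.layers.Dense(32, activation='relu'),
# ])
```

Note that numpy's reshape returns a new array, so the result must be assigned back (plain x_train.reshape(...) on its own line, as in the question's edit, changes nothing).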