Hot questions on using neural networks in audio


I am a beginner at Deep Learning and am attempting to practice the implementation of Neural Networks in Python by performing audio analysis on a dataset. I have been following the Urban Sound Challenge tutorial and have completed the code for training the model, but I keep running into errors when trying to run the model on the test set.

Here is my code for creation of the model and training:

import numpy as np
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten

num_labels = y.shape[1]
filter_size = 2

model = Sequential()

model.add(Dense(256, input_shape = (40,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_labels))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
model.fit(X, y, batch_size=32, epochs=40, validation_data=(val_X, val_Y))

Running model.summary() before fitting the model gives me:

Layer (type)                 Output Shape              Param #   
dense_3 (Dense)              (None, 256)               10496     
activation_3 (Activation)    (None, 256)               0         
dropout_2 (Dropout)          (None, 256)               0         
dense_4 (Dense)              (None, 10)                2570      
activation_4 (Activation)    (None, 10)                0         
Total params: 13,066
Trainable params: 13,066
Non-trainable params: 0

After fitting the model, I attempt to run it on one file so that it can classify the sound.

import librosa

file_name = ".../UrbanSoundClassifier/test/Test/5.wav"
test_X, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
mfccs = np.mean(librosa.feature.mfcc(y=test_X, sr=sample_rate, n_mfcc=40).T, axis=0)
test_X = np.array(mfccs)

However, I get

ValueError: Error when checking : expected dense_3_input to have shape (None, 40) but got array with shape (40, 1)

Could someone kindly point me in the right direction on how I should be testing the model? I do not know what the input to model.predict() should be.

Full code can be found here.



  1. The easiest fix is simply reshaping test_X:

    test_X = test_X.reshape((1, 40))
  2. More sophisticated is to reuse the pipeline you already have for creating the train and validation sets for the test set as well. Notice that the process you applied to the data files is completely different in the test case. I'd create a test dataframe:

    test_dataframe = pd.DataFrame({'filename': ["here path to test file"]})

    and then reuse the existing pipeline for creating the validation set.
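Option 1 can be sketched end to end; this is a minimal, hedged illustration where a zero vector stands in for the 40 averaged MFCC features (the real values come from librosa, as in the question), and the prediction call is shown as a comment since it needs the trained model:

```python
import numpy as np

# Stand-in for the 40 averaged MFCC features extracted from one test file.
mfccs = np.zeros(40)

# Keras Dense layers expect a batch dimension: (n_samples, n_features).
test_x = mfccs.reshape((1, 40))
print(test_x.shape)  # (1, 40)

# With the trained model from the question, the prediction would then be:
# pred = model.predict(test_x)        # shape (1, num_labels)
# label = np.argmax(pred, axis=1)
```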


I want to create a basic convolutional autoencoder in Keras (tensorflow, python) for use on audio (MP3, WAV, etc.) files.

Basically, here's what I'm doing:

1) convert an mp3 into an array

    def mp3_to_array(original_mp3):
        blah blah blah
        return original_array

2) run array through autoencoder, output a similar (but lossy, because of the autoencoder operations) array

    def autoencoder(original_array):
        autoencoder stuff
        return new_array

3) convert an array into an mp3

    def array_to_mp3(new_array):
        halb halb halb
        return new_mp3

I know that Mel spectrograms and Mel-frequency cepstral coefficients (MFCCs) are commonly used in classification systems. As far as I know, I can't use these, because they can't be converted back to mp3s without significant loss.

Is there an array-based, lossless (or nearly lossless) representational conversion method that's suitable for use in a convolutional neural network, to convert an mp3 to an array and vice versa?

EDIT: Specifically, I'm asking about steps 1 and 3. I'm aware step 2 will be inherently lossy.

Thanks in advance!


I would say this is less a question about raw audio representation and more a question of whether there is a lossless convolutional transformation, to which I would say no.

As an aside, there are plenty of transformations which are lossless (or nearly so). For example, when you send audio through a Fourier transform to convert it from the time domain into its frequency-domain representation, then perform a second transformation by sending the frequency-domain representation through an inverse Fourier transform, you get back normal time-domain audio that matches your original source audio to an arbitrary level of precision. I know this after writing a golang project which takes an input greyscale photograph, synthesizes the per-pixel light-intensity information into a single-channel audio signal (inverse Fourier transform), and then "listens" to that signal (Fourier transform) to synthesize an output photo that matches the input photo.
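That round trip can be checked numerically; here is a minimal sketch with NumPy, where the signal length, random test signal, and tolerance are arbitrary illustration choices:

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal(1024)  # stand-in for a chunk of time-domain audio

# Forward transform to the frequency domain, then invert it.
spectrum = np.fft.rfft(signal)
roundtrip = np.fft.irfft(spectrum, n=len(signal))

# The reconstruction matches the original to floating-point precision.
print(np.max(np.abs(signal - roundtrip)) < 1e-10)  # True
```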

If you are concerned with bit-level accuracy (lossless), you should avoid mp3 and use a lossless codec, or just use the WAV format for starters. Any audio CD uses WAV, which is simply the audio curve in PCM: just the points on the audio curve (samples for both channels). In your step 2) above, if you feed the audio curve directly into your neural net, it will be given your lossless audio data. The point of a typical autoencoder is by definition a lossy transformation, since it throws away bit-level information.
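For steps 1) and 3), swapping mp3 for WAV makes the array conversion trivially lossless. A sketch using only the standard-library wave module, assuming mono 16-bit PCM (random samples stand in for real audio):

```python
import os
import tempfile
import wave

import numpy as np

def wav_to_array(path):
    # Step 1: read the raw PCM samples losslessly into an array.
    with wave.open(path, 'rb') as w:
        frames = w.readframes(w.getnframes())
        return np.frombuffer(frames, dtype=np.int16), w.getframerate()

def array_to_wav(path, samples, fs):
    # Step 3: write the array back out as mono 16-bit PCM.
    with wave.open(path, 'wb') as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes per sample = 16-bit
        w.setframerate(fs)
        w.writeframes(samples.astype(np.int16).tobytes())

# Round trip: the samples come back bit-identical.
original = np.random.default_rng(0).integers(-32768, 32767, 1000).astype(np.int16)
path = os.path.join(tempfile.mkdtemp(), 'check.wav')
array_to_wav(path, original, fs=44100)
recovered, fs = wav_to_array(path)
print(np.array_equal(original, recovered))  # True
```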

There are several challenges when using audio as input to a neural network:

1) Audio has the aspect of time, so depending on what you need, you may want to group chunks of audio samples (to make a series of windows of samples) and feed each window as a unit of data into the NN, or maybe not.

2) As with images, audio has a massive number of data points: each point on the raw audio curve was sampled upstream, and you typically have 44,100 samples per channel per second, where semantic meaning is often the result of groupings of these samples. For instance, one spoken word is an aggregate notion easily involving thousands and possibly tens of thousands of audio sample data points, so it's critical to create these windows of audio samples properly. Bundled into the creation of a window of samples is the design decision of how the next window will be created: does the next window contain some samples from the previous window, or are all of its samples new? Is the number of audio samples in each window the same, or does it vary?
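Those windowing decisions (window length, and whether consecutive windows overlap) can be sketched with a small framing helper; the frame length and hop size below are arbitrary illustration values:

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into windows of frame_len samples, starting a new
    window every hop samples (hop < frame_len means windows overlap)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

x = np.arange(10)                          # stand-in for audio samples
frames = frame_signal(x, frame_len=4, hop=2)
print(frames.shape)  # (4, 4): four windows of four samples each
print(frames[1])     # [2 3 4 5]: shares two samples with the first window
```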

So open up the input audio file and read it into a buffer. To confirm the buffer was created OK, just write it out to a file, then play back that file and verify it plays correctly. You can use the free, open-source audio tool Audacity to open an audio file and view its audio curve.


What I am trying to do is to "separate" vowels from consonants in an audio file (WAV file). For example, a file would be this sentence: "I am fine", and I have to separate the vowel sounds from the consonant ones. After the "separation", I can ignore the consonants because they have no importance in this project. I also have to ignore the pauses in speech (the pauses between words). So this is my problem: how to separate the vowels from the consonants.

I was advised that for segmentation I could use an FCM (fuzzy c-means) algorithm or the histogram method. I have looked into these two methods; however, I could not find anything that could help me.

Can someone walk me through the steps I have to do or give me some useful links? I want to mention I can also use some other methods (not necessarily fcm or histograms).



You can use hidden Markov model (HMM) based segmentation methods to segment your speech signal into its corresponding phonemes. You need a correct transcription of the speech signal and letter-to-sound (LTS) rules to do this. Once you segment the speech correctly, you can then separate the vowels easily. This link will be useful for that.


I want to recognize voices using a neural network. To do that, I first need to get a good input for the neural network, but I don't think just giving the sound recording as input would work, because it is based on frequency and time. So I found the Fourier transform, and now I'm trying to transform my audio file with it and plot the result.

My questions are:

How can I plot a Fourier transform of audio input in Python? And if that works, how can I feed the Fourier transform into the neural network? (I thought of perhaps giving every neuron a y value, with the neurons as the corresponding x values.)

I tried something like this (a combination of things I found on the internet):

import matplotlib.pyplot as plt
from scipy.io import wavfile as wav
from scipy.fftpack import fft
import numpy as np
import wave
import sys

spf = wave.open('AAA.wav', 'r')

# Extract raw audio from the wav file
signal = spf.readframes(-1)
signal = np.frombuffer(signal, dtype='int16')
fs = spf.getframerate()
fft_out = fft(signal)

Time = np.linspace(0, len(signal)/fs, num=len(signal))

plt.title('Signal Wave...')
plt.plot(Time, fft_out)
plt.show()

but considering that my input into the mic was 'aaaaaa', it does not seem right.


First of all, your question would be a better fit for the Data Science Stack Exchange site. Consider asking your question there next time.

For plotting the Fourier transform, you need the absolute value (modulus) of the FFT. (Unless, in the particular case where the signal is even and real, the FFT is also even and real.)

For your inputs, just try to give to the network the amplitude of the fft, for all frequencies or maybe the first frequencies, because in general amplitude decrease fast in fft (or frequencies you think are worth to give after seeing the plot). Maybe it's not a good idea to use fft, but I let you try it. Maybe you could find someone who already tried to make classification with fft. If you have struggle or you're stuck, try to ask another question on the site I linked before.