Hot questions on using neural networks with big data


I am trying to train a neural network on a big training set.

The inputs consist of approximately 4 million columns and 128 rows, and the targets consist of 62 rows.

hiddenLayerSize is 128.

The script is as follows:

net = patternnet(hiddenLayerSize);
net.inputs{1}.processFcns = {'removeconstantrows','mapminmax'};
net.outputs{2}.processFcns = {'removeconstantrows','mapminmax'};
net.divideFcn = 'dividerand';  % Divide data randomly
net.divideMode = 'sample';  % Divide up every sample
net.divideParam.trainRatio = 70/100;
net.divideParam.valRatio = 15/100;
net.divideParam.testRatio = 15/100;
net.trainFcn = 'trainbfg';
net.performFcn = 'mse';  % Mean squared error
net.plotFcns = {'plotperform','plottrainstate','ploterrhist', ...
  'plotregression', 'plotfit'};
net.trainParam.showCommandLine = 1;
[net,tr] = train(net,inputs,targets, 'showResources', 'yes', 'reduction', 10);

When train() starts executing, MATLAB hangs, Windows becomes unresponsive or very slow, the disk thrashes with heavy swapping, and nothing else happens for dozens of minutes.

The computer has 12 GB of RAM and runs 64-bit Windows; MATLAB is also 64-bit. Memory usage in the process manager varies during the operation.

What else can be done, apart from reducing the training set?

If I reduce the training set, to what size? How can I estimate the right size other than by trial and error?

Why doesn't the function display anything?


It is fairly hard to diagnose such problems remotely, to the point that I am not even sure anything anyone answers here will actually help. Moreover, you are asking several questions in one, so I will take it step by step. Ultimately I will try to give you a better understanding of the memory consumption of your script.

Memory consumption
Dataset Size and Copies

Starting from the size of the dataset you are loading into memory, and assuming that each entry is a double-precision floating-point number, your training input set requires (4e6 * 128 * 8) bytes of memory, which works out to roughly 3.81 GB. If I understand correctly, your array of targets contains (4e6 * 62) entries, which becomes (4e6 * 62 * 8) bytes, roughly 1.85 GB. So even before running the network training you are consuming circa 5.7 GB of memory.
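The arithmetic above can be sketched in a few lines of Python (the 8-byte double assumption matches MATLAB's default numeric type):

```python
# Back-of-the-envelope memory footprint of a dense double-precision matrix.
BYTES_PER_DOUBLE = 8
GIB = 2 ** 30

def dataset_gib(n_rows, n_cols):
    """Size in GiB of a dense matrix of doubles with the given shape."""
    return n_rows * n_cols * BYTES_PER_DOUBLE / GIB

inputs_gib = dataset_gib(4_000_000, 128)   # training inputs
targets_gib = dataset_gib(4_000_000, 62)   # training targets

print(f"inputs:  {inputs_gib:.2f} GiB")    # ~3.81
print(f"targets: {targets_gib:.2f} GiB")   # ~1.85
print(f"total:   {inputs_gib + targets_gib:.2f} GiB")
```

Running this gives roughly 3.81 + 1.85 = 5.66 GiB before training even begins.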

Now, yes, MATLAB uses lazy copying, so an assignment like:

training = zeros(4e6, 128);
copy1 = training;
copy2 = training;

will not require new memory. However, any slicing operation:

training = zeros(4e6, 128);
part1 = training(1:1000, :);
part1 = training(1001:2000, :);

will indeed allocate more memory. Hence when selecting your training, validation and testing subsets:

net.divideParam.trainRatio = 70/100;
net.divideParam.valRatio = 15/100;
net.divideParam.testRatio = 15/100;

internally the train() function could potentially be re-allocating the same amount of memory twice. Your grand total would now be around 10 GB. If you now consider that your operating system is running as well, along with a bunch of other applications, it is easy to understand why everything suddenly slows down. I might be telling you something obvious here, but: your dataset is very large.
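To get a feel for the sizes of those subsets, here is a toy random 70/15/15 split in Python (an illustrative sketch, not MATLAB's actual dividerand implementation):

```python
import random

def random_split(n_samples, train_ratio=0.70, val_ratio=0.15, test_ratio=0.15, seed=0):
    """Toy random 70/15/15 sample split, mimicking the idea of 'dividerand'."""
    assert abs(train_ratio + val_ratio + test_ratio - 1.0) < 1e-9
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_train = round(n_samples * train_ratio)
    n_val = round(n_samples * val_ratio)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = random_split(4_000_000)
print(len(train_idx), len(val_idx), len(test_idx))  # 2800000 600000 600000
```

With 4e6 samples the training subset alone holds 2.8 million columns, which is why even a single internal copy hurts.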

Profiling Helps

Now, whilst I am pretty sure about my memory-consumption calculation, I am not sure whether my assumptions about the internal copies are valid. The bottom line is that I don't know the inner workings of the train() function that well. This is why I urge you to test it with MATLAB's very own profiler, which will give you a much better understanding of the function calls and memory consumption.

Reducing Memory Usage

What can be done to reduce memory consumption? This is probably the question that has been haunting programmers since the dawn of time. :) Once again, it is hard to give a single answer, as the solution often depends on the task, the problem and the tools at hand. MATLAB has a, let's give it the benefit of the doubt, informative page on how to reduce memory usage. Very often, though, the problem lies in the size of the data to be loaded into memory.

I, for one, would of course start by reducing the size of your dataset. Do you really need all 4e6 * 128 data points? If you do, then you might consider investing in dedicated solutions, such as high-performance servers, to perform your computation. If not, then you, and only you, can look at your dataset and start analysing which features might be unnecessary, to cut down the rows, and, most importantly, which samples might be unnecessary, to cut down the columns.

Being optimistic

On a side note, you did not complain about any OutOfMemory errors from MATLAB, which could be a good sign. Maybe your machine is simply hanging because the computation is THAT intensive. And this too is a reasonable assumption, as you are creating a network with a hidden layer of 128 neurons and 62 outputs, and running several epochs of training, as you should be.

Kill The JVM

What you can do to put less load on the machine is to run MATLAB without the Java Virtual Machine (JVM). This ensures that MATLAB itself requires less memory to run. The JVM can be disabled by starting MATLAB with:

matlab -nojvm

This works if you do not need to display any graphics, as MATLAB will run in a console-like environment.


I have a question about neural networks.

Let's say I have 60 training samples, 20 validation samples, and 20 test samples. For each epoch, I run through the 60 training samples, adjusting the weights on each sample, and I also calculate the error on each validation sample.

As I understand it, weight updates happen only on the training set (not the validation set).

But I heard that keeping the validation set separate from the training set is meant to avoid overfitting.

So my question is:

If the validation set doesn't cause any weight updates in the neural network, how can it help the network avoid overfitting?


As you say, it is not used to update the weights of the neural network, but it is used to monitor the progress of training. The first step in preventing overfitting is detecting it, and a validation set provides an independent measure of how well the network generalizes beyond the training set.

So, for example, you can use the validation set to decide when to stop training (before the network starts to overfit). If you do this, just remember to use yet another set (a test set) to produce the final evaluation metrics.
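The idea is usually called early stopping. Here is a minimal, library-agnostic sketch of the stopping rule: halt once the validation loss has failed to improve for a few consecutive epochs, and keep the weights from the best epoch.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the best epoch: training halts once the validation loss
    has failed to improve for `patience` consecutive epochs.
    (A generic sketch of validation-based early stopping.)"""
    best, best_epoch, bad = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, bad = loss, epoch, 0
        else:
            bad += 1
            if bad >= patience:
                break
    return best_epoch

# Validation loss falls, then rises as the network starts to overfit:
losses = [0.9, 0.7, 0.55, 0.5, 0.52, 0.6, 0.7]
print(early_stopping_epoch(losses))  # 3
```

In a real training loop you would also snapshot the weights at the best epoch and restore them when stopping.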


I have done a project using Django with a MySQL database to collect sales data. I want to make some predictions from the data, like a demand forecast, using an ANN/SVM.

Is it possible to take the MySQL database as input directly, or should I convert it to CSV?


If your server/machine has enough RAM, you can read the data directly from MySQL into a pandas DataFrame and then feed that DataFrame to scikit-learn/Keras/TensorFlow.


from sqlalchemy import create_engine
import pymysql  # MySQL driver used by SQLAlchemy under the hood
import pandas as pd

# Connection string: mysql+pymysql://<user>:<password>@<host>/<database>
db_connection = 'mysql+pymysql://mysql_user:mysql_password@mysql_host/mysql_db'
conn = create_engine(db_connection)

# Pull the whole table into a DataFrame
df = pd.read_sql("select * from tab_name", conn)
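The same pattern can be demonstrated end to end with SQLite, which needs no server, so you can try it immediately; with MySQL you would simply pass the SQLAlchemy engine instead of the SQLite connection (table and column names here are made up for the example):

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (item TEXT, qty INTEGER);
    INSERT INTO sales VALUES ('apple', 3), ('pear', 5);
    """
)

# Read the table straight into a DataFrame, no CSV step needed.
df = pd.read_sql("SELECT * FROM sales", conn)
print(df.shape)  # (2, 2)
```

If the table is too large for RAM, pd.read_sql also accepts a chunksize argument and then yields DataFrames piece by piece.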


I'm working on a project which tries to "learn" a relationship between a set of around 10k complex-valued input images (amplitude/phase; real/imag) and a real-valued output vector with 48 entries. This output vector is not a set of labels, but a set of numbers representing the best parameters to optimize the visual impression of the given complex-valued image. These parameters are generated by an algorithm. It is possible that there is some noise in the data (coming both from the images and from the algorithm which generates the parameter vector).

Those parameters more or less depend on the FFT (fast Fourier transform) of the input image. Therefore I was thinking of feeding the network (5 hidden layers, but the architecture shouldn't matter right now) with a 1D-reshaped version of FFT(complexImage); some pseudocode:

     % discretize spectrum
     obj_ft = fftshift(fft2(object));

     obj_real_2d = real(obj_ft);
     obj_imag_2d = imag(obj_ft);

     % convert the 2D matrices into 1D rows
     obj_real_1d = reshape(obj_real_2d, 1, []);
     obj_imag_1d = reshape(obj_imag_2d, 1, []);

     % concatenate real and imaginary parts into one 1D feature row
     obj_complx_1d(index, :) = [obj_real_1d obj_imag_1d];

     opt_param_1D(index, :) = get_opt_param(object);

I was wondering if there is a better approach for feeding complex-valued images into a deep network. I'd like to avoid complex-valued gradients, because they shouldn't really be necessary here. I "just" try to find a "black box" which outputs the optimized parameters when given a new image.

Tensorflow gets the input: obj_complx_1d and output-vector opt_param_1D for training.
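The pipeline above can be sketched in Python/NumPy, which is what a TensorFlow input pipeline would consume (a translation of the pseudocode, with a made-up random image as input):

```python
import numpy as np

def complex_image_to_features(img):
    """Flatten the shifted 2D FFT of a complex image into one real-valued
    feature row: [real parts, imaginary parts]."""
    obj_ft = np.fft.fftshift(np.fft.fft2(img))
    return np.concatenate([obj_ft.real.ravel(), obj_ft.imag.ravel()])

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32)) + 1j * rng.standard_normal((32, 32))
features = complex_image_to_features(img)
print(features.shape, features.dtype)  # (2048,) float64
```

The resulting vector is purely real, so ordinary real-valued gradients are enough for training.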


There are several ways you can treat complex signals as input.

Use a transform to turn them into 'images'. Short-time Fourier transforms are used to make spectrograms, which are 2D: the x-axis is time, the y-axis frequency. If you have complex input data, you may choose to simply look at the magnitude spectrum, or at the power spectral density, of your transformed data.

Something else that I've seen in practice is to treat the in-phase and quadrature (real/imaginary) channels separately in the early layers of the network, and operate across both in the higher layers. In the early layers, your network will learn the characteristics of each channel; in the higher layers, it will learn the relationship between the I/Q channels.
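The I/Q representation itself is just a channel-stacking step, sketched here in NumPy (an illustrative helper, not from any specific paper):

```python
import numpy as np

def to_iq_channels(signal):
    """Stack a complex signal as two real channels (in-phase, quadrature),
    ready for a network that processes I and Q separately in early layers."""
    return np.stack([signal.real, signal.imag], axis=0)

sig = np.exp(1j * np.linspace(0, 2 * np.pi, 128))  # toy complex signal
iq = to_iq_channels(sig)
print(iq.shape)  # (2, 128)
```

A 2D convolutional layer can then treat the result as a 2 x N "image", or two parallel 1D branches can process the rows independently.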

This group does a lot of work with complex signals and neural nets; in particular, check out 'Convolutional Radio Modulation Recognition Networks'.