Hot questions for using neural networks in Octave

Question:

I created an Octave script for training a neural network with 1 hidden layer using backpropagation, but it cannot seem to fit the XOR function.

  • x: input, a 4x2 matrix [0 0; 0 1; 1 0; 1 1]
  • y: output, a 4x1 matrix [0; 1; 1; 0]
  • theta: hidden / output layer weights
  • z: weighted sums
  • a: activations (the activation function applied to the weighted sums)
  • m: sample count (4 here)

My weights are initialized as follows:

epsilon_init = 0.12;
theta1 = rand(hiddenCount, inputCount + 1) * 2 * epsilon_init * epsilon_init;
theta2 = rand(outputCount, hiddenCount + 1) * 2 * epsilon_init * epsilon_init;

Feed forward

a1 = x;                                   % input layer activations
a1_with_bias = [ones(m, 1) a1];           % prepend bias column
z2 = a1_with_bias * theta1';              % hidden layer weighted sums
a2 = sigmoid(z2);                         % hidden layer activations
a2_with_bias = [ones(size(a2, 1), 1) a2]; % prepend bias column
z3 = a2_with_bias * theta2';              % output layer weighted sums
a3 = sigmoid(z3);                         % network output

Then I compute the logistic cost function

j = -sum((y .* log(a3) + (1 - y) .* log(1 - a3))(:)) / m;

Back propagation

delta2 = (a3 - y);                        % output layer error
gradient2 = delta2' * a2_with_bias / m;   % gradient for theta2

delta1 = (delta2 * theta2(:, 2:end)) .* sigmoidGradient(z2);  % hidden layer error (bias column dropped)
gradient1 = delta1' * a1_with_bias / m;   % gradient for theta1

The gradients were verified to be correct using gradient checking.
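
The check compares the analytic gradients against numerical estimates, roughly like this (a sketch; computeCost is a hypothetical helper that runs the forward pass above and returns the cost j):

% Numerical gradient check for theta1 (theta2 is analogous); a sketch.
epsCheck = 1e-4;
numGrad1 = zeros(size(theta1));
for i = 1:numel(theta1)
  thetaPlus  = theta1; thetaPlus(i)  += epsCheck;
  thetaMinus = theta1; thetaMinus(i) -= epsCheck;
  numGrad1(i) = (computeCost(thetaPlus, theta2) ...
                 - computeCost(thetaMinus, theta2)) / (2 * epsCheck);
end
% max(abs(numGrad1(:) - gradient1(:))) comes out very small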

I then use these gradients to find the optimal values for theta with gradient descent, though using Octave's fminunc function yields the same result. The cost function converges to ln(2) (or 0.5 for a squared-error cost function) because the network outputs 0.5 for all four inputs, no matter how many hidden units I use.
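
The descent step itself is just the usual update (a sketch; alpha and numIterations are assumed values, not my exact settings):

% Plain gradient descent on theta1 / theta2 (sketch).
alpha = 0.5;              % learning rate (assumed)
numIterations = 10000;    % (assumed)
for iter = 1:numIterations
  % ... feed forward, cost and back propagation as above ...
  theta1 = theta1 - alpha * gradient1;
  theta2 = theta2 - alpha * gradient2;
end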

Does anyone know where my mistake is?


Answer:

Start with a larger range when initialising the weights, and include negative values. As written, your weights are all small positive numbers (in [0, 2 * epsilon_init^2]), and it is difficult for gradient descent to "cross over" from positive to negative weights. You probably meant to write * 2 * epsilon_init - epsilon_init; instead of * 2 * epsilon_init * epsilon_init;. Fixing that may well fix your code.
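
In other words, the intended uniform initialisation in [-epsilon_init, epsilon_init] would be:

epsilon_init = 0.12;
theta1 = rand(hiddenCount, inputCount + 1) * 2 * epsilon_init - epsilon_init;
theta2 = rand(outputCount, hiddenCount + 1) * 2 * epsilon_init - epsilon_init;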

As a rule of thumb, I would do something like this:

theta1 = ( 0.5 * sqrt ( 6 / ( inputCount + hiddenCount) ) * 
    randn( hiddenCount, inputCount + 1 ) );
theta2 = ( 0.5 * sqrt ( 6 / ( hiddenCount + outputCount ) ) * 
    randn( outputCount, hiddenCount + 1 ) );

The multiplier is just some advice I picked up on a course; I believe it is backed by a research paper that compared a few different approaches.

In addition, you may need a lot of iterations to learn XOR if you run basic gradient descent. I suggest running for at least 10,000 iterations before declaring that learning isn't working. The fminunc function should do better than that.

I ran your code with 2 hidden neurons, basic gradient descent and the above initialisations, and it learned XOR correctly. I also tried adding momentum terms, and the learning was faster and more reliable, so I suggest you take a look at that next.
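
A momentum term can be bolted onto the plain update roughly like this (a sketch; alpha, mu and numIterations are assumed values):

% Gradient descent with momentum (sketch).
alpha = 0.5;              % learning rate (assumed)
mu    = 0.9;              % momentum coefficient (assumed)
numIterations = 10000;    % (assumed)
velocity1 = zeros(size(theta1));
velocity2 = zeros(size(theta2));
for iter = 1:numIterations
  % ... feed forward and back propagation to get gradient1, gradient2 ...
  velocity1 = mu * velocity1 - alpha * gradient1;
  velocity2 = mu * velocity2 - alpha * gradient2;
  theta1 = theta1 + velocity1;
  theta2 = theta2 + velocity2;
end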

Question:

I am trying to find the optimal parameters of my neural network model, implemented in Octave. The model is used for binary classification, with 122 features (inputs) and 25 hidden units (1 hidden layer). For this I have 4 matrices/vectors:

size(X_Train): 125973 x 122
size(Y_Train): 125973 x 1
size(X_Test): 22543 x 122
size(Y_test): 22543 x 1

I have used 20% of the training set to generate a validation set (XVal and YVal):

size(X): 100778 x 122
size(Y): 100778 x 1
size(XVal): 25195 x 122
size(YVal): 25195 x 1
size(X_Test): 22543 x 122
size(Y_test): 22543 x 1

The goal is to generate the learning curves of the NN. I have learned (the hard way xD) that this is very time-consuming because I used the full XVal and X for this.

I don't know if there is an alternative solution for this. I am thinking of reducing the size of the training matrix X (to 5000 samples, for example), but I don't know if I can do that, or whether the results will be biased since I'd only be using a portion of the training set.

Bests,


Answer:

The total number of parameters above is around 3k (122*25 + 25*1), which is not huge. Since the number of examples is large, you might want to use stochastic gradient descent or mini-batches instead of full-batch gradient descent.

Note that MATLAB and Octave are slow in general, especially with loops. Write code that uses matrix (vectorised) operations rather than loops so that the speed stays manageable in MATLAB/Octave.
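
A mini-batch loop over your 100,778 training rows could look roughly like this (a sketch; computeGradients is a hypothetical, fully vectorised function that runs forward and backward propagation on one batch and returns the gradients):

% Mini-batch gradient descent (sketch).
batchSize = 256;   % (assumed)
alpha     = 0.1;   % learning rate (assumed)
numEpochs = 50;    % passes over the data (assumed)
m = size(X, 1);
for epoch = 1:numEpochs
  idx = randperm(m);                      % shuffle once per epoch
  for s = 1:batchSize:m
    batch = idx(s:min(s + batchSize - 1, m));
    [grad1, grad2] = computeGradients(Theta1, Theta2, X(batch, :), Y(batch));
    Theta1 = Theta1 - alpha * grad1;
    Theta2 = Theta2 - alpha * grad2;
  end
end

Each update then touches only batchSize rows, and the inner work stays in matrix operations.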

Question:

I have a training set of 89 images: 6 different domino tiles plus one "control" group of baby photos, divided over 7 groups, so y has 7 classes. Each image is 100x100 and black and white, giving 10,000 input features per image.

I am using the one-hidden-layer neural network code from Andrew Ng's Coursera course, written in Octave. It has been slightly modified.

I first tried this with 3 different groups (two domino tiles, one baby) and it managed to get near-100% accuracy. I have now increased it to 7 different image groups. The accuracy has dropped WAY down and it is hardly getting anything right but the baby photos (which differ greatly from the domino tiles).

I have tried 10 different lambda values and 10 different hidden-unit counts between 5 and 20, as well as different numbers of iterations, and plotted them against cost and accuracy to find the best fit.

I also tried feature normalization (commented out in the code below) but it didn't help.

This is the code I am using:

% Initialization
clear ; close all; clc; more off;
pkg load image;

fprintf('Running Domino Identifier ... \n');

%iteration_vector = [100, 300, 1000, 3000, 10000, 30000];
%accuracies = [];
%costs = [];

%for iterations_i = 1:length(iteration_vector)

  # INPUTS
  input_layer_size  = 10000;  % 100x100 input images (domino tiles / baby photos)
  hidden_layer_size = 50;   % Hidden units
  num_labels = 7;          % Number of different outputs
  iterations = 100000; % Number of iterations during training
  lambda = 0.13;
  %hidden_layer_size = hidden_layers(hidden_layers_i);
  %lambda = lambdas(lambda_i)
  %iterations = %iteration_vector(iterations_i)

  [X,y] = loadTrainingData(num_labels);
  %[X_norm, mu, sigma] = featureNormalize(X_unnormed);
  %X = X_norm;

  initial_Theta1 = randInitializeWeights(input_layer_size, hidden_layer_size);
  initial_Theta2 = randInitializeWeights(hidden_layer_size, num_labels);
  initial_nn_params = [initial_Theta1(:) ; initial_Theta2(:)];

  [J grad] = nnCostFunction(initial_nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, lambda);

  fprintf('\nTraining Neural Network... \n')

  %  After you have completed the assignment, change the MaxIter to a larger
  %  value to see how more training helps.
  options = optimset('MaxIter', iterations);

  % Create "short hand" for the cost function to be minimized
  costFunction = @(p) nnCostFunction(p, input_layer_size, hidden_layer_size, num_labels, X, y, lambda);

  % Now, costFunction is a function that takes in only one argument (the
  % neural network parameters)
  [nn_params, cost] = fmincg(costFunction, initial_nn_params, options);

  % Obtain Theta1 and Theta2 back from nn_params
  Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                   hidden_layer_size, (input_layer_size + 1));

  Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                   num_labels, (hidden_layer_size + 1));

  displayData(Theta1(:, 2:end));
  [predictionData, images] = loadTrainingData(num_labels);
  [h2_training, pred_training] = predict(Theta1, Theta2, predictionData);
  fprintf('\nTraining Accuracy: %f\n', mean(double(pred_training' == y)) * 100);

  %if length(accuracies) > 0
  %  accuracies = [accuracies; mean(double(pred_training' == y))];
  %else
  % accuracies = [mean(double(pred_training' == y))];
  %end

  %last_cost = cost(length(cost));
  %if length(costs) > 0
  %  costs = [costs; last_cost];
  %else
  % costs = [last_cost];
  %end
%endfor % Testing samples

fprintf('Loading prediction images');
[predictionData, images] = loadPredictionData();
[h2, pred] = predict(Theta1, Theta2, predictionData)

for i = 1:length(pred)  
  figure;
  displayData(predictionData(i, :));
  title (strcat(translateIndexToTile(pred(i)), " Certainty:", num2str(max(h2(i, :))*100))); 
  pause;
endfor
%y = provideAnswers(im_vector);

My questions are now:

  1. Are my numbers "off" in terms of a great difference between X and the rest?

  2. What should I do to improve this Neural Network?

  3. If I do feature normalization, do I need to multiply the numbers back to the 0-255 range again somewhere?


Answer:

What should I do to improve this Neural Network?

Use a Convolutional Neural Network (CNN) with multiple layers (e.g., 5 layers). For vision problems, CNNs outperform MLPs by wide margins. Here you are using an MLP with a single hidden layer, and it is plausible that such a network will not perform well on an image problem with 7 classes. Another concern is the amount of training data you have: generally you want at least hundreds of samples per class.

If I do feature normalization, do I need to multiply the numbers back to the 0-255 range again somewhere?

Generally, not for classification. Normalization can be viewed as a preprocessing step. However, if you are working on a problem like image reconstruction, then you would need to convert back to the original domain at the end.
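
Concretely, using the featureNormalize helper from your commented-out code, the statistics learned on the training set are simply reused on new data; nothing is mapped back to 0-255 for classification (a sketch, assuming featureNormalize returns the per-feature mean and standard deviation):

% Normalisation as preprocessing (sketch).
[X_norm, mu, sigma] = featureNormalize(X);
% ... train on X_norm as before ...
% At prediction time, apply the *training* mu and sigma to the new images:
predictionData_norm = (predictionData - mu) ./ sigma;
[h2, pred] = predict(Theta1, Theta2, predictionData_norm);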

Question:

I am trying to implement a neural network with 3 hidden neurons.

The code causing me trouble is:

  bias = [-1 -1 -1];

  % Output layer
  x3_1 = bias(1,4)*weights(4,1) + x2(1)*weights(4,2) + x2(2)*weights(4,3) + x2(3)*weights(4,4);
  out(j) = sigmoid(x3_1);

I am getting the error:

A(I,J): column index out of bounds; value 4 out of bound 3
error: called from '/home/8.m' in file /home/8.m near line 45, column 12


Answer:

You are trying to access bias(1,4), but bias is initialized to [-1 -1 -1], which has only three columns, so index 4 is out of bounds. It also looks like you are missing a step where you update your bias values during each iteration, so they will always remain [-1 -1 -1].
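
To illustrate (a minimal sketch; extending bias to four entries is only one possible intent):

bias = [-1 -1 -1];
% bias(1,4)            % error: bias is 1x3, so column 4 is out of bounds
bias = [-1 -1 -1 -1];  % hypothetical fix: one bias entry per unit that needs it
x3_1 = bias(1,4)*weights(4,1) + x2(1)*weights(4,2) + x2(2)*weights(4,3) + x2(3)*weights(4,4);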